Building from Parquet¶
Tutorials 01-02 build Collections from a STAC API via `build()`. But Rasteret
works with any Parquet file that has GeoTIFF URLs; no STAC API needed.
`build_from_table()` reads Parquet directly from S3, GCS, or local paths.
It validates the schema, derives bounding boxes, and produces a standard
Collection backed by Arrow.
This notebook uses the Maxar Open Data catalog from Source Cooperative: 14,979 sub-meter satellite scenes across 23 disaster events, published as STAC GeoParquet with fully public COGs.
1. Build a Collection from a remote Parquet¶
`build_from_table()` reads the Parquet from S3 via PyArrow (using obstore
under the hood), validates the four required columns (`id`, `datetime`,
`geometry`, `assets`), and produces a Collection.
Source Cooperative data lives in a public S3 bucket. Set
`AWS_NO_SIGN_REQUEST` so PyArrow skips credential lookup.
```python
import os

os.environ["AWS_NO_SIGN_REQUEST"] = "YES"

import rasteret

SOURCE_COOP_URI = (
    "s3://us-west-2.opendata.source.coop"
    "/maxar/maxar-opendata/maxar-opendata.parquet"
)

collection = rasteret.build_from_table(SOURCE_COOP_URI, name="maxar-opendata")
print(f"Collection: {collection.name}")
print(f"Scenes: {collection.dataset.count_rows()}")
print(f"Columns: {collection.dataset.schema.names[:8]}...")
```
2. Explore the Collection with DuckDB¶
The Collection is backed by Arrow. DuckDB reads Arrow tables with zero copy - pass the Python variable directly, no file I/O.
Requires the `examples` extra: `pip install rasteret[examples]`
```python
import duckdb

# Arrow table from the Collection - this is the variable DuckDB reads
maxar = collection.dataset.to_table()

con = duckdb.connect()

# What disaster events are in this catalog?
con.sql("""
    SELECT
        replace(
            split_part(split_part(assets.visual.href, '/events/', 2), '/ard/', 1),
            '-', ' '
        ) AS event,
        count(*) AS scenes,
        min(datetime)::date AS earliest,
        max(datetime)::date AS latest,
        round(avg(gsd), 2) AS avg_gsd_m
    FROM maxar
    WHERE assets.visual IS NOT NULL
    GROUP BY event
    ORDER BY scenes DESC
""").show()
```
3. Filter¶
Rasteret's `subset()` and `where()` are convenience methods for common
filters. You can also filter the Arrow table directly with DuckDB,
PyArrow, or pandas - whichever fits your workflow.
```python
from datetime import datetime, timezone

import pyarrow.dataset as ds

# --- Option A: Rasteret convenience filter ---
aug_scenes = collection.where(
    (ds.field("datetime") >= datetime(2023, 8, 9, tzinfo=timezone.utc))
    & (ds.field("datetime") < datetime(2023, 8, 13, tzinfo=timezone.utc))
)
print(f"Rasteret where(): {aug_scenes.dataset.count_rows()} scenes")

# --- Option B: DuckDB on the Arrow table ---
result = con.sql("""
    SELECT count(*) AS scenes
    FROM maxar
    WHERE datetime >= '2023-08-09' AND datetime < '2023-08-13'
""").fetchone()
print(f"DuckDB filter: {result[0]} scenes")

# --- Option C: PyArrow compute ---
import pyarrow.compute as pc

mask = pc.and_(
    pc.greater_equal(
        maxar.column("datetime"),
        pc.assume_timezone(pc.strptime("2023-08-09", "%Y-%m-%d", "us"), timezone="UTC"),
    ),
    pc.less(
        maxar.column("datetime"),
        pc.assume_timezone(pc.strptime("2023-08-13", "%Y-%m-%d", "us"), timezone="UTC"),
    ),
)
print(f"PyArrow filter: {pc.sum(mask).as_py()} scenes")

print("\nAll three query the same Arrow data. Use whichever fits your workflow.")
```
4. Export and share¶
Export the Collection so a teammate can load it; no S3 access or Source Cooperative account needed on their end.
```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    export_path = Path(tmpdir) / "maxar_collection"
    collection.export(export_path)

    # Teammate loads it in one line:
    reloaded = rasteret.load(export_path)
    print(f"Loaded: {reloaded.name}, {reloaded.dataset.count_rows()} scenes")
```
5. Column mapping (non-standard schemas)¶
Not every Parquet file uses STAC column names. If your source uses different
names, provide a `column_map`:
```python
collection = rasteret.build_from_table(
    "s3://my-bucket/my-catalog.parquet",
    name="custom",
    column_map={"scene_id": "id", "timestamp": "datetime"},
)
```
Rasteret requires four columns: `id`, `datetime`, `geometry`, `assets`.
Everything else is passed through as-is. See the
Schema Contract for details.
Summary¶
| Step | What happens |
|---|---|
| `build_from_table(s3_uri)` | Reads Parquet from S3/GCS/local, validates schema, creates Collection |
| `collection.dataset.to_table()` | Arrow table - pass directly to DuckDB, PyArrow, pandas |
| `collection.where()` / `subset()` | Convenience filters (Arrow pushdown) |
| `collection.export()` -> `rasteret.load()` | Share a portable Collection |
When to use which build function:
| Situation | Use |
|---|---|
| Dataset in the catalog (Sentinel-2, Landsat, NAIP, ...) | `rasteret.build()` |
| Custom STAC API not in the catalog | `rasteret.build_from_stac()` |
| Existing Parquet with GeoTIFF URLs (this notebook) | `rasteret.build_from_table()` |
| Someone shared a Collection with you | `rasteret.load()` |
Next: Parquet Filtering