Build From Parquet And Arrow Tables¶
Turn a Parquet or Arrow table of COG records into a read-ready Rasteret Collection.
For the full written guide, see Build from Parquet and Arrow Tables.
1. Build a Collection from a remote Parquet¶
build_from_table() reads the Parquet from S3 via PyArrow, validates the four required columns (id, datetime,
geometry, assets), and produces a Collection.
Source Cooperative data lives in a public S3 bucket. Set
AWS_NO_SIGN_REQUEST so PyArrow skips credential lookup.
import os
os.environ["AWS_NO_SIGN_REQUEST"] = "YES"
import rasteret
SOURCE_COOP_URI = (
"s3://us-west-2.opendata.source.coop" "/maxar/maxar-opendata/maxar-opendata.parquet"
)
collection = rasteret.build_from_table(SOURCE_COOP_URI, name="maxar-opendata")
print(f"Collection: {collection.name}")
print(f"Scenes: {collection.dataset.count_rows()}")
print(f"Columns: {collection.dataset.schema.names[:8]}...")
2. Explore the Collection with DuckDB¶
The Collection is backed by Arrow. DuckDB reads Arrow tables with zero copy - pass the Python variable directly, no file I/O.
Requires
duckdb:pip install duckdb
import duckdb
# Arrow table from the Collection - this is the variable DuckDB reads
maxar = collection.dataset.to_table()
con = duckdb.connect()
# What disaster events are in this catalog?
con.sql("""
SELECT
replace(
split_part(split_part(assets.visual.href, '/events/', 2), '/ard/', 1),
'-', ' '
) AS event,
count(*) AS scenes,
min(datetime)::date AS earliest,
max(datetime)::date AS latest,
round(avg(gsd), 2) AS avg_gsd_m
FROM maxar
WHERE assets.visual IS NOT NULL
GROUP BY event
ORDER BY scenes DESC
""").show()
3. Filter¶
Rasteret's subset() and where() are convenience methods for common
filters. You can also filter the Arrow table directly with DuckDB,
PyArrow, or pandas - whichever fits your workflow.
from datetime import datetime, timezone
import pyarrow.dataset as ds
# --- Option A: Rasteret convenience filter ---
aug_scenes = collection.where(
(ds.field("datetime") >= datetime(2023, 8, 9, tzinfo=timezone.utc))
& (ds.field("datetime") < datetime(2023, 8, 13, tzinfo=timezone.utc))
)
print(f"Rasteret where(): {aug_scenes.dataset.count_rows()} scenes")
# --- Option B: DuckDB on the Arrow table ---
result = con.sql("""
SELECT count(*) AS scenes
FROM maxar
WHERE datetime >= '2023-08-09' AND datetime < '2023-08-13'
""").fetchone()
print(f"DuckDB filter: {result[0]} scenes")
# --- Option C: PyArrow compute ---
import pyarrow.compute as pc
mask = pc.and_(
pc.greater_equal(
maxar.column("datetime"),
pc.assume_timezone(pc.strptime("2023-08-09", "%Y-%m-%d", "us"), timezone="UTC"),
),
pc.less(
maxar.column("datetime"),
pc.assume_timezone(pc.strptime("2023-08-13", "%Y-%m-%d", "us"), timezone="UTC"),
),
)
print(f"PyArrow filter: {pc.sum(mask).as_py()} scenes")
print("\nAll three query the same Arrow data. Use whichever fits your workflow.")
4. Export and share¶
Export the Collection so a teammate can load it; no S3 access or Source Cooperative account needed on their end.
import tempfile
from pathlib import Path
with tempfile.TemporaryDirectory() as tmpdir:
export_path = Path(tmpdir) / "maxar_collection"
collection.export(export_path)
# Teammate loads it in one line:
reloaded = rasteret.load(export_path)
print(f"Loaded: {reloaded.name}, {reloaded.dataset.count_rows()} scenes")
5. Column mapping (non-standard schemas)¶
Not every Parquet file uses STAC column names. If your source uses different
names, provide a column_map:
collection = rasteret.build_from_table(
"s3://my-bucket/my-catalog.parquet",
name="custom",
column_map={"scene_id": "id", "timestamp": "datetime"},
)
Rasteret requires four columns: id, datetime, geometry, assets.
Everything else is passed through as-is. See the
Schema Contract for details.
Summary¶
| Step | What happens |
|---|---|
build_from_table(s3_uri) |
Reads Parquet from S3/GCS/local, validates schema, creates Collection |
collection.dataset.to_table() |
Arrow table - pass directly to DuckDB, PyArrow, pandas |
collection.where() / subset() |
Convenience filters (Arrow pushdown) |
collection.export() -> rasteret.load() |
Share a portable Collection |
When to use which build function:
| Situation | Use |
|---|---|
| Dataset in the catalog (Sentinel-2, Landsat, NAIP, ...) | rasteret.build() |
| Custom STAC API not in the catalog | rasteret.build_from_stac() |
| Existing Parquet with GeoTIFF URLs (this notebook) | rasteret.build_from_table() |
| Someone shared a Collection with you | rasteret.load() |