Enriched Parquet for Reproducible Experiments
Rasteret's Parquet index is extensible. You can add columns (AOI polygons, train/val/test splits, labels, quality flags) and query them later with standard Arrow-compatible tools. Everything stays in one file, making experiments reproducible and shareable.
The pattern:
- Rasteret builds the index (scene metadata + COG header cache)
- You enrich it with experiment-specific columns
- Arrow tools filter the enriched Parquet (DuckDB, PyArrow, GeoPandas)
- Rasteret fetches COG pixels for the filtered subset
All data flows through Arrow tables. No conversion to Python lists.
1. Build and enrich
from pathlib import Path
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import shapely
from shapely.geometry import Polygon
import rasteret
# Build collection from STAC
collection = rasteret.build_from_stac(
    name="bangalore",
    stac_api="https://earth-search.aws.element84.com/v1",
    collection="sentinel-2-l2a",
    bbox=(77.5, 12.9, 77.7, 13.1),
    date_range=("2024-01-01", "2024-06-30"),
    workspace_dir=Path.home() / "rasteret_workspace",
)
# Get the Arrow table
enriched = collection.dataset.to_table()
n = enriched.num_rows
Add an AOI column
Store the AOI polygon as WKB binary. Parquet dictionary-encodes repeated values, so storing the same AOI on every row costs almost nothing.
aoi = Polygon([
    (77.55, 13.01), (77.58, 13.01),
    (77.58, 13.08), (77.55, 13.08),
    (77.55, 13.01),
])
# shapely -> WKB bytes -> Arrow binary column (no Python loop)
aoi_wkb = shapely.to_wkb(aoi)
enriched = enriched.append_column(
    "aoi", pa.array([aoi_wkb] * n, type=pa.binary())
)
Add splits and labels
rng = np.random.default_rng(42)
splits = rng.choice(["train", "val", "test"], size=n, p=[0.7, 0.15, 0.15])
enriched = enriched.append_column("split", pa.array(splits))
labels = rng.integers(0, 5, size=n)
enriched = enriched.append_column("label", pa.array(labels, type=pa.int32()))
Save
pq.write_table(enriched, "./experiment_v1.parquet")
collection = rasteret.load("./experiment_v1.parquet")
The enriched Parquet now contains scene metadata, COG tile cache, AOI geometry, splits, and labels in a single portable Collection.
2. Query with DuckDB
Reload the collection later and query with DuckDB. DuckDB reads Arrow tables directly with zero copy.
import duckdb
import rasteret
collection = rasteret.load("./experiment_v1.parquet")
enriched = collection.dataset.to_table()
con = duckdb.connect()
result = con.sql("""
SELECT DISTINCT aoi
FROM enriched
WHERE split = 'train' AND "eo:cloud_cover" < 15
""").fetch_arrow_table()
# Pass the Arrow WKB column directly to Rasteret (no Shapely needed)
aoi_col = result.column("aoi")
The Arrow column goes straight to Rasteret -- no conversion required:
train = collection.subset(split="train", cloud_cover_lt=15)
ds = train.get_xarray(geometries=aoi_col, bands=["B04", "B08"])
ndvi = (ds.B08 - ds.B04) / (ds.B08 + ds.B04)
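One caveat with the NDVI expression: if the bands arrive as unsigned integers (Sentinel-2 reflectance is commonly stored as `uint16`; the dtype Rasteret returns is not stated here, so this is an assumption), `B08 - B04` can wrap around wherever red exceeds near-infrared. The NumPy sketch below shows the wraparound on toy values and a float-based variant that also guards against a zero denominator. The `ndvi` function is a hypothetical helper, not a Rasteret API.

```python
import numpy as np


def ndvi(red, nir):
    """NDVI computed in float32, with NaN where nir + red is zero."""
    red = np.asarray(red, dtype="float32")
    nir = np.asarray(nir, dtype="float32")
    denom = nir + red
    out = np.full(denom.shape, np.nan, dtype="float32")
    # Only divide where the denominator is non-zero; other cells stay NaN
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out


# Toy pixel values standing in for B04 (red) and B08 (near-infrared)
red = np.array([1000, 3000, 0], dtype="uint16")
nir = np.array([3000, 1000, 0], dtype="uint16")

wrapped = nir - red      # unsigned subtraction wraps for the second pixel
result = ndvi(red, nir)  # float math keeps the sign
print(wrapped, result)
```

With an xarray dataset the same idea would be to cast first, for example `ds.B08.astype("float32")`, before applying the formula.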
More DuckDB queries
# Monthly breakdown by split
con.sql("""
SELECT split, month, count(*) AS scenes,
round(avg("eo:cloud_cover"), 1) AS avg_cloud
FROM enriched
GROUP BY split, month
ORDER BY split, month
""").show()
# Scenes per AOI (when using multiple AOIs)
con.sql("""
SELECT aoi, split, count(*) AS scenes
FROM enriched
GROUP BY aoi, split
""").show()
3. Query with PyArrow only
If you prefer no extra dependencies, PyArrow compute works on the same Arrow table:
import pyarrow.compute as pc
import rasteret
collection = rasteret.load("./experiment_v1.parquet")
enriched = collection.dataset.to_table()
# Filter: train split, low cloud
mask = pc.and_(
    pc.equal(enriched.column("split"), "train"),
    pc.less(enriched.column("eo:cloud_cover"), 15.0),
)
filtered = enriched.filter(mask)
# Deduplicate AOIs in Arrow, pass directly to Rasteret
unique_wkb = filtered.column("aoi").unique()
# Fetch -- Arrow WKB column goes directly, no Shapely conversion
train = collection.subset(split="train", cloud_cover_lt=15)
ds = train.get_xarray(geometries=unique_wkb, bands=["B04", "B03", "B02"])
4. Query with GeoPandas
GeoPandas reads WKB columns from Arrow and gives you spatial operations (intersection, buffer, distance) for free:
import geopandas as gpd
import rasteret
collection = rasteret.load("./experiment_v1.parquet")
enriched = collection.dataset.to_table()
gdf = gpd.GeoDataFrame(
    enriched.to_pandas(),
    geometry=gpd.GeoSeries.from_wkb(enriched.column("aoi").to_pandas()),
)
# Spatial query: scenes whose AOI intersects a new region
from shapely.geometry import box
new_region = box(77.55, 13.01, 77.60, 13.05)
hits = gdf[gdf.intersects(new_region)]
aois = list(hits.geometry.unique())
Multiple AOIs per experiment
When an experiment uses several AOIs, assign each scene to the AOI it belongs to. Parquet compresses repeated WKB values efficiently.
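import pyarrow as pa
import shapely
from shapely.geometry import Polygon
aoi_north = Polygon([(77.55, 13.05), (77.58, 13.05), (77.58, 13.08), (77.55, 13.08), (77.55, 13.05)])
aoi_south = Polygon([(77.55, 12.95), (77.58, 12.95), (77.58, 13.00), (77.55, 13.00), (77.55, 12.95)])
# Scene footprints from the existing geometry column
footprints = shapely.from_wkb(
    enriched.column("geometry").to_numpy(zero_copy_only=False)
)
# Encode each AOI once, outside the loop
north_wkb = shapely.to_wkb(aoi_north)
south_wkb = shapely.to_wkb(aoi_south)
# Assign AOI based on intersection (north wins if a footprint touches both)
aoi_assignments = []
for fp in footprints:
    if shapely.intersects(fp, aoi_north):
        aoi_assignments.append(north_wkb)
    elif shapely.intersects(fp, aoi_south):
        aoi_assignments.append(south_wkb)
    else:
        aoi_assignments.append(None)  # footprint overlaps neither AOI -> null
aoi_array = pa.array(aoi_assignments, type=pa.binary())
# Replace an existing "aoi" column (e.g. from step 1) so the name stays unique
if "aoi" in enriched.column_names:
    enriched = enriched.set_column(enriched.column_names.index("aoi"), "aoi", aoi_array)
else:
    enriched = enriched.append_column("aoi", aoi_array)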
Summary
| Step | Tool | Arrow-native? |
|---|---|---|
| Build index | rasteret.build_from_stac() | Arrow dataset output |
| Add columns | pa.Table.append_column() | Yes |
| Export | collection.export() | Yes |
| Query | DuckDB / pyarrow.compute / GeoPandas | Yes (zero-copy reads) |
| Fetch pixels | collection.get_xarray(geometries=arrow_col) | Yes -- Arrow WKB/GeoArrow direct |
Rasteret accepts Arrow columns, WKB bytes, Shapely geometries, bbox tuples, and GeoJSON dicts. Arrow columns are the zero-copy preferred path.
Rasteret builds the index. You own the enrichment. Arrow tools query it. Rasteret fetches the pixels.