
Build a Collection from Parquet Files

build_from_table() creates a Collection from any Parquet file that contains GeoTIFF URLs: STAC GeoParquet, Source Cooperative exports, or your own custom catalog. No STAC API needed.

PyArrow reads the file from local paths, s3://, or gs:// URIs. Rasteret validates the schema, derives per-record bounding boxes from the GeoParquet geometry column, and produces a standard Collection backed by Arrow.
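
Rasteret delegates that read to PyArrow's dataset API. A rough sketch of the step (the URI is a placeholder, and the format inference shown is PyArrow behavior, not rasteret-specific):

import pyarrow.dataset as ds

# PyArrow resolves the filesystem from the URI scheme (local, s3://, gs://)
dataset = ds.dataset("s3://my-bucket/my-catalog.parquet", format="parquet")
print(dataset.schema)  # the column names/types that get validated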


Supported sources

Source                                                          Example URI
Source Cooperative                                              s3://us-west-2.opendata.source.coop/maxar/maxar-opendata/maxar-opendata.parquet
STAC GeoParquet exports (Planetary Computer, Element84, ...)    s3://sentinel-cogs/sentinel-s2-l2a-cogs/items.parquet
Your own Parquet with GeoTIFF URLs                              s3://my-bucket/my-catalog.parquet or /local/path.parquet

Any Parquet file works as long as it has the four required columns: id, datetime, geometry, assets (where assets contains GeoTIFF/COG URLs). See the Schema Contract for details.
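
For instance, a minimal conforming file can be written with GeoPandas. This is a sketch: the records and hrefs are hypothetical, and the exact Arrow types expected for assets are defined by the Schema Contract:

import geopandas as gpd
import pandas as pd
from shapely.geometry import box

import rasteret

# Two hypothetical records; assets maps band key -> {"href": ...}
gdf = gpd.GeoDataFrame(
    {
        "id": ["scene-001", "scene-002"],
        "datetime": pd.to_datetime(["2023-06-01", "2023-06-11"], utc=True),
        "assets": [
            {"B04": {"href": "s3://my-bucket/scene-001/B04.tif"}},
            {"B04": {"href": "s3://my-bucket/scene-002/B04.tif"}},
        ],
        "geometry": [box(77.5, 13.0, 77.6, 13.1), box(77.6, 13.0, 77.7, 13.1)],
    },
    crs="EPSG:4326",
)
gdf.to_parquet("my-catalog.parquet")  # GeoParquet: WKB geometry + geo metadata

collection = rasteret.build_from_table("my-catalog.parquet", name="scratch-demo")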


Build from a remote Parquet

import os

os.environ["AWS_NO_SIGN_REQUEST"] = "YES"  # for public S3 buckets

import rasteret

# Source Cooperative - reads directly from S3 via PyArrow
collection = rasteret.build_from_table(
    "s3://us-west-2.opendata.source.coop/maxar/maxar-opendata/maxar-opendata.parquet",
    name="maxar-opendata",
)

print(f"Rows: {collection.dataset.count_rows()}")

When name is provided, the collection is cached to ~/rasteret_workspace/{name}_records/ and discoverable via rasteret collections list. Subsequent calls with the same name load the cached collection instantly. Pass force=True to rebuild, as shown below.
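
For example:

collection = rasteret.build_from_table(
    "s3://us-west-2.opendata.source.coop/maxar/maxar-opendata/maxar-opendata.parquet",
    name="maxar-opendata",
    force=True,  # ignore the cached records and rebuild from the source file
)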

See the build_from_table() API reference.

Note

build_from_table() uses PyArrow's dataset API internally, which supports local paths, s3://, and gs:// URIs. HTTPS URLs are not supported by PyArrow's scanner. Download HTTPS files locally first, or use an S3/GCS URI when available.
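
A minimal workaround sketch for an HTTPS-hosted file (the URL is a placeholder):

import urllib.request

import rasteret

# PyArrow's scanner cannot read https:// URIs, so fetch the file to disk first
local_path, _ = urllib.request.urlretrieve(
    "https://example.com/catalog.parquet", "catalog.parquet"
)
collection = rasteret.build_from_table(local_path, name="from-https")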


Filter during scan

For large remote files, pass filter_expr and columns to push filtering and projection down to the scan layer (only matching row groups are transferred):

import pyarrow.dataset as ds

collection = rasteret.build_from_table(
    "s3://my-bucket/stac-items.parquet",
    name="filtered",
    filter_expr=ds.field("eo:cloud_cover") < 20.0,
    columns=["id", "datetime", "geometry", "assets", "eo:cloud_cover"],
)
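
filter_expr is a standard pyarrow.dataset expression, so predicates compose with & and |. A sketch combining cloud cover with a date cutoff (field names follow the example above; this assumes the datetime column is a timestamp type):

import pyarrow.dataset as ds
from datetime import datetime

import rasteret

# Predicates compose with & (and) and | (or)
expr = (ds.field("eo:cloud_cover") < 20.0) & (
    ds.field("datetime") >= datetime(2023, 1, 1)
)

collection = rasteret.build_from_table(
    "s3://my-bucket/stac-items.parquet",
    name="filtered-2023",
    filter_expr=expr,
)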

Column mapping

If the source Parquet uses different column names, remap them:

collection = rasteret.build_from_table(
    "path/to/records.parquet",
    name="custom",
    column_map={"scene_id": "id", "timestamp": "datetime"},
)

Rasteret requires four columns: id, datetime, geometry, assets. Everything else is passed through as-is.

The assets column is a mapping from band key -> asset dict. Each asset dict must contain a resolvable href. For multi-sample planar-separate GeoTIFFs (multiple bands in one file), you can also include band_index to select which sample/band the asset refers to.
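
Concretely, a single record's assets value might look like this (hypothetical hrefs):

# Band key -> asset dict; each asset needs a resolvable href.
# band_index is optional: it selects which sample/band the asset
# refers to inside a planar-separate multi-band GeoTIFF.
assets = {
    "B04": {"href": "s3://my-bucket/scene-001/B04.tif"},
    "B08": {"href": "s3://my-bucket/scene-001/stack.tif", "band_index": 1},
}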


Enrich with COG headers

By default, build_from_table() imports the Parquet as-is. The resulting Collection works for filtering and metadata queries, but cannot do fast tiled reads (get_xarray(), get_gdf(), to_torchgeo_dataset()) because it has no cached tile offsets.

Pass enrich_cog=True to parse COG headers from the asset URLs during the build. This adds {band}_metadata struct columns to the Parquet index (tile offsets, byte counts, image dimensions, etc.) that enable Rasteret's accelerated reads:

collection = rasteret.build_from_table(
    "s3://my-bucket/my-catalog.parquet",
    name="my-enriched-collection",
    enrich_cog=True,
    band_codes=["B04", "B08"],       # which bands to enrich (optional)
    max_concurrent=300,               # concurrent header fetches
)

band_codes specifies which asset keys to parse. When omitted, Rasteret enriches every asset found in the assets column. For large datasets, specifying only the bands you need saves time and storage.
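
To see which bands were enriched, you can inspect the index schema (a quick check assuming collection.dataset exposes a PyArrow schema, as the count_rows() call earlier suggests):

# Enriched indexes gain {band}_metadata struct columns
meta_cols = [c for c in collection.dataset.schema.names
             if c.endswith("_metadata")]
print(meta_cols)  # e.g. ['B04_metadata', 'B08_metadata']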

When do I need enrichment?

Use case                                    enrich_cog needed?
Filtering by time, location, cloud cover    No
Exporting / sharing the Collection          No
get_xarray(), get_gdf() - reading pixels    Yes
to_torchgeo_dataset() - ML training         Yes

If you built from a STAC API via build() or build_from_stac(), enrichment already happened automatically. You only need enrich_cog=True when using build_from_table().

Once enriched, use the Collection like any other. geometries accepts Arrow arrays, bbox tuples, Shapely objects, or raw WKB - Arrow columns from GeoParquet are the fastest path (no Python-object conversion):

import pyarrow.parquet as pq

# Arrow geometry column - passed directly, no conversion
parcels = pq.read_table("field_boundaries.geoparquet", columns=["geometry"])
ds = collection.get_xarray(
    geometries=parcels.column("geometry"),
    bands=["B04", "B08"],
)

# Bbox tuple also works for a single area of interest
ds = collection.get_xarray(
    geometries=(77.55, 13.01, 77.58, 13.08),
    bands=["B04", "B08"],
)
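
Shapely objects work the same way; a sketch assuming a list of Shapely geometries is accepted for multiple areas of interest:

from shapely.geometry import box

# Shapely input goes through a Python-object conversion, so Arrow
# columns remain the faster path for large geometry sets
ds = collection.get_xarray(
    geometries=[box(77.55, 13.01, 77.58, 13.08)],
    bands=["B04", "B08"],
)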

For more on what the enrichment columns contain, see Schema Contract - Tier 2: COG acceleration columns.


CLI

rasteret collections import maxar-opendata \
  --record-table "s3://us-west-2.opendata.source.coop/maxar/maxar-opendata/maxar-opendata.parquet"
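
Then confirm the collection is registered:

rasteret collections list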

The full runnable script is at examples/build_collection_from_parquet.py.