Design Decisions¶
This page documents the design choices behind Rasteret's collection-first workflow.
The Architecture: Control Plane vs. Data Plane¶
We follow an index-first retrieval architecture:
- Control plane (The Index): A local Parquet table for discovery, filtering, and experiment metadata (splits/labels).
- Data plane (The Cloud): On-demand pixel streaming directly from source GeoTIFFs.
This separation lets you filter and enrich large raster inventories as tables, then touch pixel bytes only when a read path needs them.
Why Parquet indexes?¶
Traditional geospatial workflows often pay a cold-start cost every time they open remote COGs. GDAL/rasterio needs TIFF header metadata before it can know where tile bytes live. Rasteret caches that header metadata in the collection.
We chose Parquet (over Zarr, JSON, or SQLite) for its specialized ecosystem support:

- Queryable: Filter millions of scenes using Arrow predicate pushdown without reading the whole file.
- Ecosystem native: Your index is a standard table. You can open it in DuckDB, Polars, or pandas to perform complex joins between your ground-truth data and your imagery.
- Portable: A small Parquet index can describe much larger raster assets. Teams can share the index without copying the pixels.
Why a custom COG reader?¶
No existing TIFF reader exposes a Python API for injecting pre-cached tile offsets and byte counts. Because the tile layout is already cached in Parquet, Rasteret can skip the IFD fetch entirely, and that requires its own read path.
The reader is minimal: HTTP range requests, TIFF tag parsing, and tile decompression in a thread pool. It is intentionally not a general-purpose TIFF library.
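The core idea of that read path can be sketched in a few lines. This is not Rasteret's implementation; it simulates a remote COG with an in-memory byte blob, and the slice stands in for an HTTP Range request driven by a cached offset and byte count:

```python
import zlib

import numpy as np

# Simulated remote TIFF body: fake header/IFD bytes followed by one
# DEFLATE-compressed 16x16 uint16 tile.
tile = np.arange(256, dtype=np.uint16).reshape(16, 16)
compressed = zlib.compress(tile.tobytes())
blob = b"\x00" * 128 + compressed

# The Parquet index supplies offset and byte count per tile, so the reader
# never has to fetch and parse the IFD to locate tile bytes.
offset, byte_count = 128, len(compressed)
raw = blob[offset:offset + byte_count]  # stand-in for a single Range GET
decoded = np.frombuffer(zlib.decompress(raw), dtype=np.uint16).reshape(16, 16)
assert (decoded == tile).all()
```

In the real reader, many such range requests run concurrently and decompression happens in a thread pool, per the description above.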
Why COGs stay COGs¶
Rasteret does not convert COGs to Zarr, HDF5, or any other format. The data stays where it is, in the format it was published in. Concretely:
- GIS tools (QGIS, GDAL, rasterio) still work on the same files.
- No data duplication or format lock-in.
- Index can be rebuilt at any time from the original source.
The Parquet index is metadata only. It tells Rasteret where to read, not what to read.
No GDAL/rasterio in the hot path¶
Rasterio is a dependency and is used for reprojection and geometry masking. CRS and coordinate transforms use pyproj directly. CRS encoding in xarray output uses pyproj's CF conventions (WKT2, PROJJSON, GeoTransform), not rioxarray. The tile-read hot path uses Rasteret's own async HTTP + decompression pipeline.
Native dtype preservation¶
Rasteret preserves the native COG dtype through the entire read pipeline.
A uint16 Sentinel-2 tile comes back as uint16; an int8 embedding comes
back as int8. TorchGeo's RasterDataset converts images to float32 by
default; Rasteret keeps native dtypes unless you opt into casting/resampling.
When does promotion happen?
- NaN fill is explicit: Rasteret does not auto-promote integer arrays just to use NaN. If a caller requests NaN fill, the output dtype must be floating (or the call will fail when casting the fill value).
- Cross-CRS reprojection: rasterio's `reproject()` operates on the chosen output dtype; if a workflow uses NaN fill, float output is generally required.
In both cases, the promotion is explicit at the call site (not an implicit side effect of masking).
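The constraint behind this rule is a NumPy one, not anything Rasteret-specific: integer arrays cannot hold NaN, so the cast must be explicit. A minimal illustration:

```python
import numpy as np

tile = np.array([[100, 200], [300, 65535]], dtype=np.uint16)

# Integer dtypes cannot represent NaN; assigning NaN raises.
try:
    tile[0, 0] = np.nan
except ValueError:
    pass

# The promotion is explicit at the call site: cast first, then fill.
masked = tile.astype(np.float32)
masked[masked == 65535] = np.nan  # now NaN fill is legal
```

The original `tile` stays `uint16` throughout, which is exactly the "native dtype preserved unless you opt in" behavior described above.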
obstore as the URL routing backend¶
Rasteret uses obstore as the HTTP transport layer for multi-cloud URL routing. Rasteret's byte-range read logic, tile decode, and concurrency are custom; obstore replaced direct aiohttp to gain unified S3/GCS/Azure routing without adding per-cloud clients.
Why obstore as a hard dependency rather than optional?
- Multi-cloud + auth: Rasteret has custom IO for byte range reads. obstore gives a single, well-tested interface for S3/GCS/Azure/HTTPS plus credential providers (requester-pays, SAS signing, Earthdata, etc.).
- Single code path: one backend means fewer branches to test and maintain.
- Multi-cloud: Rasteret auto-routes URLs to native cloud stores: `S3Store` for `s3://` and `*.s3.*.amazonaws.com`, `GCSStore` for `gs://` and `storage.googleapis.com`, `AzureStore` for `*.blob.core.windows.net`, and `HTTPStore` for everything else. Authenticated reads use obstore credential providers via `create_backend()`.
- Well-maintained lineage: obstore wraps the same Rust crate (`object_store`) that Databricks, InfluxDB, and many projects in the Arrow ecosystem depend on.
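The routing rules listed above reduce to simple URL pattern matching. The following standalone function is a sketch of those rules, returning store names as strings rather than constructing actual obstore objects:

```python
def route(url: str) -> str:
    """Classify a URL per the routing rules described above (sketch only)."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    if url.startswith("s3://") or (".s3." in host and host.endswith(".amazonaws.com")):
        return "S3Store"
    if url.startswith("gs://") or host == "storage.googleapis.com":
        return "GCSStore"
    if host.endswith(".blob.core.windows.net"):
        return "AzureStore"
    return "HTTPStore"


print(route("s3://bucket/scene/B04.tif"))                              # S3Store
print(route("https://bucket.s3.us-west-2.amazonaws.com/B04.tif"))      # S3Store
print(route("https://acct.blob.core.windows.net/container/B04.tif"))   # AzureStore
print(route("https://example.com/data/B04.tif"))                       # HTTPStore
```

Native stores matter because they unlock per-cloud authentication (requester-pays, SAS signing, and so on) that a plain HTTPS client cannot provide.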
Decoupled index and read layers¶
The Parquet index is the stable contract. The read layer (`COGReader`) is swappable via the `StorageBackend` protocol. Concretely:
- Custom backends (e.g. a pre-configured `S3Store`) can be plugged in without changing the index format.
- The index can be consumed by tools other than Rasteret (DuckDB, DataFusion).
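To make the "swappable" claim concrete, here is what such a protocol might look like. The method name and signature below are illustrative assumptions, not Rasteret's actual `StorageBackend` definition; the point is that any object satisfying the shape can replace the network reader, e.g. for tests:

```python
import asyncio
from typing import Protocol


class StorageBackend(Protocol):
    """Hypothetical read contract: a single async byte-range read."""

    async def get_range(self, url: str, offset: int, length: int) -> bytes: ...


class InMemoryBackend:
    """Test double: serves ranges from in-memory blobs instead of the network."""

    def __init__(self, blobs: dict[str, bytes]):
        self.blobs = blobs

    async def get_range(self, url: str, offset: int, length: int) -> bytes:
        return self.blobs[url][offset:offset + length]


backend = InMemoryBackend({"mem://scene/B04.tif": b"0123456789"})
chunk = asyncio.run(backend.get_range("mem://scene/B04.tif", 2, 4))
print(chunk)  # b'2345'
```

Because the index only records *where* bytes live, any backend that can serve a byte range satisfies the contract.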
- Contributors can work on ingest drivers and readers independently.
Schema flexibility¶
The Collection Parquet has three tiers of columns:
- Required (4 columns): `id`, `datetime`, `geometry`, `assets`. Every ingest path must produce these.
- COG acceleration (per-band): `{band}_metadata` struct columns with tile offsets, byte counts, transform, etc. These enable fast tiled reads and are added by the COG enrichment phase.
- User-extensible: `split`, `label`, `cloud_cover`, custom metadata. Anything the user's workflow needs.
This layered schema means the same Parquet file serves as both an I/O acceleration cache and an experiment metadata store.
See Schema Contract for the full column specification.
See also the Collection API reference.
The index is shareable¶
The Parquet index is small (megabytes) while the data it describes is large (terabytes). This makes the Collection practical to share:
- Team workflows: one person indexes 10,000 scenes, team members load the same index file and get identical accelerated reads.
- Paper reproduction: share a few-megabyte Parquet file instead of terabytes of imagery. A collaborator loads it, reads pixels from the same remote COGs, and gets the same results.
- CI/CD: commit the index to version control. The training pipeline loads it directly, no STAC queries, no re-indexing.
- Experiment management: different Parquet files for different experiments (different AOIs, different split assignments, different label columns) over the same underlying data.
Important caveat: The index is a snapshot of metadata at build time. If the underlying TIFF files change (reprocessed, moved to a different URL, deleted), the index becomes stale and needs to be rebuilt. Signed URLs also expire. Stable public URLs (S3, GCS, Azure public buckets) work best for long-lived indexes.
Tiled GeoTIFFs only¶
Rasteret's tile execution path expects tiled GeoTIFFs (COGs recommended for remote access). Striped/untiled TIFFs and non-TIFF formats are not supported in the read path. Use TorchGeo or rasterio directly for those.
This is an intentional scope constraint, not a limitation to fix. Tiled GeoTIFFs are the format where Rasteret's index-first approach provides the most value.