Ecosystem Comparison¶
Rasteret accelerates reads from tiled GeoTIFF collections by caching tile layout metadata in a Parquet index. It works alongside TorchGeo, xarray, and rasterio, not instead of them.
Collections are written as GeoParquet 1.1 today (WKB + geo metadata).
Parquet-native GEOMETRY/GEOGRAPHY logical types and GeoParquet 2.0 are
emerging; Rasteret tracks this and plans to adopt newer encodings when ecosystem
support stabilizes.
Interop¶
TorchGeo¶
Collection.to_torchgeo_dataset() returns a standard TorchGeo
GeoDataset. Your samplers,
DataLoader, and training loop do not change.
This is pipeline-level interop: Rasteret provides a TorchGeo dataset object
that plugs into TorchGeo's samplers and transforms, while Rasteret remains the
pixel I/O backend. TorchGeo's own RasterDataset still reads via rasterio/GDAL
and remains the right tool when Rasteret's COG/tile constraints don't apply.
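from torch.utils.data import DataLoader
from torchgeo.datasets import stack_samples
from torchgeo.samplers import RandomGeoSampler
dataset = collection.to_torchgeo_dataset(bands=["B04", "B03", "B02"], chip_size=256)
sampler = RandomGeoSampler(dataset, size=256, length=100)
loader = DataLoader(dataset, sampler=sampler, batch_size=4, collate_fn=stack_samples)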
GeoDataset contract¶
RasteretGeoDataset subclasses TorchGeo's GeoDataset and honors the
full contract that samplers and dataset composition rely on:
| Surface | What Rasteret does |
|---|---|
| __getitem__(GeoSlice) -> Sample | Returns {"image": Tensor, "bounds": Tensor, "transform": Tensor} (or "mask" when is_image=False) |
| index | GeoPandas GeoDataFrame with IntervalIndex named "datetime" and Shapely footprint geometry |
| crs | Set from the collection's EPSG code via CRS.from_epsg() |
| res | Derived from the first record's COG metadata transform |
| Samplers | Works with RandomGeoSampler, GridGeoSampler, and any sampler that reads bounds, index, and res |
| Dataset composition | Works with IntersectionDataset and UnionDataset; the index is designed so reset_index() does not conflict |
Rasteret replaces the I/O backend (async obstore instead of rasterio/GDAL) but speaks the same interface. Nothing downstream of the dataset object needs to change.
Rasteret additions¶
These are features Rasteret adds on top of the GeoDataset contract. They do not break interop because TorchGeo ignores unknown sample keys, and constructor parameters are Rasteret-specific.
| Feature | What it does | Interop impact |
|---|---|---|
| label_field | Adds sample["label"] from a metadata column | None: extra key, ignored by TorchGeo trainers |
| time_series=True | Stacks all spatially overlapping records into [T, C, H, W] | None: standard tensor shape, works with TorchGeo transforms |
| target_crs= | Reprojects scenes from different CRS zones on the fly | None: result has uniform CRS, transparent to samplers |
| cloud_config= | Configures authenticated cloud reads (requester-pays, signed URLs) | None: constructor-level, transparent to samplers |
| allow_resample=True | Resamples bands with different native resolutions onto a common grid | None: output tensor has uniform resolution |
Behavior details¶
Rasteret preserves the native COG dtype (e.g., uint16 for Sentinel-2)
whereas TorchGeo converts to float32 by default (via its dtype property).
Multi-CRS scenes are auto-reprojected to a common CRS using rasterio's
calculate_default_transform (which wraps GDAL) for correct resolution handling.
Rasteret's read pipeline can produce a valid_mask (boolean) so ML workflows
can distinguish filled pixels from real source data. The TorchGeo adapter keeps
samples TorchGeo-standard by default and does not include valid_mask.
For mask-style datasets, pass is_image=False to return sample["mask"]
instead of sample["image"] (single-band data squeezes the channel
dimension, matching TorchGeo RasterDataset conventions).
If requested bands have different resolutions, Rasteret fails fast by default.
To opt into resampling bands onto a common grid in the TorchGeo adapter, pass
allow_resample=True to Collection.to_torchgeo_dataset(...).
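The fail-fast rule above can be sketched as a small check. This is an illustration only: check_band_resolutions is a hypothetical helper, not a Rasteret function, and the choice of the finest resolution as the common grid is an assumption of this sketch rather than documented Rasteret behavior.

```python
def check_band_resolutions(band_resolutions, allow_resample=False):
    """Return the grid resolution to use, or raise if bands disagree.

    `band_resolutions` maps band name -> pixel size in CRS units.
    Hypothetical helper for illustration, not part of Rasteret.
    """
    distinct = set(band_resolutions.values())
    if len(distinct) == 1:
        return distinct.pop()
    if not allow_resample:
        # Fail fast: mixing resolutions silently would hide resampling costs.
        raise ValueError(
            f"bands have different resolutions: {band_resolutions}; "
            "pass allow_resample=True to resample onto a common grid"
        )
    # Assumption of this sketch: resample everything onto the finest grid.
    return min(distinct)


bands = {"B04": 10.0, "B11": 20.0}  # Sentinel-2: 10 m red, 20 m SWIR
try:
    check_band_resolutions(bands)
except ValueError as exc:
    print(f"rejected: {exc}")
print(check_band_resolutions(bands, allow_resample=True))  # 10.0
```

Raising by default keeps resampling an explicit decision, since it changes pixel values for the coarser bands.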
When records in a collection have different native resolutions, Rasteret warns at dataset creation time. The read path resamples each tile to the query grid correctly regardless.
See Tutorial 02 and Tutorial 05.
xarray / GeoPandas¶
Rasteret handles the I/O (async byte-range reads via obstore), then hands off to standard xarray and GeoPandas objects for analysis:
- Collection.get_xarray(...) returns an xr.Dataset
- Collection.get_gdf(...) returns a gpd.GeoDataFrame
See Tutorial 01.
CRS encoding¶
xarray output uses CF conventions via pyproj (no rioxarray dependency):
- spatial_ref coordinate with WKT2 (ISO 19162:2019), PROJJSON, and CF grid-mapping attributes
- GeoTransform attribute for GDAL-compatible tools
- Pixel-center coordinates (half-pixel offset from tile origin)
Code that uses ds.rio.* methods needs rioxarray installed separately
(pip install rioxarray). The spatial_ref coordinate written by Rasteret is
compatible with rioxarray once it is installed.
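The pixel-center convention can be illustrated with a short sketch. It assumes a GDAL-style six-element GeoTransform and a north-up raster with no rotation; pixel_center_coords is a hypothetical helper, not a Rasteret API.

```python
def pixel_center_coords(geotransform, width, height):
    """Return (xs, ys) pixel-center coordinates for a north-up raster.

    `geotransform` follows GDAL ordering:
    (x_origin, pixel_width, row_rotation, y_origin, col_rotation, pixel_height).
    Hypothetical helper for illustration, not part of Rasteret.
    """
    x0, dx, rx, y0, ry, dy = geotransform
    if rx != 0 or ry != 0:
        raise ValueError("rotated rasters need 2-D coordinate arrays")
    # Shift by half a pixel so each coordinate marks the cell center
    # rather than the upper-left corner.
    xs = [x0 + (col + 0.5) * dx for col in range(width)]
    ys = [y0 + (row + 0.5) * dy for row in range(height)]
    return xs, ys


# A 10 m grid whose upper-left corner sits at (600000, 5000000).
xs, ys = pixel_center_coords((600000.0, 10.0, 0.0, 5000000.0, 0.0, -10.0), 3, 2)
print(xs)  # [600005.0, 600015.0, 600025.0]
print(ys)  # [4999995.0, 4999985.0]
```

The negative pixel height makes y decrease with row index, which is the usual north-up layout.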
Data types¶
Band arrays return in the native COG dtype. For Sentinel-2 L2A, that is
uint16 (surface reflectance values 0-10000). Geometry masking fills
outside-AOI / outside-coverage pixels with the COG nodata value when
present, otherwise 0, preserving native dtype. For ML workloads that
should avoid learning from filled pixels, use the valid_mask returned
by Rasteret reads.
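As a rough illustration of why the mask matters, the sketch below averages reflectance over valid pixels only. It uses plain nested lists in place of arrays; the function name and the divisor of 10000 are assumptions taken from the 0-10000 range above, and any per-product offset is ignored.

```python
def masked_mean_reflectance(pixels, valid_mask, divisor=10000):
    """Mean reflectance over valid pixels only.

    `pixels` holds raw integer values (for example uint16 digital numbers)
    and `valid_mask` holds booleans of the same shape. Hypothetical helper;
    the divisor is an assumption for this sketch.
    """
    total = 0
    count = 0
    for pixel_row, mask_row in zip(pixels, valid_mask):
        for value, is_valid in zip(pixel_row, mask_row):
            if is_valid:  # skip nodata or zero-filled pixels
                total += value
                count += 1
    if count == 0:
        return None  # nothing valid in this chip
    return total / count / divisor


pixels = [[1200, 0], [1800, 0]]              # zeros are fill, not real data
valid_mask = [[True, False], [True, False]]
print(masked_mean_reflectance(pixels, valid_mask))  # 0.15
```

Averaging all four values without the mask would give 0.075, which is the kind of bias the mask is meant to prevent.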
Multi-CRS¶
When a query spans records from multiple CRS zones (e.g., adjacent UTM
zones), Rasteret auto-detects this and reprojects all tiles to the most
common CRS before merging. A warning is logged. Pass target_crs= to
get_xarray() or get_gdf() to override.
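The selection rule described above can be sketched with the standard library. pick_target_crs is a hypothetical name, and the tie-breaking behavior (first code seen wins) is a property of this sketch, not something the Rasteret docs specify.

```python
from collections import Counter


def pick_target_crs(record_epsg_codes, target_crs=None):
    """Choose the CRS that all tiles are reprojected into.

    An explicit `target_crs` wins; otherwise the most frequent EPSG code
    among the records is used. Hypothetical helper, not part of Rasteret.
    """
    if target_crs is not None:
        return target_crs
    if not record_epsg_codes:
        raise ValueError("no records to choose a CRS from")
    # most_common(1) returns [(code, count)]; equal counts keep first-seen order.
    code, _count = Counter(record_epsg_codes).most_common(1)[0]
    return code


codes = [32632, 32633, 32632, 32632, 32633]  # adjacent UTM zones 32N and 33N
print(pick_target_crs(codes))                   # 32632
print(pick_target_crs(codes, target_crs=3857))  # 3857
```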
rasterio¶
Rasteret uses rasterio for geometry masking (rasterio.features.geometry_mask),
multi-CRS reprojection (rasterio.warp.reproject), and TorchGeo query-grid
placement (rasterio.merge.merge via rio_semantics.py). CRS transforms and
coordinate operations use pyproj directly. Tile reads go through Rasteret's
own async pipeline backed by obstore. No GDAL in the tile-read path.
CRS encoding in xarray output uses pyproj's CF conventions (CRS.to_cf(),
CRS.to_wkt(), CRS.to_json()), not rioxarray.
Alternative approaches¶
These libraries solve related problems with different designs:
- GeoParquet "Parquet Raster" (alpha/WIP): a draft specification for storing raster payloads (and/or external raster references) in Parquet. Rasteret is different: it uses GeoParquet as a record table/index and reads pixel tiles from existing GeoTIFF/COG assets via byte-range I/O. If Parquet Raster stabilizes, it may become an interop/export target, but it is not what Rasteret writes today.
- TACO / tacoTIFF: packaging-first (materializes data into a TACO layout
  with a level0.parquet manifest). Rasteret is indexing-first (indexes
  existing tiled GeoTIFFs in place, no data copying). The approaches are
  complementary: Rasteret's
  DatasetDescriptor can point to a TACO level0.parquet via geoparquet_uri, and
  build_from_table() can ingest it like any other Parquet source. As TACO
  matures, deeper interop (e.g. layout-aware reads) is a natural extension.
- async-geotiff / async-tiff: fast low-level async GeoTIFF readers. Interop with Rasteret is possible by replacing the tile-reading layer, but they don't yet support passing pre-parsed IFD metadata.
- virtual-tiff: oriented towards making TIFF data accessible to the Zarr ecosystem by exposing tiles as Zarr-compatible chunks. Rasteret reads tiles directly via byte-range requests using a Parquet index of tile-layout metadata.
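To make the idea of byte-range reads from cached tile-layout metadata concrete, here is a minimal sketch. It assumes row-major per-tile offsets and byte counts, like the TIFF TileOffsets and TileByteCounts tags; the function and parameter names are hypothetical and do not reflect Rasteret's internal schema, and the window is assumed to lie inside the image.

```python
def tile_byte_ranges(window, image_width, tile_size, offsets, byte_counts):
    """Return (tile_index, start_byte, end_byte) for tiles touching `window`.

    `window` is (col_off, row_off, width, height) in pixels. `offsets` and
    `byte_counts` are row-major per-tile lists. Hypothetical helper.
    """
    col_off, row_off, width, height = window
    tile_w, tile_h = tile_size
    tiles_across = -(-image_width // tile_w)  # ceiling division
    first_col, last_col = col_off // tile_w, (col_off + width - 1) // tile_w
    first_row, last_row = row_off // tile_h, (row_off + height - 1) // tile_h
    ranges = []
    for tile_row in range(first_row, last_row + 1):
        for tile_col in range(first_col, last_col + 1):
            idx = tile_row * tiles_across + tile_col
            start = offsets[idx]
            # HTTP Range headers use an inclusive end byte.
            ranges.append((idx, start, start + byte_counts[idx] - 1))
    return ranges


# A 1024x1024 image with 512x512 tiles has a 2x2 tile grid.
offsets = [1000, 5000, 9000, 13000]
byte_counts = [4000, 4000, 4000, 4000]
window = (400, 100, 200, 200)  # straddles the two top tiles
for idx, start, end in tile_byte_ranges(window, 1024, (512, 512), offsets, byte_counts):
    print(f"tile {idx}: Range: bytes={start}-{end}")
```

Because the offsets are already known, no header request is needed before the tile bytes are fetched; only the two intersecting tiles are read.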
When to use what¶
| Your data | Recommendation |
|---|---|
| Cloud-hosted tiled GeoTIFFs (Sentinel-2, Landsat, etc.) | Rasteret (over 20x faster) |
| Local tiled GeoTIFFs | Rasteret works; speedup is smaller, but the index is still useful for filtering and sharing |
| Non-tiled GeoTIFFs (striped layout) | TorchGeo / rasterio |
| Non-TIFF formats (NetCDF, HDF5, GRIB) | TorchGeo / rasterio |
Testing¶
The test suite includes pixel-level comparisons against direct rasterio
reads for the xarray, GeoDataFrame, and TorchGeo output paths. The TorchGeo
comparison uses rasterio.merge.merge as the oracle, matching what TorchGeo's
own _merge_or_stack calls. Coverage spans 12 datasets including Sentinel-2,
Landsat, NAIP, Copernicus DEM, ESA WorldCover, and AEF (south-up). See
test_dataset_pixel_comparison.py (requires --network), plus
test_public_network_smoke.py, test_torchgeo_network.py, and
test_network_smoke.py.
If you encounter edge cases where output differs from rasterio, please file an issue.