Rasteret
Made to beat cold starts. Index-first access to cloud-native GeoTIFF collections for ML and geospatial analysis.
The cold-start tax
Your colleague read those Sentinel-2 scenes last Tuesday. The tools re-parsed every file header over HTTP, per scene and per band. So did CI. So did the intern's notebook. By default, PyTorch respawns DataLoader workers every epoch, so your own training run re-parses them hundreds of times over.
Across a team, a single project repeats millions of redundant header requests, and none of them delivers a single pixel.
What Rasteret does
Parse headers once, cache in Parquet, read pixels concurrently with no GDAL in the path.
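To make "header" concrete, here is a minimal sketch of what a tiled GeoTIFF header holds: a small directory of tags that says where each tile's bytes sit in the file. It is not Rasteret's parser. It builds a tiny little-endian classic-TIFF header in memory (the tag values are hypothetical) and reads it back with `struct`, assuming every value fits inline in its directory entry.

```python
import struct

# Standard TIFF tag numbers for the fields a tiled reader needs.
TAG_NAMES = {
    256: "ImageWidth",
    257: "ImageLength",
    322: "TileWidth",
    323: "TileLength",
    324: "TileOffsets",
    325: "TileByteCounts",
}


def build_demo_header():
    """Build a classic little-endian TIFF header with one directory (hypothetical values)."""
    # Each entry: (tag, type, count, value). Type 4 is LONG (uint32), stored inline.
    entries = [
        (256, 4, 1, 512),
        (257, 4, 1, 512),
        (322, 4, 1, 512),
        (323, 4, 1, 512),
        (324, 4, 1, 4096),   # the single tile starts at byte 4096
        (325, 4, 1, 20000),  # and is 20000 bytes long
    ]
    header = struct.pack("<2sHI", b"II", 42, 8)  # byte order, magic 42, directory offset
    ifd = struct.pack("<H", len(entries))
    for entry in entries:
        ifd += struct.pack("<HHII", *entry)
    ifd += struct.pack("<I", 0)  # no further directories
    return header + ifd


def parse_header(buf):
    """Return {tag name: value} for the first directory of a little-endian classic TIFF."""
    order, magic, ifd_offset = struct.unpack_from("<2sHI", buf, 0)
    if order != b"II" or magic != 42:
        raise ValueError("not a little-endian classic TIFF")
    (n_entries,) = struct.unpack_from("<H", buf, ifd_offset)
    tags = {}
    for i in range(n_entries):
        tag, _type, _count, value = struct.unpack_from("<HHII", buf, ifd_offset + 2 + 12 * i)
        if tag in TAG_NAMES:
            tags[TAG_NAMES[tag]] = value
    return tags


tags = parse_header(build_demo_header())
```

Real files add big-endian and BigTIFF variants, out-of-line arrays of tile offsets, and overviews, which is why fetching and parsing headers is the slow part worth doing once.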
- No cold-start penalty: Header metadata lives in Parquet, not behind HTTP. New kernel? New pod? Same speed as the tenth run.
- Zero downloads: Work with terabytes of cloud imagery while storing only megabytes of metadata locally.
- No STAC at training time: Query STAC once at setup. Zero API calls during training loops, no rate-limiting risk.
- Reproducible and shareable: Same Parquet index = same records = same results. Share a 5 MB file and collaborators skip re-indexing.
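The index-first idea can be sketched with the standard library alone. This is an illustration, not Rasteret's implementation: `HeaderSource`, `load_or_build_index`, the example URLs, and the record fields are all hypothetical, and a JSON file stands in for the Parquet index so the snippet runs without extra dependencies.

```python
import json
import tempfile
from pathlib import Path


class HeaderSource:
    """Hypothetical stand-in for HTTP range requests that read file headers."""

    def __init__(self):
        self.requests = 0

    def fetch_header(self, url):
        self.requests += 1
        # Pretend these values came from parsing the remote TIFF header.
        return {"url": url, "tile_size": 512, "tile_offsets": [1024, 9216]}


def load_or_build_index(urls, source, index_path):
    """Return header records, touching the network only if no index file exists."""
    if index_path.exists():
        return json.loads(index_path.read_text())
    records = [source.fetch_header(url) for url in urls]
    index_path.write_text(json.dumps(records))
    return records


urls = [f"https://example.com/scene_{i}/B04.tif" for i in range(5)]
source = HeaderSource()

with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "index.json"
    first = load_or_build_index(urls, source, index_path)   # cold: one request per file
    requests_after_first = source.requests
    second = load_or_build_index(urls, source, index_path)  # new kernel or pod: reads the index
    requests_after_second = source.requests
```

The second call returns the same records without issuing a single header request, which is the property the cards above describe. Copying the index file to a collaborator gives them the same starting point.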
Dataset catalog
Rasteret ships with 12 built-in datasets covering Sentinel-2, Landsat, NAIP, Copernicus DEM, ESRI Land Cover, ESA WorldCover, USDA CDL, ALOS DEM, NASADEM, and AlphaEarth Foundation (AEF) embeddings, sourced from Earth Search, Planetary Computer, and AEF.
The catalog is open and community-driven. Each entry is ~20 lines of Python. One PR adds a dataset and every user gets access on the next release. No proprietary APIs, no platform lock-in. See Dataset Catalog for details, or Design Decisions for the thinking behind it.
Pick any ID and pass it to build(). For datasets not in the catalog, use
build_from_stac() or build_from_table().
New here?
Start with Getting Started, then run Tutorial 01 - Quickstart: xarray.
How it works
import rasteret
# 1. Build index (one-time, cached)
collection = rasteret.build("earthsearch/sentinel-2-l2a", name="s2", bbox=(...), date_range=(...))
collection.bands # ['B01', 'B02', ..., 'B12', 'SCL']
# 2. Filter metadata (in-memory, instant)
sub = collection.subset(cloud_cover_lt=20, date_range=("2024-03-01", "2024-06-01"))
# 3. Read pixels - pass a bbox, Arrow column, Shapely geometry, or WKB
ds = sub.get_xarray(geometries=(-122.5, 37.7, -122.3, 37.9), bands=["B04", "B08"])
dataset = sub.to_torchgeo_dataset(bands=["B04", "B03", "B02"]) # TorchGeo
build() picks from a growing catalog of pre-registered datasets across
Earth Search and Planetary Computer. For existing Parquet
files - Source Cooperative exports, STAC GeoParquet,
or your own catalog - use build_from_table().
For multi-band COGs like AlphaEarth Foundation embeddings,
use band_index in the asset dict to select individual bands from a shared file.
For custom STAC APIs not in the catalog, use build_from_stac().
See the API Reference for full method signatures.
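Step 2 above is fast because it only touches metadata rows. As a rough sketch of that kind of filter, here is a plain-Python version over a few hypothetical records. The record fields, the inclusive date bounds, and this standalone `subset` function are assumptions for illustration, not Rasteret's actual schema or semantics.

```python
from datetime import date

# Hypothetical index rows; a real index carries many more columns.
records = [
    {"id": "scene-a", "datetime": date(2024, 2, 10), "cloud_cover": 5.0},
    {"id": "scene-b", "datetime": date(2024, 3, 15), "cloud_cover": 12.0},
    {"id": "scene-c", "datetime": date(2024, 4, 20), "cloud_cover": 45.0},
    {"id": "scene-d", "datetime": date(2024, 5, 30), "cloud_cover": 19.9},
]


def subset(rows, cloud_cover_lt, date_range):
    """Keep rows below the cloud-cover threshold and inside the date range (inclusive here)."""
    start, end = (date.fromisoformat(d) for d in date_range)
    return [
        row
        for row in rows
        if row["cloud_cover"] < cloud_cover_lt and start <= row["datetime"] <= end
    ]


selected = subset(records, cloud_cover_lt=20, date_range=("2024-03-01", "2024-06-01"))
ids = [row["id"] for row in selected]
```

Here `ids` is `["scene-b", "scene-d"]`: one scene falls outside the date range and one is too cloudy. No file is opened and no request is made to decide that.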
Benchmarks

The benchmarks report cold-start numbers: no HTTP cache, no OS page cache. Every new notebook kernel, VM, k8s pod, or CI runner starts cold. That is the real-world scenario, and where Rasteret's Parquet index matters most.
For full methodology and numbers, see Benchmarks.
Share your speed-ups
Running Rasteret on your own data? We'd love to hear your numbers. Post in Show and Tell or drop them in Discord.
Scope
- Optimized for remote, tiled GeoTIFFs (COGs), where the biggest speedups happen.
- Works with local tiled GeoTIFFs too. Speedups are smaller without network overhead, but the Parquet index is still useful for organizing, filtering, and sharing collections.
- Non-tiled GeoTIFFs and non-TIFF formats (NetCDF, HDF5) are best handled by TorchGeo or rasterio directly.
- CRS is encoded via CF conventions (pyproj); no rioxarray dependency.
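Tiling is what makes the first bullet work: a reader only needs the tiles that overlap the requested window, and the header says where each tile's bytes sit. A minimal sketch of that arithmetic follows, assuming a single-plane image with row-major tile numbering as in the TIFF specification. The offsets and byte counts are made up for the example, and `tiles_for_window` is a hypothetical helper, not a Rasteret function.

```python
def tiles_for_window(col_off, row_off, width, height, tile_size, tiles_across):
    """Return row-major indices of the tiles that a pixel window touches."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [
        row * tiles_across + col
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# Hypothetical header values for a 2048x2048 image with 512-pixel tiles (a 4x4 grid).
tile_size = 512
tiles_across = 4
tile_offsets = [4096 + i * 100_000 for i in range(16)]
tile_byte_counts = [100_000] * 16

# A 100x100 pixel window that straddles the boundary between the first two tiles.
needed = tiles_for_window(
    col_off=500, row_off=0, width=100, height=100,
    tile_size=tile_size, tiles_across=tiles_across,
)

# Inclusive byte ranges, in the form used by HTTP Range requests.
byte_ranges = [
    (tile_offsets[i], tile_offsets[i] + tile_byte_counts[i] - 1) for i in needed
]
```

Only 2 of the 16 tiles are needed for this window, so only two byte ranges have to be fetched. A non-tiled file offers no such shortcut, which is why those files are out of scope.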
Rasteret is an opt-in accelerator: RasteretGeoDataset is a standard
TorchGeo GeoDataset subclass that honors the full contract (index, crs,
res, __getitem__). Samplers, DataLoader, transforms, and dataset composition
(IntersectionDataset, UnionDataset) all work unchanged. Rasteret replaces
the I/O backend, not the training interface.
For how Rasteret relates to other tools, see
Ecosystem Comparison.
- Getting Started: Installation and first steps.
- Tutorials: Hands-on notebooks for learning Rasteret.
- How-To Guides: Task-oriented recipes for common workflows.
- API Reference: Auto-generated from source code docstrings.
- Explanation: Architecture, design decisions, and ecosystem context.
- Contributing: Add a dataset, improve docs, or build something new. All contributions are welcome.