Rasteret

Made to beat cold starts. Index-first access to cloud-native GeoTIFF collections for ML and geospatial analysis.

The cold-start tax

Your colleague read those Sentinel-2 scenes last Tuesday. The tools re-parsed every file header over HTTP - per scene, per band. So did CI. So did the intern's notebook. By default, PyTorch respawns DataLoader workers every epoch, so your own training run re-parses them hundreds of times over.

Across a team, a single project can add up to millions of redundant header requests, none of which deliver a single pixel.
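
To get a feel for the scale, here is a back-of-the-envelope estimate. Every number in it is an illustrative assumption rather than a measurement, and the function name `redundant_header_requests` is made up for this sketch. It assumes each file is opened at least once per epoch and that no header cache survives between epochs or runs, so it is a lower bound.

```python
def redundant_header_requests(scenes, bands, epochs, runs, requests_per_header=1):
    """Lower-bound estimate of header-only HTTP requests.

    scenes              -- number of scenes in the collection
    bands               -- bands read per scene (one file per band)
    epochs              -- training epochs per run (headers re-parsed each epoch)
    runs                -- independent runs across the team (notebooks, CI, training)
    requests_per_header -- HTTP requests needed to parse one file header (assumed)
    """
    return scenes * bands * requests_per_header * epochs * runs


# Hypothetical project: 1,000 scenes, 4 bands, 100 epochs, 5 runs across the team,
# 2 requests to parse each header.
total = redundant_header_requests(scenes=1000, bands=4, epochs=100, runs=5,
                                  requests_per_header=2)
print(f"{total:,} header requests")  # 4,000,000 header requests
```

With a cached index, the header cost is paid once per file instead of once per file per epoch per run.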

What Rasteret does

Parse headers once, cache in Parquet, read pixels concurrently with no GDAL in the path.

STAC API / GeoParquet  -->  Parquet Index  -->  Tile-level byte reads
       (once)                 (queryable)          (no GDAL, no headers)
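
The last stage of the pipeline can be sketched in plain Python. This is not Rasteret's implementation: `BandTileIndex` and `byte_ranges_for_window` are hypothetical names, and the dataclass stands in for one row of the Parquet index. The sketch only shows why cached tile offsets and byte counts are enough to turn a pixel window into byte ranges without touching the file header again. It assumes a single-band tiled TIFF whose tiles are stored in row-major order, as the TIFF tile layout defines.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BandTileIndex:
    """Hypothetical stand-in for one cached index row: one band of one scene."""
    image_width: int
    image_height: int
    tile_width: int
    tile_height: int
    tile_offsets: Tuple[int, ...]      # byte offset of each tile, row-major
    tile_byte_counts: Tuple[int, ...]  # compressed size of each tile, same order


def byte_ranges_for_window(idx: BandTileIndex, col_off: int, row_off: int,
                           width: int, height: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs for every tile a pixel window touches."""
    if width <= 0 or height <= 0:
        raise ValueError("window must have positive width and height")

    # Number of tiles per row/column of the tile grid (ceiling division).
    tiles_across = -(-idx.image_width // idx.tile_width)
    tiles_down = -(-idx.image_height // idx.tile_height)

    # Clamp the window to the tile grid.
    first_col = max(col_off // idx.tile_width, 0)
    last_col = min((col_off + width - 1) // idx.tile_width, tiles_across - 1)
    first_row = max(row_off // idx.tile_height, 0)
    last_row = min((row_off + height - 1) // idx.tile_height, tiles_down - 1)

    ranges = []
    for row in range(first_row, last_row + 1):
        for col in range(first_col, last_col + 1):
            i = row * tiles_across + col  # row-major tile number
            ranges.append((idx.tile_offsets[i], idx.tile_byte_counts[i]))
    return ranges


# A 1024x1024 image cut into four 512x512 tiles (made-up offsets and sizes).
index = BandTileIndex(1024, 1024, 512, 512,
                      tile_offsets=(1000, 5000, 9000, 13000),
                      tile_byte_counts=(4000, 4000, 4000, 4000))
print(byte_ranges_for_window(index, col_off=600, row_off=100, width=200, height=600))
# [(5000, 4000), (13000, 4000)]
```

Each pair maps directly onto an HTTP range request (for example `Range: bytes=5000-8999`), and independent ranges can be fetched concurrently.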

No cold-start penalty

Header metadata lives in Parquet, not behind HTTP. New kernel? New pod? Same speed as the tenth run.

Zero downloads

Work with terabytes of cloud imagery while storing only megabytes of metadata locally.

No STAC at training time

Query STAC once at setup. Zero API calls during training loops, no rate-limiting risk.

Reproducible and shareable

Same Parquet index = same records = same results. Share a 5 MB file and collaborators skip re-indexing.

Dataset catalog

Rasteret ships with 12 built-in datasets: Sentinel-2, Landsat, NAIP, Copernicus DEM, ESRI Land Cover, ESA WorldCover, USDA CDL, ALOS DEM, NASADEM, and AlphaEarth Foundation embeddings across Earth Search, Planetary Computer, and AEF.

rasteret datasets list          # CLI

for d in rasteret.DatasetRegistry.list():
    print(d.id, d.name)         # Python

The catalog is open and community-driven. Each entry is ~20 lines of Python. One PR adds a dataset and every user gets access on the next release. No proprietary APIs, no platform lock-in. See Dataset Catalog for details, or Design Decisions for the thinking behind it.

Pick any ID and pass it to build(). For datasets not in the catalog, use build_from_stac() or build_from_table().

New here?

Start with Getting Started, then run Tutorial 01 - Quickstart: xarray.

How it works

import rasteret

# 1. Build index (one-time, cached)
collection = rasteret.build("earthsearch/sentinel-2-l2a", name="s2", bbox=(...), date_range=(...))
collection.bands   # ['B01', 'B02', ..., 'B12', 'SCL']

# 2. Filter metadata (in-memory, instant)
sub = collection.subset(cloud_cover_lt=20, date_range=("2024-03-01", "2024-06-01"))

# 3. Read pixels - pass a bbox, Arrow column, Shapely geometry, or WKB
ds = sub.get_xarray(geometries=(-122.5, 37.7, -122.3, 37.9), bands=["B04", "B08"])
dataset = sub.to_torchgeo_dataset(bands=["B04", "B03", "B02"])    # TorchGeo

build() picks from a growing catalog of pre-registered datasets across Earth Search and Planetary Computer. For existing Parquet files - Source Cooperative exports, STAC GeoParquet, or your own catalog - use build_from_table(). For multi-band COGs like AlphaEarth Foundation embeddings, use band_index in the asset dict to select individual bands from a shared file. For custom STAC APIs not in the catalog, use build_from_stac(). See the API Reference for full method signatures.
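
Step 2 in the snippet above is a pure metadata filter, which is why it needs no network access. The sketch below illustrates the idea with plain dictionaries. It is not how Rasteret implements `subset()`: the `filter_records` helper, the record fields, and the inclusive treatment of the date range are all assumptions made for illustration.

```python
from datetime import date


def filter_records(records, cloud_cover_lt=None, date_range=None):
    """Keep records under a cloud-cover threshold and inside a date range.

    records        -- iterable of dicts with "cloud_cover" (percent) and "date" keys
    cloud_cover_lt -- keep records with cloud cover strictly below this value
    date_range     -- (start, end) ISO date strings, treated as inclusive here
    """
    start = end = None
    if date_range is not None:
        start, end = (date.fromisoformat(d) for d in date_range)

    kept = []
    for rec in records:
        if cloud_cover_lt is not None and rec["cloud_cover"] >= cloud_cover_lt:
            continue
        if date_range is not None and not (start <= rec["date"] <= end):
            continue
        kept.append(rec)
    return kept


# Made-up scene records standing in for rows of the index.
scenes = [
    {"id": "a", "cloud_cover": 5.0, "date": date(2024, 3, 15)},
    {"id": "b", "cloud_cover": 35.0, "date": date(2024, 4, 1)},   # too cloudy
    {"id": "c", "cloud_cover": 10.0, "date": date(2024, 7, 1)},   # outside range
]
selected = filter_records(scenes, cloud_cover_lt=20,
                          date_range=("2024-03-01", "2024-06-01"))
print([rec["id"] for rec in selected])  # ['a']
```

Because the filter only touches metadata that is already local, it can be rerun with different thresholds without issuing any requests.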

Benchmarks

Figure: Rasteret vs TorchGeo benchmark

These are cold-start numbers: no HTTP cache, no OS page cache. Every new notebook kernel, VM, k8s pod, or CI runner starts cold. That is the real-world scenario, and where Rasteret's Parquet index matters most.

For full methodology and numbers, see Benchmarks.

Share your speed-ups

Running Rasteret on your own data? We'd love to hear your numbers. Post in Show and Tell or drop them in Discord.

Scope

Optimized for remote, tiled GeoTIFFs (COGs), where the biggest speedups happen.
Works with local tiled GeoTIFFs too. Speedups are smaller without network overhead, but the Parquet index is still useful for organizing, filtering, and sharing collections.
Non-tiled GeoTIFFs and non-TIFF formats (NetCDF, HDF5) are best handled by TorchGeo or rasterio directly.
CRS is encoded via CF conventions (pyproj); no rioxarray dependency.

Rasteret is an opt-in accelerator: RasteretGeoDataset is a standard TorchGeo GeoDataset subclass that honors the full contract (index, crs, res, __getitem__). Samplers, DataLoader, transforms, and dataset composition (IntersectionDataset, UnionDataset) all work unchanged. Rasteret replaces the I/O backend, not the training interface. For how Rasteret relates to other tools, see Ecosystem Comparison.

Getting Started

Installation and first steps.

Tutorials

Hands-on notebooks for learning Rasteret.

How-To Guides

Task-oriented recipes for common workflows.

API Reference

Auto-generated from source code docstrings.

Explanation

Architecture, design decisions, and ecosystem context.

Contributing

Add a dataset, improve docs, or build something new. All contributions are welcome.
