
Rasteret

Made to beat cold starts. Index-first access to cloud-native GeoTIFF collections for ML and geospatial analysis.


The cold-start tax

Your colleague read those Sentinel-2 scenes last Tuesday. The tools re-parsed every file header over HTTP - per scene, per band. So did CI. So did the intern's notebook. PyTorch respawns DataLoader workers every epoch, so your own training run re-parses them hundreds of times over.

Across a team, a single project can repeat millions of redundant header requests, none of which deliver a single pixel.
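To see how the count can reach the millions, here is a back-of-the-envelope sketch. Every number in it is a hypothetical input chosen for illustration, not a measurement, and `redundant_header_requests` is an illustrative helper, not part of Rasteret.

```python
def redundant_header_requests(scenes, bands, requests_per_header, cold_starts):
    """Count requests that fetch only header metadata, never pixel data."""
    return scenes * bands * requests_per_header * cold_starts


# Hypothetical inputs: 2,000 scenes, 4 bands each, 2 HTTP requests to parse
# one header, and 100 cold starts (for example, 100 epochs with respawned workers).
per_training_run = redundant_header_requests(
    scenes=2_000, bands=4, requests_per_header=2, cold_starts=100
)
print(f"{per_training_run:,}")  # 1,600,000
```

Multiply that by every colleague, notebook, and CI run touching the same scenes and the total grows accordingly.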

What Rasteret does

Parse headers once, cache in Parquet, read pixels concurrently with no GDAL in the path.

STAC API / GeoParquet  -->  Parquet Index  -->  Tile-level byte reads
       (once)                 (queryable)          (no GDAL, no headers)

  • No cold-start penalty
    Header metadata lives in Parquet, not behind HTTP. New kernel? New pod? Same speed as the tenth run.

  • Zero downloads
    Work with terabytes of cloud imagery while storing only megabytes of metadata locally.

  • No STAC at training time
    Query STAC once at setup. Zero API calls during training loops, no rate-limiting risk.

  • Reproducible and shareable
    Same Parquet index = same records = same results. Share a 5 MB file and collaborators skip re-indexing.
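
Why cached headers make tile-level byte reads possible can be sketched in a few lines. A tiled TIFF header lists a byte offset and byte count for every tile, stored in row-major order, so once those arrays are cached a pixel window maps directly to HTTP Range requests. This is a minimal sketch of the general idea, not Rasteret's implementation: `CachedHeader`, its field names, and `tile_ranges` are hypothetical.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class CachedHeader:
    """Hypothetical cached header record; field names are illustrative."""
    image_width: int
    image_height: int
    tile_width: int
    tile_height: int
    tile_offsets: list[int]      # byte offset of each tile, row-major
    tile_byte_counts: list[int]  # compressed size of each tile, row-major


def tile_ranges(header, col_off, row_off, width, height):
    """Return HTTP Range header values for the tiles intersecting a pixel window."""
    tiles_across = ceil(header.image_width / header.tile_width)
    first_col = col_off // header.tile_width
    last_col = (col_off + width - 1) // header.tile_width
    first_row = row_off // header.tile_height
    last_row = (row_off + height - 1) // header.tile_height

    ranges = []
    for row in range(first_row, last_row + 1):
        for col in range(first_col, last_col + 1):
            idx = row * tiles_across + col
            start = header.tile_offsets[idx]
            end = start + header.tile_byte_counts[idx] - 1  # Range ends are inclusive
            ranges.append(f"bytes={start}-{end}")
    return ranges


# Toy 1024x1024 image with 512x512 tiles (four tiles in total).
header = CachedHeader(1024, 1024, 512, 512,
                      tile_offsets=[1000, 2000, 3000, 4000],
                      tile_byte_counts=[100, 200, 300, 400])
print(tile_ranges(header, col_off=500, row_off=0, width=24, height=10))
```

The window straddles the two top tiles, so only those two byte ranges need fetching, and no header request is made at read time.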


Dataset catalog

Rasteret ships with 12 built-in datasets, including Sentinel-2, Landsat, NAIP, Copernicus DEM, ESRI Land Cover, ESA WorldCover, USDA CDL, ALOS DEM, NASADEM, and AlphaEarth Foundation embeddings. They are sourced from Earth Search, Planetary Computer, and AEF.

rasteret datasets list                        # CLI

import rasteret                               # Python
for d in rasteret.DatasetRegistry.list():
    print(d.id, d.name)

The catalog is open and community-driven. Each entry is ~20 lines of Python. One PR adds a dataset and every user gets access on the next release. No proprietary APIs, no platform lock-in. See Dataset Catalog for details, or Design Decisions for the thinking behind it.

Pick any ID and pass it to build(). For datasets not in the catalog, use build_from_stac() or build_from_table().


New here?

Start with Getting Started, then run Tutorial 01 - Quickstart: xarray.

How it works

import rasteret

# 1. Build index (one-time, cached)
collection = rasteret.build("earthsearch/sentinel-2-l2a", name="s2", bbox=(...), date_range=(...))
collection.bands   # ['B01', 'B02', ..., 'B12', 'SCL']

# 2. Filter metadata (in-memory, instant)
sub = collection.subset(cloud_cover_lt=20, date_range=("2024-03-01", "2024-06-01"))

# 3. Read pixels - pass a bbox, Arrow column, Shapely geometry, or WKB
ds = sub.get_xarray(geometries=(-122.5, 37.7, -122.3, 37.9), bands=["B04", "B08"])
dataset = sub.to_torchgeo_dataset(bands=["B04", "B03", "B02"])    # TorchGeo

build() picks from a growing catalog of pre-registered datasets across Earth Search and Planetary Computer. For existing Parquet files - Source Cooperative exports, STAC GeoParquet, or your own catalog - use build_from_table(). For multi-band COGs like AlphaEarth Foundation embeddings, use band_index in the asset dict to select individual bands from a shared file. For custom STAC APIs not in the catalog, use build_from_stac(). See the API Reference for full method signatures.

Benchmarks

[Figure: Rasteret vs TorchGeo benchmark]

These are cold-start numbers: no HTTP cache, no OS page cache. Every new notebook kernel, VM, k8s pod, or CI runner starts cold. That is the real-world scenario, and where Rasteret's Parquet index matters most.

For full methodology and numbers, see Benchmarks.

Share your speed-ups

Running Rasteret on your own data? We'd love to hear your numbers. Post in Show and Tell or drop them in Discord.

Scope

  • Optimized for remote, tiled GeoTIFFs (COGs), where the biggest speedups happen.
  • Works with local tiled GeoTIFFs too. Speedups are smaller without network overhead, but the Parquet index is still useful for organizing, filtering, and sharing collections.
  • Non-tiled GeoTIFFs and non-TIFF formats (NetCDF, HDF5) are best handled by TorchGeo or rasterio directly.
  • CRS is encoded via CF conventions (pyproj); no rioxarray dependency.

Rasteret is an opt-in accelerator: RasteretGeoDataset is a standard TorchGeo GeoDataset subclass that honors the full contract (index, crs, res, __getitem__). Samplers, DataLoader, transforms, and dataset composition (IntersectionDataset, UnionDataset) all work unchanged. Rasteret replaces the I/O backend, not the training interface. For how Rasteret relates to other tools, see Ecosystem Comparison.


  • Getting Started
    Installation and first steps.

  • Tutorials
    Hands-on notebooks for learning Rasteret.

  • How-To Guides
    Task-oriented recipes for common workflows.

  • API Reference
    Auto-generated from source code docstrings.

  • Explanation
    Architecture, design decisions, and ecosystem context.

  • Contributing
    Add a dataset, improve docs, or build something new. All contributions are welcome.