Benchmarks

This page records benchmark results for Rasteret's index-first reads against TorchGeo/rasterio, Google Earth Engine, and Hugging Face datasets baselines. Treat the exact numbers as environment-specific; the useful signal is where time is spent in each workflow.

The TorchGeo comparison follows the workflow in docs/tutorials/05_torchgeo_benchmark_rasteret_vs_rasterio.ipynb.

Environment: Ubuntu Linux, Python 3.13, us-west-2 EC2 instance.

Data source: Sentinel-2 L2A via Earth Search v1, with COGs on S3 in us-west-2.

Controlled variables: same scenes, AOIs, sampler, DataLoader, and stack_samples collate path. Both paths output the same [batch, T, C, H, W] tensor shape.

TorchGeo runs with GDAL settings based on Pangeo COG best practices:

GDAL_DISABLE_READDIR_ON_OPEN = "EMPTY_DIR"
AWS_NO_SIGN_REQUEST = "YES"
GDAL_MAX_RAW_BLOCK_CACHE_SIZE = "200000000"
GDAL_SWATH_SIZE = "200000000"
VSI_CURL_CACHE_SIZE = "200000000"
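
One way to apply these settings from Python is to export them as environment variables before any dataset is opened. This is a minimal sketch, assuming the settings are passed through the process environment; the benchmark notebook may configure them differently (for example through `rasterio.Env`), and the helper name `apply_gdal_settings` is hypothetical.

```python
import os

# Settings copied from the list above. Values are strings because
# environment variables are always text.
GDAL_SETTINGS = {
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
    "AWS_NO_SIGN_REQUEST": "YES",
    "GDAL_MAX_RAW_BLOCK_CACHE_SIZE": "200000000",
    "GDAL_SWATH_SIZE": "200000000",
    "VSI_CURL_CACHE_SIZE": "200000000",
}


def apply_gdal_settings(settings=GDAL_SETTINGS):
    """Hypothetical helper: export the settings into os.environ."""
    os.environ.update(settings)
    return dict(settings)


# Call this before opening any remote COG so every read sees the same settings.
apply_gdal_settings()
```

Keeping the settings in one dictionary makes it easy to log them alongside the timings, so a result can be traced back to the configuration that produced it.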

TorchGeo / rasterio Baseline

TorchGeo's native path reads remote COGs through GDAL/rasterio. Rasteret reads the pre-built collection metadata and fetches only pixel byte ranges.

Cold start:

| Scenario | TorchGeo/rasterio | Rasteret | Speedup | Shape |
|---|---|---|---|---|
| Single AOI, 15 scenes | 9.08 s | 1.14 s | 8.0x | [2, 15, 1, 256, 256] |
| Multi-AOI, 30 scenes | 42.05 s | 2.25 s | 18.7x | [4, 30, 1, 256, 256] |
| Cross-CRS, 12 scenes | 12.47 s | 0.59 s | 21.3x | [2, 12, 1, 256, 256] |

Warm cache, immediate re-run:

| Scenario | TorchGeo/rasterio | Rasteret | Speedup | Shape |
|---|---|---|---|---|
| Single AOI, 15 scenes | 9.14 s | 0.81 s | 11.3x | [2, 15, 1, 256, 256] |
| Multi-AOI, 30 scenes | 29.68 s | 2.60 s | 11.4x | [4, 30, 1, 256, 256] |
| Cross-CRS, 12 scenes | 3.61 s | 1.06 s | 3.4x | [2, 12, 1, 256, 256] |
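
The speedup column is the baseline time divided by the Rasteret time. A short sketch that recomputes it from the warm-cache rows above; the function name `speedup` is illustrative, and because the table inputs are rounded to two decimals, a recomputed ratio elsewhere on this page can differ from the published one in the last digit.

```python
def speedup(baseline_s: float, rasteret_s: float) -> float:
    """Return how many times faster the Rasteret path is than the baseline."""
    if rasteret_s <= 0:
        raise ValueError("rasteret_s must be positive")
    return baseline_s / rasteret_s


# (scenario, TorchGeo/rasterio seconds, Rasteret seconds) from the warm-cache table
warm_rows = [
    ("Single AOI, 15 scenes", 9.14, 0.81),
    ("Multi-AOI, 30 scenes", 29.68, 2.60),
    ("Cross-CRS, 12 scenes", 3.61, 1.06),
]

for name, baseline, rasteret in warm_rows:
    print(f"{name}: {speedup(baseline, rasteret):.1f}x")
# Single AOI, 15 scenes: 11.3x
# Multi-AOI, 30 scenes: 11.4x
# Cross-CRS, 12 scenes: 3.4x
```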

Figure: TorchGeo/rasterio vs Rasteret processing time

Figure: Benchmark breakdown

Where The Difference Comes From

| Step | TorchGeo/rasterio path | Rasteret path |
|---|---|---|
| Index/header metadata | rasterio.open() per COG over HTTP | Pre-built Parquet collection metadata |
| Time-series read | Sequential rasterio.merge.merge() per timestep | Timesteps/bands fetched with asyncio.gather |
| HTTP per timestep | Header + pixel ranges | Pixel ranges only, because headers are cached |
| Concurrency | Mostly sequential in this benchmark path | Concurrent byte-range reads |

Index/header time means:

  • TorchGeo/rasterio: time spent opening remote files and parsing TIFF IFD metadata over HTTP.
  • Rasteret: time to read the pre-built collection index from local storage.
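
The concurrency row can be illustrated with a toy model. This is a sketch with a simulated fetch, where `asyncio.sleep` stands in for an HTTP range request; `fetch_range`, the latency value, and the byte ranges are all hypothetical and are not Rasteret's internals.

```python
import asyncio
import time

SIMULATED_LATENCY_S = 0.05  # assumed per-request latency, for illustration only


async def fetch_range(start: int, length: int) -> tuple[int, int]:
    """Stand-in for one HTTP byte-range request; returns the range it 'read'."""
    await asyncio.sleep(SIMULATED_LATENCY_S)
    return (start, start + length - 1)


async def read_sequential(ranges):
    # One request at a time: total time grows with the number of ranges.
    return [await fetch_range(s, n) for s, n in ranges]


async def read_concurrent(ranges):
    # All requests in flight together; gather returns results in input order.
    return await asyncio.gather(*(fetch_range(s, n) for s, n in ranges))


def timed(coro_fn, ranges):
    """Run an async reader to completion and return (results, seconds)."""
    t0 = time.perf_counter()
    result = asyncio.run(coro_fn(ranges))
    return list(result), time.perf_counter() - t0


ranges = [(i * 1000, 1000) for i in range(8)]
seq_result, seq_s = timed(read_sequential, ranges)
con_result, con_s = timed(read_concurrent, ranges)
print(f"sequential: {seq_s:.2f} s, concurrent: {con_s:.2f} s")
```

With eight simulated requests, the sequential reader waits roughly eight latencies while the concurrent reader waits roughly one. Real gains depend on bandwidth, server limits, and request sizes.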

Google Earth Engine / Time-Series Baseline

This separate time-series comparison measures Rasteret against Google Earth Engine and a thread-pooled rasterio path:

| Library | First run (cold) | Subsequent runs (hot) |
|---|---|---|
| Rasterio + ThreadPool | 32 s | 24 s |
| Google Earth Engine | 10-30 s | 3-5 s |
| Rasteret | 3 s | 3 s |

Figure: Single time-series request performance

Figure: 200 COG comparison

Figure: Actual analysis time

Hugging Face datasets Baseline

This benchmark compares Rasteret with image-bytes-inside-Parquet workflows using Hugging Face datasets and Major TOM-style keyed patch access.

| Patches | HF datasets parquet filters | Rasteret index + COG | Speedup |
|---|---|---|---|
| 120 | 46.83 s | 12.09 s | 3.88x |
| 1000 | 771.59 s | 118.69 s | 6.50x |

Figure: HF vs Rasteret processing time

Figure: HF vs Rasteret speedup

This does not mean storing images inside Parquet is never useful. For large cloud COG collections, though, Rasteret can leave the pixels in the published COGs and use Parquet only as the queryable index.

Cost And Scaling Views

The following figures summarize supporting cost/scaling views from the same benchmark asset set.

Figure: Environment scaling costs

Figure: AWS service-wise costs

Figure: Total VM hours

Reproducibility

# Fresh run
uv run python -m nbconvert --execute docs/tutorials/05_torchgeo_benchmark_rasteret_vs_rasterio.ipynb

# Immediate re-run
uv run python -m nbconvert --execute docs/tutorials/05_torchgeo_benchmark_rasteret_vs_rasterio.ipynb

Results vary with network conditions, instance placement, cloud credentials, and provider rate limits.
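
To capture the fresh-run and re-run timings from a script rather than by hand, the two invocations can be wrapped in a small timer. This is a sketch, assuming the command shown above is run from the repository root; `time_command` and the `--run` guard are hypothetical. It measures whole-notebook wall time, which is coarser than the per-scenario timings in the tables.

```python
import subprocess
import sys
import time


def time_command(cmd: list[str]) -> float:
    """Run cmd to completion and return wall-clock seconds."""
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return time.perf_counter() - t0


NOTEBOOK = "docs/tutorials/05_torchgeo_benchmark_rasteret_vs_rasterio.ipynb"
BENCH_CMD = ["uv", "run", "python", "-m", "nbconvert", "--execute", NOTEBOOK]

# Only launch the benchmark when explicitly asked, e.g. `python time_runs.py --run`.
if __name__ == "__main__" and "--run" in sys.argv:
    fresh_s = time_command(BENCH_CMD)   # fresh run
    rerun_s = time_command(BENCH_CMD)   # immediate re-run
    print(f"fresh: {fresh_s:.1f} s, re-run: {rerun_s:.1f} s")
```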

Why Cold Starts Matter

Every new notebook kernel, VM, Kubernetes pod, CI runner, or colleague's fresh environment starts cold. In the rasterio/GDAL path, remote COG headers are re-read to discover tile offsets and byte counts. Rasteret stores that header metadata in the collection, so repeated reads can start from the cached index and go straight to pixel byte ranges.
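
To make the step from cached header metadata to pixel byte ranges concrete: once tile offsets and byte counts are known, a pixel window maps to a set of tiles, and each tile maps to one HTTP Range header. The sketch below assumes square tiles stored in row-major order, as in a tiled TIFF; the function names and the toy offsets are hypothetical and do not reflect Rasteret's actual schema.

```python
def tiles_for_window(col_off, row_off, width, height, tile_size, tiles_across):
    """Return row-major tile indices that intersect a pixel window."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [
        r * tiles_across + c
        for r in range(first_row, last_row + 1)
        for c in range(first_col, last_col + 1)
    ]


def range_headers(tile_indices, tile_offsets, tile_byte_counts):
    """Build one HTTP Range header value per tile (the end byte is inclusive)."""
    return [
        f"bytes={tile_offsets[i]}-{tile_offsets[i] + tile_byte_counts[i] - 1}"
        for i in tile_indices
    ]


# Toy index for a 1024x1024 image with 512-pixel tiles (2 tiles across).
tile_offsets = [1000, 5000, 9000, 13000]
tile_byte_counts = [4000, 4000, 4000, 4000]

# A 256x256 window starting at column 384, row 100 straddles tiles 0 and 1.
needed = tiles_for_window(384, 100, 256, 256, tile_size=512, tiles_across=2)
print(needed)  # [0, 1]
print(range_headers(needed, tile_offsets, tile_byte_counts))
# ['bytes=1000-4999', 'bytes=5000-8999']
```

With the offsets already in hand, no request is spent on the TIFF header; every request carries pixel data.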

If you run Rasteret on bigger collections, different sensors, or production pipelines, share timings in GitHub Discussions or Discord.