Benchmarks¶

This page records benchmark results for Rasteret's index-first reads against TorchGeo/rasterio, Google Earth Engine, and Hugging Face datasets baselines. Treat the exact numbers as environment-specific; the useful signal is where time is spent in each workflow.

The TorchGeo comparison follows the workflow in docs/tutorials/05_torchgeo_benchmark_rasteret_vs_rasterio.ipynb.

Environment: Ubuntu Linux, Python 3.13, us-west-2 EC2 instance.

Data source: Sentinel-2 L2A via Earth Search v1, with COGs on S3 in us-west-2.

Controlled variables: same scenes, AOIs, sampler, DataLoader, and stack_samples collate path. Both paths output the same [batch, T, C, H, W] tensor shape.

TorchGeo runs with GDAL settings based on Pangeo COG best practices:

GDAL_DISABLE_READDIR_ON_OPEN = "EMPTY_DIR"
AWS_NO_SIGN_REQUEST = "YES"
GDAL_MAX_RAW_BLOCK_CACHE_SIZE = "200000000"
GDAL_SWATH_SIZE = "200000000"
VSI_CURL_CACHE_SIZE = "200000000"

TorchGeo / rasterio Baseline¶

TorchGeo's native path reads remote COGs through GDAL/rasterio. Rasteret reads the pre-built collection metadata and fetches only pixel byte ranges.

Cold start:

Scenario	TorchGeo/rasterio	Rasteret	Speedup	Shape
Single AOI, 15 scenes	9.08 s	1.14 s	8.0x	`[2, 15, 1, 256, 256]`
Multi-AOI, 30 scenes	42.05 s	2.25 s	18.7x	`[4, 30, 1, 256, 256]`
Cross-CRS, 12 scenes	12.47 s	0.59 s	21.3x	`[2, 12, 1, 256, 256]`

Warm cache, immediate re-run:

Scenario	TorchGeo/rasterio	Rasteret	Speedup	Shape
Single AOI, 15 scenes	9.14 s	0.81 s	11.3x	`[2, 15, 1, 256, 256]`
Multi-AOI, 30 scenes	29.68 s	2.60 s	11.4x	`[4, 30, 1, 256, 256]`
Cross-CRS, 12 scenes	3.61 s	1.06 s	3.4x	`[2, 12, 1, 256, 256]`

TorchGeo/rasterio vs Rasteret processing time

Benchmark breakdown

Where The Difference Comes From¶

Step	TorchGeo/rasterio path	Rasteret path
Index/header metadata	`rasterio.open()` per COG over HTTP	Pre-built Parquet collection metadata
Time-series read	Sequential `rasterio.merge()` per timestep	Timesteps/bands fetched with `asyncio.gather`
HTTP per timestep	Header + pixel ranges	Pixel ranges, because headers are cached
Concurrency	Mostly sequential in this benchmark path	Concurrent byte-range reads

Index/header time means:

TorchGeo/rasterio: time spent opening remote files and parsing TIFF IFD metadata over HTTP.
Rasteret: time to read the pre-built collection index from local storage.

Google Earth Engine / Time-Series Baseline¶

This separate time-series comparison measures Rasteret against Google Earth Engine and a thread-pooled rasterio path:

Library	First run (cold)	Subsequent runs (hot)
Rasterio + ThreadPool	32 s	24 s
Google Earth Engine	10-30 s	3-5 s
Rasteret	3 s	3 s

Single time-series request performance

200 COG comparison

Actual analysis time

Hugging Face `datasets` Baseline¶

This benchmark compares Rasteret with image-bytes-inside-Parquet workflows using Hugging Face datasets and Major TOM-style keyed patch access.

Patches	HF `datasets` parquet filters	Rasteret index + COG	Speedup
120	46.83 s	12.09 s	3.88x
1000	771.59 s	118.69 s	6.50x

HF vs Rasteret processing time

HF vs Rasteret speedup

The point is not that images-inside-Parquet is never useful. It is that for large cloud COG collections, Rasteret can keep pixels in the published COGs and use Parquet as the queryable index.

Cost And Scaling Views¶

The following figures summarize supporting cost/scaling views from the same benchmark asset set.

Environment scaling costs

AWS service-wise costs

Total VM hours

Reproducibility¶

# Fresh run
uv run python -m nbconvert --execute docs/tutorials/05_torchgeo_benchmark_rasteret_vs_rasterio.ipynb

# Immediate re-run
uv run python -m nbconvert --execute docs/tutorials/05_torchgeo_benchmark_rasteret_vs_rasterio.ipynb

Results vary with network conditions, instance placement, cloud credentials, and provider rate limits.

Why Cold Starts Matter¶

Every new notebook kernel, VM, Kubernetes pod, CI runner, or colleague's fresh environment starts cold. In the rasterio/GDAL path, remote COG headers are re-read to discover tile offsets and byte counts. Rasteret stores that header metadata in the collection, so repeated reads can start from the cached index and go straight to pixel byte ranges.

If you run Rasteret on bigger collections, different sensors, or production pipelines, share timings in GitHub Discussions or Discord.