
TorchGeo Integration

RasteretDataset in torchgeo.datasets is the supported integration point for using Rasteret collections in TorchGeo training pipelines. Install both packages and pass a collection directly:

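import rasteret
from torchgeo.datasets import RasteretDataset
from torchgeo.samplers import RandomGeoSampler

# Load an existing Rasteret collection (same call as in the sections below).
collection = rasteret.load("my_experiment")

dataset = RasteretDataset(collection=collection, bands=["B04", "B03", "B02"])
sampler = RandomGeoSampler(dataset, size=256, length=100)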

This page covers the underlying public boundary that RasteretDataset (and any custom GeoDataset subclass) relies on:

  • collection.to_table(...) — for building the spatial/temporal sampling index
  • collection.read_window(...) — for chip-level pixel reads

That split keeps TorchGeo dataset semantics on the TorchGeo side while Rasteret owns collection metadata, COG planning, and byte-range pixel reads.

Build A Sampling Table

import rasteret

collection = rasteret.load("my_experiment")

table = collection.to_table(
    columns=[
        "id",
        "datetime",
        "geometry",
        "proj:epsg",
        "label",
        "B08_metadata",
        "B04_metadata",
        "B03_metadata",
    ],
)

TorchGeo can turn this Arrow-native collection metadata into its own GeoPandas index, choose a sampling CRS/resolution, and keep Rasteret focused on pixel reads.

Read A Fixed Grid Window

window = collection.read_window(
    record_ids=table.column("id").to_pylist()[:1],
    bounds=(500000.0, 999000.0, 501280.0, 1000000.0),
    res=(10.0, 10.0),
    bands=["B08", "B04", "B03"],
)

read_window(...) returns a NumPy array on the exact query grid. Overlapping records are mosaicked internally with fixed-grid semantics.
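As a quick sanity check on what "the exact query grid" implies, the expected pixel dimensions follow from the bounds and resolution alone. The sketch below is plain Python and does not call Rasteret. It assumes the bounds are ordered (minx, miny, maxx, maxy) and that the array is laid out band-first, neither of which is stated on this page. `window_shape` is a hypothetical helper, not part of either library.

```python
def window_shape(bounds, res, num_bands):
    """Return (bands, height, width) for a fixed-grid window.

    Assumes bounds = (minx, miny, maxx, maxy) in CRS units and
    res = (xres, yres); the band-first layout is also an assumption.
    """
    minx, miny, maxx, maxy = bounds
    xres, yres = res
    width = round((maxx - minx) / xres)
    height = round((maxy - miny) / yres)
    return (num_bands, height, width)


# Same bounds, resolution, and band count as the read_window(...) call above.
shape = window_shape(
    bounds=(500000.0, 999000.0, 501280.0, 1000000.0),
    res=(10.0, 10.0),
    num_bands=3,
)
print(shape)  # (3, 100, 128)
```

Under those assumptions, a 1280 m by 1000 m window at 10 m resolution gives 128 columns and 100 rows per band.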

Filtering

Metadata and attribute filtering (cloud cover, date range, custom columns) belongs on the collection, before the dataset is constructed. Spatial ROI is a sampler concern: pass roi= to RandomGeoSampler. dataset.index is a public GeoDataFrame and can be sliced after construction if needed.

subset(...) covers the common cases. For anything more complex — joins, custom expressions, multi-step transforms — the collection is Arrow-native, so you can work with it in DuckDB, Polars, or pandas and wrap the result back with rasteret.as_collection(table, data_source=collection.data_source).

Use collection.subset(...) for common filters:

train = collection.subset(cloud_cover_lt=20)
table = train.to_table(
    columns=[
        "id",
        "datetime",
        "geometry",
        "proj:epsg",
        "biomass_value",
        "B04_metadata",
        "B03_metadata",
        "B02_metadata",
        "B08_metadata",
    ],
)

For adding split and label columns before this step, see Bring Your Own AOIs, Points, And Metadata. For benchmark methodology and current numbers, see Benchmarks and the TorchGeo Benchmark: Rasteret vs Native Rasterio notebook.