Changelog¶
v0.3.13¶
Added¶
- Tabular AOIs for polygon reads:
get_numpy(...),get_xarray(...), andget_gdf(...)now acceptgeometry_column=...whengeometriesis an Arrow/GeoArrow AOI table or Arrow C stream producer. - Arrow-native point-table metadata:
sample_points(...)now preserves non-geometry point-table columns in the returned Arrow table. - GeoArrow CRS inference for AOIs and points: Rasteret now uses GeoArrow CRS
metadata when available and requires
geometry_crs=...for tabular geometry inputs that do not carry CRS metadata.
Changed¶
- GeoDataFrame pixel outputs preserve AOI metadata:
get_gdf(...)joins non-geometry AOI table columns back to results bygeometry_id. - Runtime geometry docs now focus on user tools: the former enriched collection workflow guide is now Bring Your Own AOIs, Points, And Metadata.
v0.3.12¶
Fixed¶
- Planar multi-band point sampling batching:
sample_points()now batches distinct per-band tile ranges for planar COGs such as AEF while preserving per-band offsets and values. - Point sampling concurrency:
match="all"now samples matching rasters in bounded batches with one shared COG reader instead of awaiting each raster serially. - Point-aware index filtering: point sampling now uses per-point GeoArrow geometry filters instead of one broad point-envelope bbox when narrowing indexed collections.
- Indexed collection shard narrowing on remote filesystems:
source_partpruning now preserves the backing PyArrow filesystem and avoids reapplying record-index filters to already-planned scan datasets. - Source-part direct scans: indexed reads now open only the selected
source_partParquet shards when available instead of opening the full/datadataset before pruning.
v0.3.11¶
Added¶
- Native Arrow / GeoArrow Collection export:
Collectionnow implements Arrow PyCapsule protocol surfaces for stream, array, and schema export, allowing Arrow-compatible consumers to read collection metadata directly. - GeoArrow WKB footprint metadata: Rasteret now exports the
geometryfootprint column asgeoarrow.wkbwith CRS metadata set toOGC:CRS84. - Row-level raster CRS sidecar: enriched and normalized collections now carry
an Arrow-friendly
crsstring sidecar such asEPSG:32632, while preserving existingproj:epsgfor Rasteret runtime compatibility. Collection.crs: returns unique row-level raster CRS codes as strings, usingcrswhen present and falling back to legacyproj:epsg.- Arrow inputs for table builders:
build_from_table(...)now accepts Arrow-compatible in-memory objects and PyCapsule producers, including DuckDB/Polars-style Arrow stream exporters.
Changed¶
- Footprint CRS and raster CRS are now separated in Arrow exports:
GeoArrow metadata on
geometrydescribes the footprint geometry CRS (OGC:CRS84), while raster-native CRS remains row-level metadata incrsandproj:epsg. - COG enrichment writes consistent CRS sidecars: when parsed GeoTIFF header
CRS is used to backfill CRS metadata, Rasteret now keeps
crsandproj:epsgaligned. - Hugging Face streaming collections participate in Arrow export: HF-backed collections now expose enriched Arrow schemas and batches instead of appearing empty to Arrow consumers.
- Band/cloud read errors are more actionable: unresolved band requests and cloud read failures now include available asset/metadata keys, data-source mapping context, and provider-specific credential/requester-pays hints.
- Point sampling output schema is leaner:
sample_points()no longer emitscloud_coverby default, keeping the result focused on point, record, band, value, and CRS fields. - Docs now follow a reader-first path: the MkDocs nav now keeps Getting Started, Concepts, and Transitioning from Rasterio as the first linear path, with task-focused How-To Guides before notebook Tutorials.
- How-to and explanation docs were consolidated: table/Arrow workflows, collection management, catalog/cloud access, point sampling, TorchGeo, AEF, schema, interop, and benchmark pages were tightened around the current Collection APIs and CRS model.
- Benchmark docs now include the README evidence set: the benchmark page now includes the Google Earth Engine time-series table and the supporting benchmark/cost/scaling images from the docs asset set.
- AEF similarity notebook clarity: renamed intermediate variables by access path, removed the DuckDB pivot from the reference-vector step, and made each approach produce comparable match polygons.
- Record table terminology cleanup: refreshed Parquet/Arrow table examples
to prefer "record table" wording while keeping
--manifest-urlas a legacy CLI alias.
Fixed¶
- Fixed Arrow single-batch export for empty collections with non-empty schemas.
- Fixed requested-schema PyCapsule handling by importing requested Arrow schema capsules before applying compatible PyArrow casts.
- Fixed mixed-raster-CRS Arrow export so raster CRS is not incorrectly written as geometry column CRS.
- GeoDataFrame raster transforms:
get_gdf(...)rows now include the read-windowtransform, making record-wise array outputs easier to vectorize or map without reconstructing georeferencing in user code.
Tested¶
- Full test suite during Arrow interop validation:
364 passed, 42 skipped. - Verified Arrow interop with PyArrow table/reader/capsule paths.
- Verified
GeoDataFrame.from_arrow(collection)reads footprint geometry asOGC:CRS84. - Verified DuckDB 1.5.1 imports geometry as
GEOMETRY('OGC:CRS84')and keepscrsasVARCHAR. - Verified LanceDB 0.30.2 preserves
geoarrow.wkband thecrssidecar. - Verified docs with
uv run --extra docs mkdocs build --strict -d /tmp/rasteret-mkdocs-check.
v0.3.10¶
Changed¶
- AEF load alias routing:
rasteret.load("aef/v1-annual")now opens the published Source Cooperative/datacollection while keepingindex.parquetattached as the runtime sidecar index for pushdown. The publicload()implementation did not change; this is a catalog descriptor routing fix. - Examples cleanup: removed the older
ml_training_with_splits.pyscript; the split/label pattern remains documented in the ML training how-to and TorchGeo tutorial notebooks. - AEF how-to refresh: rewrote the AEF embeddings guide around the
maintained
rasteret.load("aef/v1-annual")flow and moved build/DuckDB guidance to custom-data use cases.
Fixed¶
- Collection export bbox normalization:
Collection.export()now normalizes legacy list-stylebboxcolumns through a chunk-safe struct conversion, avoidingChunkedArray/duplicatebboxfailures on export-reload workflows.
v0.3.9¶
Changed¶
- Filtered CRS auto-detection for xarray reads: internal batch iteration
now respects active collection filters when scanning
proj:epsgforget_xarray(...)auto-CRS selection. This avoids unintended reprojection on filtered subsets. - Nodata-aware integer xarray merge path: for
xr_combine="combine_first", Rasteret now prefers an integer-preserving merge path when_FillValuemetadata is present and consistent across overlapping datasets. - Per-band nodata propagation into xarray assembly: COG read nodata is
carried through RasterAccessor band results and attached as
_FillValueduring xarray construction.
Fixed¶
- AEF
get_xarraymemory blow-up regression: fixed a path where filtered reads could trigger unnecessary reprojection andfloat32upcast, causing significantly higher peak RAM than expected. - TorchGeo instant-timestamp slice regression:
RasteretGeoDataset.__getitem__now handles zero-length temporal slices (t.start == t.stop) with a closed interval so overlap checks do not drop valid records/chips.
Docs¶
- Updated
notebooks/07_aef_similarity_search.ipynbwith explicit memory notes for Franklin County / AEF 64-band runs: get_xarray: highest peak memory (~20-24 GB)get_gdf: medium peak memory (~3 GB)to_torchgeo_dataset: lower bounded-memory path (~3-5 GB)- Updated the notebook summary table to include these concrete memory ranges.
- Updated AEF notebooks to clarify that the AEF collection is already prebuilt
for 2018–2024 (Source Cooperative + Hugging Face), and that
rasteret.build()can be skipped in these workflows. - Updated parquet how-to/examples to use record table terminology and a first-run-friendly default SourceCoop Maxar record table.
Tested¶
- Verified
_detect_target_crs(...)on filtered subsets now returnsNonewhen all selected records are same-CRS. - Verified
sub.get_xarray(...)on AEF Franklin subset preservesint8dtype after merge. - Reproduced and fixed TorchGeo chip sampling regression:
from
used=0, skipped=16toused=16, skipped=0on the same Franklin sampler workload.
v0.3.8 - Hotfix¶
- Fix catalog.py to use sourcecoop and HF AEF index.parquet instead of reading large collection parquets
v0.3.7¶
Changed¶
- TorchGeo adapter semantic alignment:
Collection.to_torchgeo_dataset(...)now follows TorchGeoRasterDatasetquery behavior more closely. time_series=True: applies temporal overlap from the sampler/query slice, then stacks selected records into[T, C, H, W].time_series=False: mosaics overlapping records on the query grid with first-record precedence and nodata-aware filling.- TorchGeo docs clarity: interop and reference docs now describe the updated time-series and non-time-series slice behavior.
Tested¶
- Expanded
test_torchgeo_error_propagationwith semantic coverage for: - query-temporal filtering in
time_series=True - overlapping-record mosaicking in
time_series=False - Updated
test_torchgeo_networkwording to match the query-slice semantics.
v0.3.6¶
Added¶
- Bounded nodata fallback for
sample_points(): point sampling can now search outward from the base pixel under the input point usingmax_distance_pixels, measured in Chebyshev distance (square rings). - Neighbourhood window output for
sample_points():return_neighbourhoodnow supports: "off": no neighbourhood column"always": always return the full searched window"if_center_nodata": return the window only when the base pixel is nodata/NaN
Changed¶
- Point sampling semantics are now explicit and bounded:
sample_points()still samples the pixel containing the point first, but when that base pixel is nodata/NaN andmax_distance_pixels > 0, Rasteret searches outward ring by ring and chooses the closest valid candidate by exact point-to-pixel-rectangle distance. - Neighbourhood output schema: when
return_neighbourhood != "off",sample_points()returns a nullableneighbourhood_valueslist column in row-major order. With"if_center_nodata", rows whose base pixel is valid keepneighbourhood_values = NULL. - Neighbourhood mode validation: neighbourhood output now requires
max_distance_pixels > 0so the requested window size is always explicit.
Tested¶
- Expanded
test_executionfor bounded nodata fallback, exact-distance winner selection, cross-tile fallback, neighbourhood window output, NULL-window behavior for"if_center_nodata", and empty-result schema stability. - Expanded
test_public_api_surfacefor the newsample_points()arguments.
v0.3.5¶
Added¶
- Descriptor-backed dual-surface planning: catalog entries can now
describe separate record-table, index, and collection surfaces via
record_table_uri,index_uri, andcollection_uri. This lets Rasteret keep build-time metadata sources separate from published runtime collections. Collection.head(): first-class metadata preview API that uses the narrow record index when available instead of forcing a wide collection scan.- Internal Parquet read planner:
ParquetReadPlannerkeeps index-side and wide-scan filter state coherent acrosssubset(),where(),len(),head(), pixel reads, and TorchGeo entry points. - Hugging Face streaming runtime:
HFStreamingSourceand batch iterators now provide a stable metadata path forhf://datasets/...collections without routing through the olderdatasetsstreaming shutdown path.
Changed¶
- Index-first runtime filtering: when a collection has both a narrow index and a wide read-ready surface, Rasteret now plans metadata filters against the index first, then carries compatible predicates into the wide scan.
- GeoParquet bbox contract: bbox filtering now targets the canonical
bboxstruct (xmin,ymin,xmax,ymax) instead of older scalar bbox columns. Catalog tests, CLI fixtures, and filter tests were updated to match the GeoParquet 1.1 shape. - TorchGeo filter propagation:
to_torchgeo_dataset()now respects collection filters and geometry / bbox narrowing before chip sampling starts, reducing unnecessary raster candidates for gridded reads. - Local Parquet metadata reuse: local datasets prefer a
_metadatasidecar when present, and Rasteret now keeps an in-process Parquet dataset cache to reduce repeated footer/schema setup during interactive sessions. - Catalog surface reporting: CLI output and descriptor helpers now surface
record-table, index, and collection paths explicitly instead of collapsing
everything into
geoparquet_uri.
Fixed¶
- Hugging Face filter execution: Arrow expression handling on HF-backed metadata batches now fails more clearly and avoids the unstable runtime path that could crash during shutdown or long scans.
- AEF runtime aliasing: the built-in AEF descriptor now points at the published Terrafloww collection/index surfaces with explicit field-role and filter-capability hints.
- Read-path correctness on filtered collections:
len(),head(),sample_points(), TorchGeo reads, and other collection-backed reads now stay aligned when filters are staged across both record-index and wide-data surfaces.
Tested¶
- Expanded
test_collection_filtersfor bbox-struct filtering and TorchGeo prefilter behavior. - Expanded
test_catalog,test_cli, andtest_public_api_surfacefor the new descriptor surface roles and runtime load/build paths. - Expanded
test_huggingface,test_execution, andtest_torchgeo_adapterfor the new HF runtime path, filter propagation, and collection-read behavior.
v0.3.4¶
Added¶
- HuggingFace integration:
rasteret.load("hf://datasets/<org>/<repo>")resolves remote Parquet shards viahuggingface_huband loads them as a Collection. No local clone or download step needed. - AEF v1 Annual Rasteret Collection: prebuilt read-ready collection published on HuggingFace and Source Cooperative. 235K tiles, 64-band int8 embeddings, global coverage 2017–2024.
- AEF similarity search tutorial (
notebooks/07_aef_similarity_search): end-to-end embedding similarity search usingsample_points,get_xarray,get_gdf, and TorchGeoGridGeoSampler. Demonstrates DuckDB Arrow-native pivot for reference vectors and lonboard GPU-accelerated visualization.
Changed¶
- Unified tile engine:
RasterAccessorconsolidated into a single_read_tile()code path shared byget_xarray,get_numpy,get_gdf, and TorchGeo__getitem__. Four divergent tile-read implementations replaced with one. - COGReader simplified: removed duplicate decode paths, tightened the fetch → decompress → crop pipeline, reduced code surface by ~200 lines.
- TorchGeo edge-chip hardening: empty-read validation returns nodata-filled tensors for chips outside coverage; positive-overlap filtering skips false bbox-only candidates; fallback loop fills chips with nodata when all candidate tiles fail instead of crashing the DataLoader.
- Error surfacing:
get_gdfandget_numpywarn on partial band/geometry/record failures instead of returning silent empty results. Point geometry AOIs raiseUnsupportedGeometryErrorpointing tosample_points(). Exception chaining (raise ... from e) throughout the read pipeline.
Fixed¶
sample_pointsCOG read path: tile metadata validation was skipping valid tiles when the source raster had multiple matching records.xr_combineparameter now correctly plumbed throughget_xarray().
Tested¶
test_torchgeo_error_propagation: edge-chip, empty-read, and fallback loop coverage.test_huggingface: URI resolution and table loading mocks.test_public_api_surface: validates all public imports fromrasteret.test_public_network_smoke: live AEF reads and point sampling parity againstrasterio.sample().- Extended
test_executionwithget_gdferror path coverage.
v0.3.3¶
Performance¶
- Arrow-batch native point sampling:
sample_points()internals now use vectorized NumPy gathers andTable.from_batchesinstead of per-sample Python append +pa.concat_tables. Eliminates row materialization in the hot loop. - COGReader session reuse:
read_cog()accepts a sharedreader=parameter. Point sampling across multiple rasters now reuses a singleCOGReader(and its HTTP/2 connection pool) instead of creating one per raster, reducing connection overhead for multi-scene workloads.
Refactored¶
- Dedicated
point_samplingmodule: point sampling ownership moved fromexecution.pytorasteret.core.point_sampling.execution.pystays focused on area/chip reads (get_xarray,get_numpy,get_gdf). POINT_SAMPLES_SCHEMAdefined intypes.pyas the single source of truth for point sample output columns. Nullable fields explicitly modeled.- Point input helpers consolidated:
ensure_point_geoarrow,candidate_point_indices_for_raster, and related helpers moved intogeometry.py. - Strict tile/source alignment guard in
RasterAccessor.sample_points: validates that tile metadata matches the source raster before sampling.
Changed¶
- Landing page (
docs/index.md) code example simplified: single PyArrow import, cleanersample_pointscall, HF benchmark collapsed into admonition.
v0.3.2¶
Added¶
Collection.sample_points(): first-class point sampling API returning apyarrow.Table(point_index,record_id,datetime,band,value, CRS columns, and metadata fields). Supportsmatch="all"andmatch="latest".- New guide: Point Sampling and Masking.
Changed¶
- Masking controls surfaced at Collection level:
get_xarray(),get_gdf(), andget_numpy()now acceptall_touched=...directly. - Table-native point sampling inputs:
Collection.sample_points(...)now accepts Arrow tables and common dataframe/relation inputs withx_column/y_columnorgeometry_column(WKB/GeoArrow/shapely point column). - Point sampling aligns to rasterio
sample()semantics (nearest-pixel index math) for deterministic parity on real COGs. - Core read internals are more Arrow-native:
iterate_rasters()now scans columnar batches (nobatch.to_pylist()in the core read iterator),infer_data_source()uses filteredscanner.head(1),- multi-CRS detection in xarray path streams
proj:epsgcounts per batch, sample_points()uses vectorizedgeoarrow.point_coords.- Error messaging for non-OGC binary geometry input now explicitly points
DuckDB users to
ST_AsWKB(geom)when needed. - Network parity coverage is tighter:
- AOI windowing now matches rasterio
geometry_window()edge semantics exactly (fixes the WorldCover 1-pixel mismatch), - transient STAC API timeouts are retried during live builds,
- the AEF south-up TorchGeo oracle path is corrected for manual/explicit
parity runs via
WarpedVRT.
Tested¶
- Public network smoke:
test_public_network_smoke.pypasses with--network, including point sampling parity againstrasterio.sample(). - Network smoke (
test_network_smoke.py) passes for available providers (expected skips remain for missing optional extras/credentials).
Packaging¶
sedonadb>=0.2.0added to theexamplesextra for Arrow-native point workflow examples.
v0.3.1¶
Added¶
as_collection(): lightweight re-entry from an in-memory Arrow table or dataset back into a Collection. Validates contract columns and COG band metadata structs without re-running ingest, enrichment, or persistence. Use this after enriching a Collection's table with DuckDB, Polars, PyArrow, or any other tool. See Enriched Collection Workflows.
Changed¶
- Lifecycle docs: all entry points (
build*,load,as_collection) now cross-reference each other in docstrings and how-to guides. Contributing guide updated with four-layer architecture (Build → Query → Read → Re-entry). - Major TOM on-the-fly example uses
as_collection()+ explicitexport()instead ofbuild_from_table(enrich_cog=False).enrich_major_tom_columns()now preservesyear/monthpartition columns from the base Collection.
v0.3.0¶
Highlights¶
- License changed from AGPL-3.0-only to Apache-2.0.
- Dataset catalog:
build()with 12 pre-registered datasets across Earth Search, Planetary Computer, and AlphaEarth Foundation. Catalog entries can point to a STAC API or a GeoParquet file.register_local()for adding your own. build_from_stac()andbuild_from_table(): build a Collection from any STAC API or any Parquet/GeoParquet file with COG URLs (Source Cooperative exports, STAC GeoParquet, custom catalogs). No STAC API required for the table path. Optionalenrich_cog=Trueparses COG headers for accelerated reads.- Multi-cloud support: S3, Azure Blob, and GCS routing via URL auto-detection, with automatic fallback to anonymous access. obstore replaces direct aiohttp as the HTTP transport, adding unified multi-cloud routing.
create_backend()for authenticated reads with credential providers (e.g., Planetary Computer SAS tokens).- TorchGeo adapter:
collection.to_torchgeo_dataset()returns aGeoDatasetbacked by Rasteret's async COG reader. Supportstime_series=True([T, C, H, W]output),label_fieldfor per-sample labels,target_crsfor cross-CRS reprojection,allow_resample=Truefor mixed-resolution bands, andis_image=Falsefor mask-style datasets. Works with all TorchGeo samplers, collation helpers, and dataset composition (IntersectionDataset,UnionDataset). - Native dtype preservation: COG tiles return in their source dtype (uint16, int8, float32, etc.) instead of forcing float32.
- Rasterio-aligned masking: AOI reads default to
all_touched=Falseand fill outside-coverage pixels withnodatawhen present, otherwise0.read_cogreturns avalid_mask. - rioxarray removed: CRS encoding uses pyproj CF conventions directly.
The
xarrayextra no longer pulls rioxarray. - Extended TIFF header parsing: nodata, SamplesPerPixel, PlanarConfiguration, PhotometricInterpretation, ExtraSamples, GeoDoubleParams CRS support.
- Multi-CRS auto-reprojection: queries spanning multiple UTM zones
reproject to the most common CRS using GDAL's
calculate_default_transform. get_numpy(): lightweight NumPy output path returning[N, H, W](single band) or[N, C, H, W](multi-band) arrays. No extra dependencies beyond NumPy. Accepts bbox tuples, Arrow arrays, Shapely objects, or raw WKB.get_gdf(): GeoDataFrame output path for analysis workflows.- Enriched Parquet workflows: append arbitrary columns (splits, labels, AOI polygons, model scores) to a Collection's Parquet, query with DuckDB/PyArrow, and fetch pixels for matching rows on demand. See Enriched Collection Workflows.
- Major TOM on-the-fly: example workflow rebuilding Major TOM-style
patch-grid semantics from source Sentinel-2 COGs instead of
payload-in-Parquet. Benchmarked 3.9-6.5x faster than HF
datasetsParquet-filter reads. earthdataoptional extra:pip install rasteret[earthdata]for NASA Earthdata auto-credential detection.
Collection API¶
- Output paths:
get_xarray(),get_numpy(),get_gdf(),to_torchgeo_dataset(). All share the same async tile I/O underneath. - Inspection:
.bands,.bounds,.epsg,len(),__repr__(),.describe(),.compare_to_catalog(). - Filtering:
collection.subset(cloud_cover_lt=..., date_range=..., bbox=..., geometries=..., split=...)andcollection.where(expr)for raw Arrow expressions. - Sharing:
collection.export("path/")writes a portable copy;rasteret.load("path/")reloads it.list_collections()discovers cached collections in the workspace. - Three-tier schema: required columns (
id,datetime,geometry,assets), COG acceleration columns (per-band tile offsets and metadata), and user-extensible columns (split,label,cloud_cover, custom metadata). See Schema Contract. - Public exports:
Collection,CloudConfig,BandRegistry,DatasetDescriptor,DatasetRegistryare all importable fromrasteret.
Other changes¶
- Arrow-native geometry internals (GeoArrow replaces Shapely in hot paths).
- obstore as base dependency (multi-cloud support).
- CLI:
rasteret collections build|list|info|delete|import,rasteret datasets list|info|build|register-local|export-local|unregister-local. - TorchGeo
time_series=Trueuses spatial-only intersection, matching TorchGeo's ownRasterDatasetbehaviour where all spatially overlapping records are stacked regardless of the sampler's time slice. - Cloud workspace URIs (e.g.
s3://bucket/path) are preserved correctly inCollectionBuilderbase class.
Tested¶
- All four output paths (xarray, GeoDataFrame, NumPy, TorchGeo) tested against direct rasterio reads across 12 datasets (Sentinel-2, Landsat, NAIP, Copernicus DEM, ESA WorldCover, AEF, and more).
- TorchGeo adapter verified against the full GeoDataset contract:
IntervalIndex, samplers, collation, dataset composition, cross-CRS reprojection, and export/reload roundtrips.
Requirements¶
- Python 3.12+ required.
rasterio>=1.4.3,<1.5.0is a core dependency (used for geometry masking, CRS reprojection, and TorchGeo query-grid placement; not in the tile-read path).
Breaking changes¶
get_xarray()returns data in native COG dtype instead of always float32. Code that assumed float32 output may need adjustment.- The
xarrayextra no longer installs rioxarray. If you depend onds.rio.*methods, install rioxarray separately.