Collection Management¶
Learn how to manage a Rasteret Collection and its local lifecycle: build/import, inspect, export, reload, filter, or delete.
For first-time data ingest details, use:
- Build from Parquet and Arrow Tables for external Parquet, GeoParquet, DuckDB, Polars, or Arrow record tables.
- Dataset Catalog for built-in dataset IDs and local catalog registration.
- Custom Cloud Provider for requester-pays, authenticated buckets, URL rewriting, and custom backends.
What Gets Managed¶
A Rasteret Collection is an Arrow/Parquet table of raster records. It stores the metadata Rasteret needs for fast reads, including record IDs, footprints, assets, CRS sidecars, and optional COG header metadata. Pixel bytes stay in the source COGs.
The default local workspace is ~/rasteret_workspace.
Rasteret uses two suffixes for cached collection directories:
| Suffix | Created by |
|---|---|
| _stac | build(), build_from_stac(), rasteret collections build |
| _records | build_from_table(), rasteret collections import |
You can pass workspace_dir=... in Python or --workspace-dir ... in the CLI
when you want a different location.
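As a rough illustration of the naming convention above, the following sketch scans a workspace directory and labels each cache directory by its suffix. Only the `_stac` and `_records` suffixes come from the table; the helper names `cache_origin` and `scan_workspace` and the label strings are hypothetical and are not part of Rasteret's API.

```python
from pathlib import Path
from typing import Dict, Optional

# The two suffixes come from the table above; the labels are illustrative.
CACHE_SUFFIXES = {
    "_stac": "STAC-backed build",
    "_records": "record-table import",
}


def cache_origin(directory_name: str) -> Optional[str]:
    """Return a label for a cache directory name, or None if the suffix is unknown."""
    for suffix, origin in CACHE_SUFFIXES.items():
        if directory_name.endswith(suffix):
            return origin
    return None


def scan_workspace(workspace: Path) -> Dict[str, str]:
    """Map every recognised cache directory in `workspace` to its origin label."""
    found = {}
    for entry in sorted(workspace.iterdir()):
        # Plain files and directories without a known suffix are skipped.
        if entry.is_dir():
            origin = cache_origin(entry.name)
            if origin is not None:
                found[entry.name] = origin
    return found
```

Calling `scan_workspace(Path.home() / "rasteret_workspace")` returns one entry per directory whose name ends in a known suffix and skips everything else. It raises `FileNotFoundError` if the workspace does not exist yet.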
CLI Workflow¶
Build or refresh a STAC-backed collection cache:
rasteret collections build bangalore \
  --stac-api https://earth-search.aws.element84.com/v1 \
  --collection sentinel-2-l2a \
  --bbox 77.55,13.01,77.58,13.08 \
  --date-range 2024-01-01,2024-06-30
Import an existing Parquet or GeoParquet record table with rasteret collections import, which materializes a local collection from the record table. If
your source table still needs COG header enrichment for pixel reads, use the
Python build_from_table(..., enrich_cog=True) path instead.
Inspect local caches with rasteret collections list and rasteret collections info.
Delete a local cache with rasteret collections delete.
Use --json on build, import, list, and info when scripting. Use
--yes with delete to skip the confirmation prompt.
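When driving the CLI from a Python script, it can help to assemble the argument list in one place so the flag rules above are enforced consistently. The sketch below is a hypothetical wrapper, not something Rasteret ships: `collections_argv` and `run_json` are made-up names, and it assumes that the collection name is passed positionally (as in the build example) and that `--json` prints a single JSON document to stdout.

```python
import json
import subprocess
from typing import Any, List, Optional

# Subcommands for which the text above documents --json.
JSON_ACTIONS = {"build", "import", "list", "info"}


def collections_argv(
    action: str,
    *args: str,
    json_output: bool = False,
    yes: bool = False,
    workspace_dir: Optional[str] = None,
) -> List[str]:
    """Build an argument list for `rasteret collections <action>`."""
    if json_output and action not in JSON_ACTIONS:
        raise ValueError(f"--json is not documented for {action!r}")
    if yes and action != "delete":
        raise ValueError("--yes only applies to delete")
    argv = ["rasteret", "collections", action, *args]
    if workspace_dir is not None:
        argv += ["--workspace-dir", workspace_dir]
    if json_output:
        argv.append("--json")
    if yes:
        argv.append("--yes")
    return argv


def run_json(argv: List[str]) -> Any:
    """Run a command and parse its stdout as JSON (assumed output format)."""
    completed = subprocess.run(argv, check=True, capture_output=True, text=True)
    return json.loads(completed.stdout)
```

For instance, `run_json(collections_argv("list", json_output=True))` would run the list command and hand back whatever JSON it prints. The shape of that JSON is not specified here, so inspect it before relying on particular keys.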
Python Entry Points¶
Choose the entry point based on what you already have:
| Situation | Use |
|---|---|
| Registered catalog dataset | rasteret.build("catalog/id", ...) |
| Custom STAC API | rasteret.build_from_stac(...) |
| External Parquet/GeoParquet/Arrow record table | rasteret.build_from_table(...) |
| Existing exported collection artifact | rasteret.load(path) |
| In-memory read-ready Arrow object | rasteret.as_collection(obj) |
Example:
import rasteret

collection = rasteret.build(
    "earthsearch/sentinel-2-l2a",
    name="bangalore",
    bbox=(77.55, 13.01, 77.58, 13.08),
    date_range=("2024-01-01", "2024-06-30"),
)
When name is provided, Rasteret can cache the built collection in the default
workspace with the given name as the folder name.
Pass force=True when you want to rebuild an existing cache.
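The reuse-unless-forced behaviour that force=True implies can be pictured with a small, library-independent sketch. This is a generic caching pattern under the assumption that an existing cache directory is reused by default; it is not Rasteret's actual implementation, and `load_or_build` with its `build` and `load` callables is hypothetical.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(
    cache_dir: Path,
    build: Callable[[Path], T],
    load: Callable[[Path], T],
    force: bool = False,
) -> T:
    """Reuse an existing cache directory unless `force` asks for a rebuild."""
    if cache_dir.exists() and not force:
        return load(cache_dir)  # cache hit: skip the expensive build
    return build(cache_dir)  # cache miss, or an explicit rebuild
```

The point of the pattern is that repeated runs of the same script pay the build cost once, while `force=True` remains available when the source data or query has changed.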
Inspect A Collection¶
Use describe() for a human-readable summary:
The returned object is also programmatic:
Common quick checks:
If the collection came from a registered catalog dataset, compare it to the source descriptor:
This is useful when you built a date or AOI subset and want to see how it relates
to the full catalog entry. If there is no matching catalog descriptor, use
collection.describe() instead.
Export And Reload¶
Export a portable collection artifact:
Reload it later with rasteret.load(path), passing the path of the exported artifact.
The export contains the collection metadata table, labels/splits you added, COG header metadata, and asset references. It does not copy the source raster pixel bytes.
List Cached Collections¶
from pathlib import Path

from rasteret import Collection

for item in Collection.list_collections(Path.home() / "rasteret_workspace"):
    print(item["name"], item["kind"], item["size"])
Or list them from the CLI with rasteret collections list.
Each item includes fields such as name, kind, data_source, date_range,
size, and created.
Filter Metadata Before Reading¶
Use subset() for common filters:
train = collection.subset(
    cloud_cover_lt=15,
    date_range=("2024-03-01", "2024-04-30"),
    bbox=(77.55, 13.03, 77.57, 13.06),
    split="train",
)
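To make the meaning of the combined filters concrete, here is a plain-Python sketch of how such a call could be read. It assumes the filters are combined with AND, that the date bounds are inclusive, and that bbox selects records whose footprint bounds intersect the box. None of this is confirmed by Rasteret's documentation here, and the record keys (`cloud_cover`, `date`, `bbox`, `split`) and the `matches` helper are hypothetical.

```python
from datetime import date
from typing import Any, Dict, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


def bboxes_intersect(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def matches(
    record: Dict[str, Any],
    cloud_cover_lt: Optional[float] = None,
    date_range: Optional[Tuple[str, str]] = None,
    bbox: Optional[BBox] = None,
    split: Optional[str] = None,
) -> bool:
    """Apply every supplied filter; a record must pass all of them (assumed AND)."""
    if cloud_cover_lt is not None and not record["cloud_cover"] < cloud_cover_lt:
        return False
    if date_range is not None:
        start, end = (date.fromisoformat(d) for d in date_range)
        if not start <= record["date"] <= end:  # inclusive bounds (assumed)
            return False
    if bbox is not None and not bboxes_intersect(record["bbox"], bbox):
        return False
    if split is not None and record["split"] != split:
        return False
    return True
```

Under these assumptions, a record with 5% cloud cover dated 2024-03-15 whose bounds overlap the box and whose split is "train" passes, while changing any one of those properties to fall outside its filter excludes it. Filters left as `None` are simply not applied.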
Use where() for custom Arrow dataset expressions:
Then read pixels from the filtered collection:
For labels, splits, AOI joins, point tables, and other business metadata, see Bring Your Own AOIs, Points, And Metadata.