rasteret.catalog

Catalog entries and the in-memory registry used by build() and rasteret datasets.

catalog

Dataset registry: spec-aligned descriptors for known COG collections.

Each :class:DatasetDescriptor is a proto-spec-descriptor that captures identity, access, and band-mapping metadata for a cloud-native GeoTIFF collection. The :class:DatasetRegistry stores them in-memory and auto-populates :class:~rasteret.constants.BandRegistry and :class:~rasteret.cloud.CloudConfig keyed by STAC collection id.

Users can register custom datasets at runtime:

import rasteret
from rasteret.catalog import DatasetDescriptor

rasteret.register(DatasetDescriptor(
    id="acme/field-survey-2024",
    name="ACME Field Survey",
    stac_api="https://acme.example.com/stac/v1",
    stac_collection="field-survey-2024",
    band_map={"RGB": "image"},
    separate_files=False,
    license="proprietary",
    license_url="https://acme.example.com/license",
))

Classes

DatasetDescriptor dataclass

DatasetDescriptor(
    id: str,
    name: str,
    description: str = "",
    stac_api: str | None = None,
    stac_collection: str | None = None,
    geoparquet_uri: str | None = None,
    column_map: dict[str, str] | None = None,
    href_column: str | None = None,
    band_index_map: dict[str, int] | None = None,
    bbox_columns: dict[str, str] | None = None,
    band_map: dict[str, str] | None = None,
    separate_files: bool = True,
    spatial_coverage: str = "",
    temporal_range: tuple[str, str] | None = None,
    requires_auth: bool = False,
    license: str = "",
    license_url: str = "",
    commercial_use: bool = True,
    static_catalog: bool = False,
    s3_credentials_url: str | None = None,
    cloud_config: dict[str, str] | None = None,
    example_bbox: tuple[float, float, float, float]
    | None = None,
    example_date_range: tuple[str, str] | None = None,
    torchgeo_class: str | None = None,
    torchgeo_verified: bool = False,
)

A dataset descriptor: identity + access + band mapping.

Proto-spec-descriptor. Each entry will migrate to YAML format when the spec ships. Fields map to spec axes:

id, name, description             -> dataset identity
stac_api, stac_collection         -> access (stac_query)
geoparquet_uri                    -> access (parquet_record_table)
band_map                          -> field roles (input bands)
spatial_coverage, temporal_range  -> coverage metadata
license, license_url,
commercial_use                    -> licensing
static_catalog                    -> static STAC catalog traversal
column_map, href_column,
band_index_map, bbox_columns      -> normalisation hints

Parameters:

Name Type Description Default
id str

Namespaced identifier (e.g. "earthsearch/sentinel-2-l2a").

required
name str

Human-readable name.

required
description str

One-liner description.

''
stac_api str

STAC API endpoint URL. For static STAC catalogs (no /search endpoint), this is the URL to the root catalog.json file and static_catalog must be True.

None
stac_collection str

STAC collection identifier. May be None for static catalogs that should be traversed from the root.

None
geoparquet_uri str

URI to a GeoParquet record table.

None
column_map dict

{source: contract} alias map for GeoParquet normalisation. Source columns are preserved; contract-name columns are added as zero-copy aliases.

None
href_column str

Column in the GeoParquet containing COG URLs. When set and assets is absent, the normalisation layer builds the assets struct from this column and band_index_map.

None
band_index_map dict

{band_code: sample_index} for multi-band COGs. Used with href_column to construct per-band asset references with band_index.

None
bbox_columns dict

{"minx": col, "miny": col, "maxx": col, "maxy": col} mapping source column names for spatial filtering on the GeoParquet index. Used by build() to construct a filter_expr so only relevant rows are enriched.

None
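Taken together, the GeoParquet hints above (column_map, href_column, band_index_map, bbox_columns) can drive a descriptor that needs no STAC API at all. A hedged sketch of the keyword arguments, where the dataset id, URI, and column names are all illustrative assumptions rather than a real rasteret dataset:

```python
# Illustrative kwargs for a GeoParquet-backed descriptor; every value here
# is an assumption -- substitute your own record table's columns.
geoparquet_kwargs = dict(
    id="acme/aerial-2023",
    name="ACME Aerial 2023",
    geoparquet_uri="s3://acme-data/aerial-2023/index.parquet",
    column_map={"acq_date": "datetime"},      # source column -> contract alias
    href_column="cog_url",                    # column holding the COG URLs
    band_index_map={"R": 1, "G": 2, "B": 3},  # band code -> sample index
    bbox_columns={"minx": "xmin", "miny": "ymin",
                  "maxx": "xmax", "maxy": "ymax"},
    separate_files=False,                     # one multi-band COG per row
)
```

These kwargs would then be passed as DatasetDescriptor(**geoparquet_kwargs) and registered via rasteret.register().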
band_map dict

Mapping of band code to STAC asset name.

None
separate_files bool

True when each band is a separate COG file (default).

True
spatial_coverage str

Geographic coverage hint (e.g. "global").

''
temporal_range tuple of str

(start, end) ISO date strings.

None
requires_auth bool

Whether credentials are needed to access the data.

False
license str

License identifier. Use the value reported by the STAC API (typically an SPDX id like "CC-BY-4.0" or "proprietary" for bespoke open-access licenses).

''
license_url str

URL to the full license text. Sourced from the STAC collection's rel=license link.

''
commercial_use bool

True (default) when the license permits commercial use. False for licenses like CC-BY-NC-4.0.

True
static_catalog bool

True when stac_api points to a static STAC catalog (a catalog.json on S3) rather than a queryable STAC API with a /search endpoint. Static catalogs are traversed with pystac.Catalog.from_file() and filtered client-side.

False
s3_credentials_url str

Endpoint for obtaining temporary S3 credentials for auth-gated datasets. When set, build() can auto-construct a backend using obstore credential providers and the user's .netrc / environment variables.

None
example_bbox tuple of float

Example bounding box (minx, miny, maxx, maxy) known to return data. Used in docs and live smoke tests.

None
example_date_range tuple of str

Example ISO date range (start, end) known to return data. Used in docs and live smoke tests.

None
cloud_config dict

Cloud provider configuration for URL resolution.

None
torchgeo_class str

Equivalent TorchGeo class name (reference only, not a dependency).

None
torchgeo_verified bool

True when the underlying data source has been confirmed to be the same files that the TorchGeo class reads.

False

DatasetRegistry

Registry of dataset descriptors. Proto-spec catalog.

Built-in datasets are registered at module import time. Users can add entries via :meth:register or the top-level :func:rasteret.register helper.

Functions
register classmethod
register(descriptor: DatasetDescriptor) -> None

Register a dataset descriptor.

Also populates :class:~rasteret.constants.BandRegistry and :class:~rasteret.cloud.CloudConfig keyed by the descriptor id so that provider-specific conventions do not collide (e.g. Planetary Computer vs Earth Search for sentinel-2-l2a).

Source code in src/rasteret/catalog.py
@classmethod
def register(cls, descriptor: DatasetDescriptor) -> None:
    """Register a dataset descriptor.

    Also populates :class:`~rasteret.constants.BandRegistry` and
    :class:`~rasteret.cloud.CloudConfig` keyed by the descriptor id so that
    provider-specific conventions do not collide (e.g. Planetary Computer
    vs Earth Search for ``sentinel-2-l2a``).
    """
    cls._descriptors[descriptor.id] = descriptor

    # Populate BandRegistry keyed by descriptor id (namespaced).
    # Skip if an entry already exists (first-write-wins).
    if descriptor.band_map:
        from rasteret.constants import BandRegistry

        if not BandRegistry.get(descriptor.id):
            BandRegistry.register(descriptor.id, descriptor.band_map)

    # Populate CloudConfig keyed by descriptor id (namespaced).
    if descriptor.cloud_config:
        from rasteret.cloud import CloudConfig

        CloudConfig.register(
            descriptor.id,
            CloudConfig(
                provider=descriptor.cloud_config.get("provider", "aws"),
                requester_pays=descriptor.cloud_config.get("requester_pays", False),
                region=descriptor.cloud_config.get("region", "us-west-2"),
                url_patterns=descriptor.cloud_config.get("url_patterns", {}),
            ),
        )
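The first-write-wins guard on band maps can be illustrated with a minimal stand-in using plain dicts (this is not the library's BandRegistry, just a sketch of the same semantics):

```python
# Stand-in for the band-map registration above: the first band_map
# registered under a dataset id wins; later writes are silently skipped.
band_registry: dict[str, dict[str, str]] = {}

def register_bands(dataset_id: str, band_map: dict[str, str]) -> None:
    # Skip if an entry already exists (first-write-wins).
    if dataset_id not in band_registry:
        band_registry[dataset_id] = band_map

register_bands("earthsearch/sentinel-2-l2a", {"B04": "red"})
register_bands("earthsearch/sentinel-2-l2a", {"B04": "nir"})  # ignored
```

Because entries are keyed by the namespaced descriptor id, two providers exposing the same STAC collection (e.g. under different namespaces) never clobber each other's band maps.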
unregister classmethod
unregister(dataset_id: str) -> DatasetDescriptor | None

Remove a descriptor from the in-memory registry.

Source code in src/rasteret/catalog.py
@classmethod
def unregister(cls, dataset_id: str) -> DatasetDescriptor | None:
    """Remove a descriptor from the in-memory registry."""
    return cls._descriptors.pop(dataset_id, None)
get classmethod
get(dataset_id: str) -> DatasetDescriptor | None

Look up a descriptor by namespaced ID.

Parameters:

Name Type Description Default
dataset_id str

Full namespaced id (e.g. "earthsearch/sentinel-2-l2a").

required
Source code in src/rasteret/catalog.py
@classmethod
def get(cls, dataset_id: str) -> DatasetDescriptor | None:
    """Look up a descriptor by namespaced ID.

    Parameters
    ----------
    dataset_id : str
        Full namespaced id (e.g. ``"earthsearch/sentinel-2-l2a"``).
    """
    return cls._descriptors.get(dataset_id)
list classmethod
list() -> list[DatasetDescriptor]

Return all registered descriptors.

Source code in src/rasteret/catalog.py
@classmethod
def list(cls) -> list[DatasetDescriptor]:
    """Return all registered descriptors."""
    return list(cls._descriptors.values())
search classmethod
search(keyword: str) -> list[DatasetDescriptor]

Search descriptors by keyword in id, name, or description.

Parameters:

Name Type Description Default
keyword str

Case-insensitive search term.

required
Source code in src/rasteret/catalog.py
@classmethod
def search(cls, keyword: str) -> list[DatasetDescriptor]:
    """Search descriptors by keyword in id, name, or description.

    Parameters
    ----------
    keyword : str
        Case-insensitive search term.
    """
    kw = keyword.lower()
    return [
        d
        for d in cls._descriptors.values()
        if kw in d.id.lower() or kw in d.name.lower() or kw in d.description.lower()
    ]
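The stand-in below mirrors the case-insensitive substring match over id, name, and description; the entries are illustrative, not the registry's real contents:

```python
# Minimal sketch of the search semantics over illustrative entries.
entries = [
    {"id": "earthsearch/sentinel-2-l2a", "name": "Sentinel-2 L2A", "description": ""},
    {"id": "acme/field-survey-2024", "name": "ACME Field Survey", "description": ""},
]

def search(keyword: str) -> list[dict[str, str]]:
    kw = keyword.lower()
    return [
        e for e in entries
        if kw in e["id"].lower() or kw in e["name"].lower()
        or kw in e["description"].lower()
    ]

hits = search("SENTINEL")  # case-insensitive match on the id
```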

Functions

load_local_descriptors

load_local_descriptors(
    path: str | Path | None = None,
) -> list[DatasetDescriptor]

Load persisted local dataset descriptors from JSON.

Invalid entries are skipped with a warning.

Source code in src/rasteret/catalog.py
def load_local_descriptors(
    path: str | Path | None = None,
) -> list[DatasetDescriptor]:
    """Load persisted local dataset descriptors from JSON.

    Invalid entries are skipped with a warning.
    """
    registry_path = _local_registry_path(path)
    if not registry_path.exists():
        return []
    try:
        payload = json.loads(registry_path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, OSError) as exc:
        logger.warning(
            "Failed to read local dataset registry %s: %s", registry_path, exc
        )
        return []

    if not isinstance(payload, list):
        logger.warning(
            "Local dataset registry %s must contain a JSON list", registry_path
        )
        return []

    descriptors: list[DatasetDescriptor] = []
    for entry in payload:
        if not isinstance(entry, dict):
            continue
        try:
            descriptors.append(DatasetDescriptor(**entry))
        except TypeError as exc:
            dataset_id = entry.get("id", "<missing-id>")
            logger.warning(
                "Skipping invalid local dataset descriptor %s: %s", dataset_id, exc
            )
    return descriptors
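The on-disk format load_local_descriptors expects is a JSON list of descriptor kwargs. A self-contained sketch of writing and reading that shape, where the path and field values are assumptions:

```python
import json
import tempfile
from pathlib import Path

# Illustrative on-disk shape for the local registry: a JSON list whose
# entries are DatasetDescriptor keyword arguments.
payload = [
    {
        "id": "acme/aerial-2023",
        "name": "ACME Aerial 2023",
        "geoparquet_uri": "file:///data/aerial-2023/index.parquet",
        "spatial_coverage": "local",
    },
]
registry_path = Path(tempfile.mkdtemp()) / "datasets.json"
registry_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")

loaded = json.loads(registry_path.read_text(encoding="utf-8"))
```

Entries whose keys do not match the DatasetDescriptor signature would be skipped with a warning rather than aborting the load.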

save_local_descriptor

save_local_descriptor(
    descriptor: DatasetDescriptor,
    path: str | Path | None = None,
) -> None

Persist a local dataset descriptor to JSON (upsert by id).

Source code in src/rasteret/catalog.py
def save_local_descriptor(
    descriptor: DatasetDescriptor,
    path: str | Path | None = None,
) -> None:
    """Persist a local dataset descriptor to JSON (upsert by id)."""
    registry_path = _local_registry_path(path)
    existing = {d.id: d for d in load_local_descriptors(registry_path)}
    existing[descriptor.id] = descriptor
    _write_local_descriptors(list(existing.values()), registry_path)

remove_local_descriptor

remove_local_descriptor(
    dataset_id: str, path: str | Path | None = None
) -> DatasetDescriptor | None

Remove one persisted local descriptor (if present).

Source code in src/rasteret/catalog.py
def remove_local_descriptor(
    dataset_id: str,
    path: str | Path | None = None,
) -> DatasetDescriptor | None:
    """Remove one persisted local descriptor (if present)."""
    registry_path = _local_registry_path(path)
    descriptors = load_local_descriptors(registry_path)
    removed: DatasetDescriptor | None = None
    kept: list[DatasetDescriptor] = []

    for descriptor in descriptors:
        if descriptor.id == dataset_id:
            removed = descriptor
        else:
            kept.append(descriptor)

    if removed is None:
        return None

    _write_local_descriptors(kept, registry_path)
    return removed

unregister_local_descriptor

unregister_local_descriptor(
    dataset_id: str, path: str | Path | None = None
) -> DatasetDescriptor | None

Unregister a local dataset from persisted and in-memory registries.

Source code in src/rasteret/catalog.py
def unregister_local_descriptor(
    dataset_id: str,
    path: str | Path | None = None,
) -> DatasetDescriptor | None:
    """Unregister a local dataset from persisted and in-memory registries."""
    persisted = remove_local_descriptor(dataset_id, path=path)

    in_memory = DatasetRegistry.get(dataset_id)
    removed_in_memory: DatasetDescriptor | None = None
    if (
        in_memory is not None
        and in_memory.geoparquet_uri
        and in_memory.spatial_coverage == "local"
    ):
        removed_in_memory = DatasetRegistry.unregister(dataset_id)

    return persisted or removed_in_memory

export_local_descriptor

export_local_descriptor(
    dataset_id: str,
    output_path: str | Path,
    path: str | Path | None = None,
) -> Path

Export one local descriptor as JSON for sharing.

Source code in src/rasteret/catalog.py
def export_local_descriptor(
    dataset_id: str,
    output_path: str | Path,
    path: str | Path | None = None,
) -> Path:
    """Export one local descriptor as JSON for sharing."""
    descriptor = next(
        (entry for entry in load_local_descriptors(path) if entry.id == dataset_id),
        None,
    )

    if descriptor is None:
        runtime_descriptor = DatasetRegistry.get(dataset_id)
        if (
            runtime_descriptor is not None
            and runtime_descriptor.geoparquet_uri
            and runtime_descriptor.spatial_coverage == "local"
        ):
            descriptor = runtime_descriptor

    if descriptor is None:
        raise KeyError(f"Local dataset '{dataset_id}' not found.")

    destination = Path(output_path).expanduser()
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_text(
        json.dumps(asdict(descriptor), indent=2, sort_keys=True) + "\n",
        encoding="utf-8",
    )
    return destination
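Since the export is a plain asdict round trip, a shared file can be re-imported by constructing the descriptor from the parsed entry. A minimal stand-in dataclass demonstrates the round trip (Desc is hypothetical, standing in for DatasetDescriptor):

```python
import json
from dataclasses import asdict, dataclass

# Two-field stand-in for DatasetDescriptor, enough to show the round trip.
@dataclass
class Desc:
    id: str
    name: str = ""

original = Desc(id="acme/aerial-2023", name="ACME Aerial 2023")

# Export side: dataclass -> dict -> stable JSON text (sorted keys).
text = json.dumps(asdict(original), indent=2, sort_keys=True) + "\n"

# Import side: parsed dict splatted back into the dataclass.
restored = Desc(**json.loads(text))
```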