rasteret.catalog

Catalog entries and the in-memory registry used by build() and rasteret datasets.

catalog

Dataset registry: spec-aligned descriptors for known COG collections.

Each :class:DatasetDescriptor is a proto-spec-descriptor that captures identity, access, and band-mapping metadata for a cloud-native GeoTIFF collection. The :class:DatasetRegistry stores them in-memory and auto-populates :class:~rasteret.constants.BandRegistry and :class:~rasteret.cloud.CloudConfig keyed by STAC collection id.

Users can register custom datasets at runtime:

import rasteret
from rasteret.catalog import DatasetDescriptor

rasteret.register(DatasetDescriptor(
    id="acme/field-survey-2024",
    name="ACME Field Survey",
    stac_api="https://acme.example.com/stac/v1",
    stac_collection="field-survey-2024",
    band_map={"RGB": "image"},
    separate_files=False,
    license="proprietary",
    license_url="https://acme.example.com/license",
))

Classes

DatasetDescriptor dataclass

DatasetDescriptor(
    id: str,
    name: str,
    description: str = "",
    stac_api: str | None = None,
    stac_collection: str | None = None,
    geoparquet_uri: str | None = None,
    column_map: dict[str, str] | None = None,
    href_column: str | None = None,
    band_index_map: dict[str, int] | None = None,
    bbox_columns: dict[str, str] | None = None,
    band_map: dict[str, str] | None = None,
    separate_files: bool = True,
    spatial_coverage: str = "",
    temporal_range: tuple[str, str] | None = None,
    requires_auth: bool = False,
    license: str = "",
    license_url: str = "",
    commercial_use: bool = True,
    static_catalog: bool = False,
    s3_credentials_url: str | None = None,
    cloud_config: dict[str, str] | None = None,
    example_bbox: tuple[float, float, float, float]
    | None = None,
    example_date_range: tuple[str, str] | None = None,
    torchgeo_class: str | None = None,
    torchgeo_verified: bool = False,
)

A dataset descriptor: identity + access + band mapping.

Proto-spec-descriptor. Each entry will migrate to YAML format when the spec ships. Fields map to spec axes:

id, name, description             -> dataset identity
stac_api, stac_collection         -> access (stac_query)
geoparquet_uri                    -> access (parquet_record_table)
band_map                          -> field roles (input bands)
spatial_coverage, temporal_range  -> coverage metadata
license, license_url,
commercial_use                    -> licensing
static_catalog                    -> static STAC catalog traversal
column_map, href_column,
band_index_map, bbox_columns      -> normalisation hints

Parameters:

Name Type Description Default
id str

Namespaced identifier (e.g. "earthsearch/sentinel-2-l2a").

required
name str

Human-readable name.

required
description str

One-liner description.

''
stac_api str

STAC API endpoint URL. For static STAC catalogs (no /search endpoint), this is the URL to the root catalog.json file and static_catalog must be True.

None
stac_collection str

STAC collection identifier. May be None for static catalogs that should be traversed from the root.

None
geoparquet_uri str

URI to a GeoParquet record table.

None
column_map dict

{source: contract} alias map for GeoParquet normalisation. Source columns are preserved; contract-name columns are added as zero-copy aliases.

None
href_column str

Column in the GeoParquet containing COG URLs. When set and assets is absent, the normalisation layer builds the assets struct from this column and band_index_map.

None
band_index_map dict

{band_code: sample_index} for multi-band COGs. Used with href_column to construct per-band asset references with band_index.

None
bbox_columns dict

{"minx": col, "miny": col, "maxx": col, "maxy": col} mapping source column names for spatial filtering on the GeoParquet index. Used by build() to construct a filter_expr so only relevant rows are enriched.

None
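Taken together, the GeoParquet hints above (column_map, href_column, band_index_map, bbox_columns) can drive a descriptor that needs no STAC API at all. A hedged sketch of the keyword arguments, where the dataset id, URI, and column names are all illustrative assumptions rather than a real rasteret dataset:

```python
# Illustrative kwargs for a GeoParquet-backed descriptor; every value here
# is an assumption -- substitute your own record table's columns.
geoparquet_kwargs = dict(
    id="acme/aerial-2023",
    name="ACME Aerial 2023",
    geoparquet_uri="s3://acme-data/aerial-2023/index.parquet",
    column_map={"acq_date": "datetime"},      # source column -> contract alias
    href_column="cog_url",                    # column holding the COG URLs
    band_index_map={"R": 1, "G": 2, "B": 3},  # band code -> sample index
    bbox_columns={"minx": "xmin", "miny": "ymin",
                  "maxx": "xmax", "maxy": "ymax"},
    separate_files=False,                     # one multi-band COG per row
)
```

These kwargs would then be passed as DatasetDescriptor(**geoparquet_kwargs) and registered via rasteret.register().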
band_map dict

Mapping of band code to STAC asset name.

None
separate_files bool

True when each band is a separate COG file (default).

True
spatial_coverage str

Geographic coverage hint (e.g. "global").

''
temporal_range tuple of str

(start, end) ISO date strings.

None
requires_auth bool

Whether credentials are needed to access the data.

False
license str

License identifier. Use the value reported by the STAC API (typically an SPDX id like "CC-BY-4.0" or "proprietary" for bespoke open-access licenses).

''
license_url str

URL to the full license text. Sourced from the STAC collection's rel=license link.

''
commercial_use bool

True (default) when the license permits commercial use. False for licenses like CC-BY-NC-4.0.

True
static_catalog bool

True when stac_api points to a static STAC catalog (a catalog.json on S3) rather than a queryable STAC API with a /search endpoint. Static catalogs are traversed with pystac.Catalog.from_file() and filtered client-side.

False
s3_credentials_url str

Endpoint for obtaining temporary S3 credentials for auth-gated datasets. When set, build() can auto-construct a backend using obstore credential providers and the user's .netrc / environment variables.

None
example_bbox tuple of float

Example bounding box (minx, miny, maxx, maxy) known to return data. Used in docs and live smoke tests.

None
example_date_range tuple of str

Example ISO date range (start, end) known to return data. Used in docs and live smoke tests.

None
cloud_config dict

Cloud provider configuration for URL resolution.

None
torchgeo_class str

Equivalent TorchGeo class name (reference only, not a dependency).

None
torchgeo_verified bool

True when the underlying data source has been confirmed to be the same files that the TorchGeo class reads.

False

DatasetRegistry

Registry of dataset descriptors. Proto-spec catalog.

Built-in datasets are registered at module import time. Users can add entries via :meth:register or the top-level :func:rasteret.register helper.

Functions
register classmethod
register(descriptor: DatasetDescriptor) -> None

Register a dataset descriptor.

Also populates :class:~rasteret.constants.BandRegistry and :class:~rasteret.cloud.CloudConfig keyed by the descriptor id so that provider-specific conventions do not collide (e.g. Planetary Computer vs Earth Search for sentinel-2-l2a).

Source code in src/rasteret/catalog.py
@classmethod
def register(cls, descriptor: DatasetDescriptor) -> None:
    """Register a dataset descriptor.

    Also populates :class:`~rasteret.constants.BandRegistry` and
    :class:`~rasteret.cloud.CloudConfig` keyed by the descriptor id so that
    provider-specific conventions do not collide (e.g. Planetary Computer
    vs Earth Search for ``sentinel-2-l2a``).
    """
    cls._descriptors[descriptor.id] = descriptor

    # Populate BandRegistry keyed by descriptor id (namespaced).
    # Skip if an entry already exists (first-write-wins).
    if descriptor.band_map:
        from rasteret.constants import BandRegistry

        if not BandRegistry.get(descriptor.id):
            BandRegistry.register(descriptor.id, descriptor.band_map)

    # Populate CloudConfig keyed by descriptor id (namespaced).
    if descriptor.cloud_config:
        from rasteret.cloud import CloudConfig

        CloudConfig.register(
            descriptor.id,
            CloudConfig(
                provider=descriptor.cloud_config.get("provider", "aws"),
                requester_pays=descriptor.cloud_config.get("requester_pays", False),
                region=descriptor.cloud_config.get("region", "us-west-2"),
                url_patterns=descriptor.cloud_config.get("url_patterns", {}),
            ),
        )
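The first-write-wins guard on band maps can be illustrated with a minimal stand-in using plain dicts (this is not the library's BandRegistry, just a sketch of the same semantics):

```python
# Stand-in for the band-map registration above: the first band_map
# registered under a dataset id wins; later writes are silently skipped.
band_registry: dict[str, dict[str, str]] = {}

def register_bands(dataset_id: str, band_map: dict[str, str]) -> None:
    # Skip if an entry already exists (first-write-wins).
    if dataset_id not in band_registry:
        band_registry[dataset_id] = band_map

register_bands("earthsearch/sentinel-2-l2a", {"B04": "red"})
register_bands("earthsearch/sentinel-2-l2a", {"B04": "nir"})  # ignored
```

Because entries are keyed by the namespaced descriptor id, two providers exposing the same STAC collection (e.g. under different namespaces) never clobber each other's band maps.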
unregister classmethod
unregister(dataset_id: str) -> DatasetDescriptor | None

Remove a descriptor from the in-memory registry.

Source code in src/rasteret/catalog.py
@classmethod
def unregister(cls, dataset_id: str) -> DatasetDescriptor | None:
    """Remove a descriptor from the in-memory registry."""
    return cls._descriptors.pop(dataset_id, None)
get classmethod
get(dataset_id: str) -> DatasetDescriptor | None

Look up a descriptor by namespaced ID.

Parameters:

Name Type Description Default
dataset_id str

Full namespaced id (e.g. "earthsearch/sentinel-2-l2a").

required
Source code in src/rasteret/catalog.py
@classmethod
def get(cls, dataset_id: str) -> DatasetDescriptor | None:
    """Look up a descriptor by namespaced ID.

    Parameters
    ----------
    dataset_id : str
        Full namespaced id (e.g. ``"earthsearch/sentinel-2-l2a"``).
    """
    return cls._descriptors.get(dataset_id)
list classmethod
list() -> list[DatasetDescriptor]

Return all registered descriptors.

Source code in src/rasteret/catalog.py
@classmethod
def list(cls) -> list[DatasetDescriptor]:
    """Return all registered descriptors."""
    return list(cls._descriptors.values())
search classmethod
search(keyword: str) -> list[DatasetDescriptor]

Search descriptors by keyword in id, name, or description.

Parameters:

Name Type Description Default
keyword str

Case-insensitive search term.

required
Source code in src/rasteret/catalog.py
@classmethod
def search(cls, keyword: str) -> list[DatasetDescriptor]:
    """Search descriptors by keyword in id, name, or description.

    Parameters
    ----------
    keyword : str
        Case-insensitive search term.
    """
    kw = keyword.lower()
    return [
        d
        for d in cls._descriptors.values()
        if kw in d.id.lower() or kw in d.name.lower() or kw in d.description.lower()
    ]
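The stand-in below mirrors the case-insensitive substring match over id, name, and description; the entries are illustrative, not the registry's real contents:

```python
# Minimal sketch of the search semantics over illustrative entries.
entries = [
    {"id": "earthsearch/sentinel-2-l2a", "name": "Sentinel-2 L2A", "description": ""},
    {"id": "acme/field-survey-2024", "name": "ACME Field Survey", "description": ""},
]

def search(keyword: str) -> list[dict[str, str]]:
    kw = keyword.lower()
    return [
        e for e in entries
        if kw in e["id"].lower() or kw in e["name"].lower()
        or kw in e["description"].lower()
    ]

hits = search("SENTINEL")  # case-insensitive match on the id
```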

Functions

load_local_descriptors

load_local_descriptors(
    path: str | Path | None = None,
) -> list[DatasetDescriptor]

Load persisted local dataset descriptors from JSON.

Invalid entries are skipped with a warning.

Source code in src/rasteret/catalog.py
def load_local_descriptors(
    path: str | Path | None = None,
) -> list[DatasetDescriptor]:
    """Load persisted local dataset descriptors from JSON.

    Invalid entries are skipped with a warning.
    """
    registry_path = _local_registry_path(path)
    if not registry_path.exists():
        return []
    try:
        payload = json.loads(registry_path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, OSError) as exc:
        logger.warning(
            "Failed to read local dataset registry %s: %s", registry_path, exc
        )
        return []

    if not isinstance(payload, list):
        logger.warning(
            "Local dataset registry %s must contain a JSON list", registry_path
        )
        return []

    descriptors: list[DatasetDescriptor] = []
    for entry in payload:
        if not isinstance(entry, dict):
            continue
        try:
            descriptors.append(DatasetDescriptor(**entry))
        except TypeError as exc:
            dataset_id = entry.get("id", "<missing-id>")
            logger.warning(
                "Skipping invalid local dataset descriptor %s: %s", dataset_id, exc
            )
    return descriptors
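The on-disk format load_local_descriptors expects is a JSON list of descriptor kwargs. A self-contained sketch of writing and reading that shape, where the path and field values are assumptions:

```python
import json
import tempfile
from pathlib import Path

# Illustrative on-disk shape for the local registry: a JSON list whose
# entries are DatasetDescriptor keyword arguments.
payload = [
    {
        "id": "acme/aerial-2023",
        "name": "ACME Aerial 2023",
        "geoparquet_uri": "file:///data/aerial-2023/index.parquet",
        "spatial_coverage": "local",
    },
]
registry_path = Path(tempfile.mkdtemp()) / "datasets.json"
registry_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")

loaded = json.loads(registry_path.read_text(encoding="utf-8"))
```

Entries whose keys do not match the DatasetDescriptor signature would be skipped with a warning rather than aborting the load.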

save_local_descriptor

save_local_descriptor(
    descriptor: DatasetDescriptor,
    path: str | Path | None = None,
) -> None

Persist a local dataset descriptor to JSON (upsert by id).

Source code in src/rasteret/catalog.py
def save_local_descriptor(
    descriptor: DatasetDescriptor,
    path: str | Path | None = None,
) -> None:
    """Persist a local dataset descriptor to JSON (upsert by id)."""
    registry_path = _local_registry_path(path)
    existing = {d.id: d for d in load_local_descriptors(registry_path)}
    existing[descriptor.id] = descriptor
    _write_local_descriptors(list(existing.values()), registry_path)

remove_local_descriptor

remove_local_descriptor(
    dataset_id: str, path: str | Path | None = None
) -> DatasetDescriptor | None

Remove one persisted local descriptor (if present).

Source code in src/rasteret/catalog.py
def remove_local_descriptor(
    dataset_id: str,
    path: str | Path | None = None,
) -> DatasetDescriptor | None:
    """Remove one persisted local descriptor (if present)."""
    registry_path = _local_registry_path(path)
    descriptors = load_local_descriptors(registry_path)
    removed: DatasetDescriptor | None = None
    kept: list[DatasetDescriptor] = []

    for descriptor in descriptors:
        if descriptor.id == dataset_id:
            removed = descriptor
        else:
            kept.append(descriptor)

    if removed is None:
        return None

    _write_local_descriptors(kept, registry_path)
    return removed

unregister_local_descriptor

unregister_local_descriptor(
    dataset_id: str, path: str | Path | None = None
) -> DatasetDescriptor | None

Unregister a local dataset from persisted and in-memory registries.

Source code in src/rasteret/catalog.py
def unregister_local_descriptor(
    dataset_id: str,
    path: str | Path | None = None,
) -> DatasetDescriptor | None:
    """Unregister a local dataset from persisted and in-memory registries."""
    persisted = remove_local_descriptor(dataset_id, path=path)

    in_memory = DatasetRegistry.get(dataset_id)
    removed_in_memory: DatasetDescriptor | None = None
    if (
        in_memory is not None
        and in_memory.geoparquet_uri
        and in_memory.spatial_coverage == "local"
    ):
        removed_in_memory = DatasetRegistry.unregister(dataset_id)

    return persisted or removed_in_memory

export_local_descriptor

export_local_descriptor(
    dataset_id: str,
    output_path: str | Path,
    path: str | Path | None = None,
) -> Path

Export one local descriptor as JSON for sharing.

Source code in src/rasteret/catalog.py
def export_local_descriptor(
    dataset_id: str,
    output_path: str | Path,
    path: str | Path | None = None,
) -> Path:
    """Export one local descriptor as JSON for sharing."""
    descriptor = next(
        (entry for entry in load_local_descriptors(path) if entry.id == dataset_id),
        None,
    )

    if descriptor is None:
        runtime_descriptor = DatasetRegistry.get(dataset_id)
        if (
            runtime_descriptor is not None
            and runtime_descriptor.geoparquet_uri
            and runtime_descriptor.spatial_coverage == "local"
        ):
            descriptor = runtime_descriptor

    if descriptor is None:
        raise KeyError(f"Local dataset '{dataset_id}' not found.")

    destination = Path(output_path).expanduser()
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_text(
        json.dumps(asdict(descriptor), indent=2, sort_keys=True) + "\n",
        encoding="utf-8",
    )
    return destination
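Since the export is a plain asdict round trip, a shared file can be re-imported by constructing the descriptor from the parsed entry. A minimal stand-in dataclass demonstrates the round trip (Desc is hypothetical, standing in for DatasetDescriptor):

```python
import json
from dataclasses import asdict, dataclass

# Two-field stand-in for DatasetDescriptor, enough to show the round trip.
@dataclass
class Desc:
    id: str
    name: str = ""

original = Desc(id="acme/aerial-2023", name="ACME Aerial 2023")

# Export side: dataclass -> dict -> stable JSON text (sorted keys).
text = json.dumps(asdict(original), indent=2, sort_keys=True) + "\n"

# Import side: parsed dict splatted back into the dataclass.
restored = Desc(**json.loads(text))
```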