rasteret.ingest.parquet_record_table¶
Parquet record-table driver for building Collections from tabular indexes.
parquet_record_table¶
Record-table collection builder.
Reads a Parquet/GeoParquet record table (one row per raster item) and
normalizes it into a `rasteret.core.collection.Collection` via
`rasteret.ingest.normalize.build_collection_from_table`.
Terminology
- Record table -- a tabular index that enumerates raster items (satellite scenes, drone images, derived products, grid cells, etc.). It may come from stac-geoparquet, a lab-specific registry, or a custom export.
- Collection Parquet -- Rasteret's normalized, runtime-ready Parquet dataset that follows the [Schema Contract](explanation/schema-contract) docs page.
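To make the terminology concrete, a single record-table row can be pictured as a mapping with the four contract columns. All field values below are illustrative, not taken from a real catalog:

```python
# A minimal record-table row holding the four contract columns.
# Values are illustrative; real tables come from stac-geoparquet,
# a lab registry, or a custom export.
row = {
    "id": "S2A_20240101_T43PGQ",          # unique item identifier
    "datetime": "2024-01-01T05:30:00Z",   # acquisition time
    "geometry": "POLYGON ((77 13, 78 13, 78 14, 77 14, 77 13))",  # footprint
    "assets": {                           # band code -> COG URL
        "B04": "https://example.com/scene/B04.tif",
        "B08": "https://example.com/scene/B08.tif",
    },
}

print(sorted(row))  # → ['assets', 'datetime', 'geometry', 'id']
```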
Classes¶
RecordTableBuilder¶
RecordTableBuilder(
path: str | Path,
*,
data_source: str = "",
column_map: dict[str, str] | None = None,
href_column: str | None = None,
band_index_map: dict[str, int] | None = None,
url_rewrite_patterns: dict[str, str] | None = None,
filesystem: Any | None = None,
columns: list[str] | None = None,
filter_expr: Expression | None = None,
name: str = "",
workspace_dir: str | Path | None = None,
enrich_cog: bool = False,
band_codes: list[str] | None = None,
max_concurrent: int = 300,
backend: StorageBackend | None = None,
)
Bases: CollectionBuilder
Build a Collection from an existing Parquet/GeoParquet table.

Reads a Parquet record table where each row is a raster item with,
at minimum, the four contract columns (`id`, `datetime`,
`geometry`, `assets`), or columns that can be normalised into
them via `column_map`, `href_column`, and `band_index_map`.

When `enrich_cog=True`, the builder parses COG headers from the
asset URLs and adds `{band}_metadata` struct columns, making
the resulting Collection suitable for fast tiled reads and TorchGeo
integration.
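The `href_column` + `band_index_map` normalisation can be sketched in plain Python. This is only an illustration of the idea; the real struct layout is defined by Rasteret's schema contract, and the field names `"href"` and `"band_index"` here are assumptions:

```python
# Rough sketch: derive an ``assets`` mapping for rows that only carry a
# single COG URL column.  Field names ("href", "band_index") are
# illustrative assumptions, not the contract's actual struct layout.
def build_assets(href: str, band_index_map: dict[str, int]) -> dict:
    return {
        band: {"href": href, "band_index": idx}
        for band, idx in band_index_map.items()
    }

assets = build_assets(
    "https://example.com/scene.tif",
    {"red": 1, "nir": 2},
)
print(assets["nir"]["band_index"])  # → 2
```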
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | str or Path | Path/URI to the Parquet/GeoParquet file or dataset directory. | *required* |
| `data_source` | str | Data-source identifier for the resulting Collection. | `''` |
| `column_map` | dict | Mapping of source column names to the contract column names (`id`, `datetime`, `geometry`, `assets`). | `None` |
| `href_column` | str | Column containing COG URLs. When set and no `assets` column is present, `assets` is constructed from it together with `band_index_map`. | `None` |
| `band_index_map` | dict | Mapping of band code to band index, used with `href_column` to construct `assets`. | `None` |
| `url_rewrite_patterns` | dict | Pattern-to-replacement mapping applied to asset URLs. | `None` |
| `filesystem` | FileSystem | PyArrow filesystem for reading remote URIs. | `None` |
| `columns` | list of str | Scan-time column projection. | `None` |
| `filter_expr` | Expression | Scan-time predicate pushdown. | `None` |
| `enrich_cog` | bool | If `True`, parse COG headers from the asset URLs and add `{band}_metadata` struct columns. | `False` |
| `band_codes` | list of str | Bands to enrich. If omitted, all bands found in the `assets` column are enriched. | `None` |
| `max_concurrent` | int | Maximum concurrent HTTP connections for COG header parsing. | `300` |
| `name` | str | Collection name. Passed through to the normalisation layer. | `''` |
| `workspace_dir` | str or Path | If provided, persist the resulting Collection as Parquet here. | `None` |
| `backend` | StorageBackend | I/O backend for authenticated range reads during COG header parsing. | `None` |
Source code in src/rasteret/ingest/parquet_record_table.py
Functions¶
build¶
Read the record table and return a normalized Collection.
Pipeline: read -> alias -> prepare -> enrich -> normalize.
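The pipeline above can be sketched as plain function composition. Each stage below is a stdlib stand-in for the real implementation, which operates on Arrow tables rather than Python dicts:

```python
# Sketch of the build() pipeline as function composition.  Every stage
# is a simplified stand-in; only the order of stages mirrors the docs.
def read(path):       return [{"id": 7, "datetime": 2024}]   # rows from Parquet
def alias(rows):      return rows                            # apply column_map
def prepare(rows):    return [{**r, "id": str(r["id"])} for r in rows]
def enrich(rows):     return rows                            # optional COG headers
def normalize(rows):  return {"items": rows}                 # -> Collection-like

def build(path):
    return normalize(enrich(prepare(alias(read(path)))))

print(build("table.parquet"))  # → {'items': [{'id': '7', 'datetime': 2024}]}
```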
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | Any | | `{}` |

Returns:

| Type | Description |
|---|---|
| `Collection` | The normalized Collection built from the record table. |
Source code in src/rasteret/ingest/parquet_record_table.py
Functions¶
prepare_record_table¶
prepare_record_table(
table: Table,
*,
href_column: str | None = None,
band_index_map: dict[str, int] | None = None,
url_rewrite_patterns: dict[str, str] | None = None,
) -> Table
Normalise column types and construct `assets` when absent.

This is a pure function (no instance state) so it can be used from both
`RecordTableBuilder` and the in-memory `build_from_table()` path
without constructing a builder object.
Steps:

- Auto-coerce `id`: integer -> string.
- Auto-coerce `datetime`: integer year -> timestamp.
- Construct `assets` from `href_column` + `band_index_map`.
- Derive `proj:epsg` from a `crs` column when present.
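These coercion steps can be sketched with stdlib Python on a single row. The real function works on Arrow tables, so the helper below is a simplified illustration of the rules only:

```python
from datetime import datetime, timezone

# Stdlib sketch of prepare_record_table's coercion rules for one row.
# The real implementation operates on a pyarrow.Table, not dicts.
def prepare_row(row: dict) -> dict:
    out = dict(row)
    if isinstance(out.get("id"), int):            # id: integer -> string
        out["id"] = str(out["id"])
    if isinstance(out.get("datetime"), int):      # integer year -> timestamp
        out["datetime"] = datetime(out["datetime"], 1, 1, tzinfo=timezone.utc)
    if "proj:epsg" not in out and "crs" in out:   # derive proj:epsg from crs
        out["proj:epsg"] = int(str(out["crs"]).removeprefix("EPSG:"))
    return out

row = prepare_row({"id": 42, "datetime": 2021, "crs": "EPSG:32643"})
print(row["id"], row["proj:epsg"])  # → 42 32643
```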