rasteret.core.collection
The central Collection class: Arrow dataset wrapper with filtering, output adapters, and persistence.
Classes
Collection
Collection(
    dataset: Dataset | None = None,
    name: str = "",
    description: str = "",
    data_source: str = "",
    start_date: datetime | None = None,
    end_date: datetime | None = None,
)
A collection of raster data with flexible initialization.
Collections can be created from:
- Local partitioned datasets
- Single Arrow tables
Collections maintain efficient partitioned storage when using files.
Initialize a Collection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | Dataset | Backing Arrow dataset. | None |
| name | str | Human-readable collection name. | '' |
| description | str | Free-text description. | '' |
| data_source | str | Data source identifier. | '' |
| start_date | datetime | Collection temporal start. | None |
| end_date | datetime | Collection temporal end. | None |
Source code in src/rasteret/core/collection.py
Attributes
bounds (property)
Spatial extent as (minx, miny, maxx, maxy) or None.
Functions
from_parquet (classmethod)
from_parquet(
    path: str | Path, name: str = ""
) -> Collection
Load a Collection from any Parquet file or directory.
Accepts local paths and cloud URIs (s3://, gs://).
Tries Hive-style partitioning first (year/month) and falls back to
plain Parquet. Validates that the core contract columns are present.
See the Schema Contract docs page (../explanation/schema-contract/).
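Hive-style partitioning encodes partition values as `key=value` directory names. The sketch below shows how such a layout can be told apart from a plain Parquet path. The helper name `hive_partitions` and the example paths are illustrative assumptions, not part of rasteret.

```python
from pathlib import PurePosixPath


def hive_partitions(path: str) -> dict[str, str]:
    """Collect key=value directory segments from a path (illustrative helper)."""
    parts = {}
    # Only directory segments can be partition keys; skip the file name itself.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key and value:
            parts[key] = value
    return parts


# A Hive-partitioned layout carries year/month in the directory names.
print(hive_partitions("collection/year=2024/month=06/part-0.parquet"))
# A plain Parquet file has no partition segments, so the result is empty.
print(hive_partitions("collection/data.parquet"))
```

An empty result corresponds to the "plain Parquet" case described above.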
subset
subset(
    *,
    cloud_cover_lt: float | None = None,
    date_range: tuple[str, str] | None = None,
    bbox: tuple[float, float, float, float] | None = None,
    geometries: Any = None,
    split: str | Sequence[str] | None = None,
    split_column: str = "split",
) -> Collection
Return a filtered view of this Collection.
All provided criteria are combined with AND.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cloud_cover_lt | float | Keep records with cloud cover below this value. | None |
| date_range | tuple of str | | None |
| bbox | tuple of float | | None |
| geometries | bbox tuple, pa.Array, Shapely, WKB bytes, or GeoJSON dict | Spatial filter; records whose bbox overlaps any geometry are kept. | None |
| split | str or sequence of str | Keep only rows matching the given split value(s). | None |
| split_column | str | Column name holding split labels. Defaults to 'split'. | 'split' |
Returns:
| Type | Description |
|---|---|
| Collection | A new Collection with the filtered dataset view. |
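The AND semantics can be pictured with plain Python records. This is a minimal sketch, not rasteret's implementation: the record keys (`cloud_cover`, `date`, `bbox`), the inclusive date bounds, and the treatment of touching boxes as overlapping are all assumptions made for illustration.

```python
from typing import Callable, Optional

Record = dict
BBox = tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def bbox_overlaps(a: BBox, b: BBox) -> bool:
    """True when two (minx, miny, maxx, maxy) boxes intersect or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def subset_records(
    records: list[Record],
    *,
    cloud_cover_lt: Optional[float] = None,
    date_range: Optional[tuple[str, str]] = None,
    bbox: Optional[BBox] = None,
) -> list[Record]:
    """Keep records that satisfy every criterion that was provided."""
    predicates: list[Callable[[Record], bool]] = []
    if cloud_cover_lt is not None:
        predicates.append(lambda r: r["cloud_cover"] < cloud_cover_lt)
    if date_range is not None:
        start, end = date_range
        # ISO 8601 date strings compare correctly as plain strings.
        predicates.append(lambda r: start <= r["date"] <= end)
    if bbox is not None:
        predicates.append(lambda r: bbox_overlaps(r["bbox"], bbox))
    # all() over an empty predicate list is True, so no criteria keeps everything.
    return [r for r in records if all(p(r) for p in predicates)]


records = [
    {"id": "a", "cloud_cover": 5.0, "date": "2024-01-15", "bbox": (0, 0, 1, 1)},
    {"id": "b", "cloud_cover": 40.0, "date": "2024-01-20", "bbox": (0, 0, 1, 1)},
    {"id": "c", "cloud_cover": 3.0, "date": "2023-06-01", "bbox": (5, 5, 6, 6)},
]
kept = subset_records(
    records, cloud_cover_lt=20, date_range=("2024-01-01", "2024-12-31")
)
print([r["id"] for r in kept])
```

Record `b` fails the cloud-cover test and record `c` fails the date test, so only `a` passes both criteria.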
select_split
select_split(
    split: str | Sequence[str],
    *,
    split_column: str = "split",
) -> Collection
Return a split-filtered view of this Collection.
This is a convenience wrapper around subset(split=...) to keep the
intent obvious in training code.
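The wrapper relationship can be sketched with a stand-in class. `FakeCollection` and its row storage are hypothetical and exist only to show the delegation; rasteret's real Collection works on an Arrow dataset.

```python
from collections.abc import Sequence
from typing import Optional, Union


class FakeCollection:
    """Stand-in for illustration only; it is not rasteret's Collection."""

    def __init__(self, rows: list[dict]):
        self.rows = rows

    def subset(
        self,
        *,
        split: Optional[Union[str, Sequence[str]]] = None,
        split_column: str = "split",
    ) -> "FakeCollection":
        if split is None:
            return FakeCollection(list(self.rows))
        # Accept a single label or several labels.
        wanted = {split} if isinstance(split, str) else set(split)
        return FakeCollection([r for r in self.rows if r.get(split_column) in wanted])

    def select_split(
        self, split: Union[str, Sequence[str]], *, split_column: str = "split"
    ) -> "FakeCollection":
        # Thin wrapper: same result as subset(split=...), clearer intent.
        return self.subset(split=split, split_column=split_column)


data = FakeCollection(
    [{"id": 1, "split": "train"}, {"id": 2, "split": "val"}, {"id": 3, "split": "test"}]
)
train = data.select_split("train")
held_out = data.select_split(["val", "test"])
print([r["id"] for r in train.rows], [r["id"] for r in held_out.rows])
```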
where
where(expr: Expression) -> Collection
Return a filtered view using a raw Arrow dataset expression.
list_collections (classmethod)
List cached collections with summary metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| workspace_dir | Path | Directory to scan for cached collections. A default location is used when omitted. | None |
Returns:
| Type | Description |
|---|---|
| list of dict | Each dict contains summary metadata for one cached collection. |
export
Export the collection as a partitioned Parquet dataset.
Use this to produce a portable copy of the collection that
teammates can open with rasteret.load.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str or Path | Output directory. Accepts local paths and cloud URIs. | required |
| partition_by | sequence of str | Columns to partition by. Defaults to ('year', 'month'). | ('year', 'month') |
iterate_rasters (async)
iterate_rasters(
    data_source: str | None = None,
) -> AsyncIterator[RasterAccessor]
Iterate through raster records in this Collection.
Each Parquet row becomes a RasterAccessor that provides
async band-loading methods.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_source | str | Data source identifier for band mapping. A default is used when omitted. | None |
Yields:
| Type | Description |
|---|---|
| RasterAccessor | |
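Because iterate_rasters returns an async iterator, it has to be consumed with `async for` inside a coroutine. The sketch below shows that pattern with stand-in classes. `FakeRaster` and `FakeCollection` are hypothetical and only mimic the iterator shape, not the real RasterAccessor API.

```python
import asyncio
from collections.abc import AsyncIterator


class FakeRaster:
    """Stand-in for RasterAccessor; only carries an id."""

    def __init__(self, raster_id: str):
        self.id = raster_id


class FakeCollection:
    """Stand-in exposing the same async-iterator shape as iterate_rasters()."""

    def __init__(self, ids: list[str]):
        self._ids = ids

    async def iterate_rasters(self) -> AsyncIterator[FakeRaster]:
        for raster_id in self._ids:
            await asyncio.sleep(0)  # yield control, as real I/O would
            yield FakeRaster(raster_id)


async def collect_ids(collection) -> list[str]:
    ids = []
    # An async generator is consumed with `async for`, not a plain `for`.
    async for raster in collection.iterate_rasters():
        ids.append(raster.id)
    return ids


ids = asyncio.run(collect_ids(FakeCollection(["scene-1", "scene-2"])))
print(ids)
```

In a script, `asyncio.run` drives the coroutine. In an environment that already runs an event loop, such as a notebook, the coroutine would be awaited directly instead.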
get_first_raster (async)
get_first_raster() -> RasterAccessor
Return the first raster record in the collection.
Returns:
| Type | Description |
|---|---|
| RasterAccessor | |
describe
Summary of this collection.
Returns a rasteret.core.display.DescribeResult that renders
as a clean table in terminals and as styled HTML in notebooks
(Jupyter, marimo, Colab).
The underlying data is accessible via .data or ["key"].
Examples:
>>> collection.describe() # pretty table in REPL
>>> collection.describe()["bands"] # programmatic access
>>> collection.describe().data # full dict
compare_to_catalog
Compare this collection against its catalog source.
Shows collection properties side by side with the catalog entry: band coverage, date range versus source range, spatial coverage, and auth requirements.
Raises ValueError if the collection has no catalog match.
Returns a rasteret.core.display.DescribeResult that renders
as a table in terminals and as styled HTML in notebooks.
Examples:
>>> collection.compare_to_catalog() # pretty comparison table
>>> collection.compare_to_catalog().data # full dict with catalog info
create_name (classmethod)
Create a standardized collection name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| custom_name | str | User-chosen name component. Underscores are normalised to dashes. | required |
| date_range | tuple of str | | required |
| data_source | str | Data source identifier. | required |
Returns:
| Type | Description |
|---|---|
| str | The standardized collection name. |
parse_name (classmethod)
Parse a standardized collection name into its components.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Collection name created by create_name. | required |
Returns:
| Type | Description |
|---|---|
| dict | The parsed name components. |
to_torchgeo_dataset
to_torchgeo_dataset(
    *,
    bands: list[str],
    chip_size: int | None = None,
    is_image: bool = True,
    allow_resample: bool = False,
    split: str | Sequence[str] | None = None,
    split_column: str = "split",
    label_field: str | None = None,
    geometries: Any = None,
    geometries_crs: int = 4326,
    transforms: Any = None,
    max_concurrent: int = 50,
    cloud_config: Any = None,
    backend: Any = None,
    time_series: bool = False,
    target_crs: int | None = None,
) -> RasteretGeoDataset
Create a TorchGeo GeoDataset backed by this Collection.
This integration is optional and requires torchgeo and its
dependencies.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bands | list of str | Band codes to load. | required |
| chip_size | int | Spatial extent of each chip in pixels. | None |
| is_image | bool | | True |
| allow_resample | bool | | False |
| split | str or sequence of str | Filter to the given split(s) before creating the dataset. | None |
| split_column | str | Column holding split labels. Defaults to 'split'. | 'split' |
| label_field | str | Column name to include as the label in each sample. | None |
| geometries | bbox tuple, pa.Array, Shapely, WKB bytes, or GeoJSON dict | Spatial extent for the dataset. | None |
| geometries_crs | int | EPSG code for geometries. Defaults to 4326. | 4326 |
| transforms | callable | TorchGeo-compatible transforms applied to each sample. | None |
| max_concurrent | int | Maximum concurrent HTTP requests. | 50 |
| cloud_config | CloudConfig | Cloud configuration for URL rewriting. | None |
| backend | StorageBackend | Pluggable I/O backend. | None |
| time_series | bool | | False |
| target_crs | int | Reproject all records to this EPSG code at read time. | None |
Returns:
| Type | Description |
|---|---|
| RasteretGeoDataset | A standard TorchGeo GeoDataset. |
get_xarray
get_xarray(
    geometries: Any,
    bands: list[str],
    *,
    max_concurrent: int = 50,
    cloud_config: Any = None,
    data_source: str | None = None,
    backend: Any = None,
    target_crs: int | None = None,
    **filters: Any,
) -> Dataset
Load selected bands into an xarray Dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| geometries | bbox tuple, pa.Array, Shapely, WKB bytes, or GeoJSON dict | Area(s) of interest to load. | required |
| bands | list of str | Band codes to load. | required |
| max_concurrent | int | Maximum concurrent HTTP requests. | 50 |
| cloud_config | CloudConfig | Cloud configuration for URL rewriting. | None |
| data_source | str | Override the inferred data source. | None |
| backend | StorageBackend | Pluggable I/O backend. | None |
| target_crs | int | Reproject all records to this CRS before merging. | None |
| filters | kwargs | Additional filter keyword arguments. | {} |
Returns:
| Type | Description |
|---|---|
| Dataset | Band arrays in native COG dtype. |
get_gdf
get_gdf(
    geometries: Any,
    bands: list[str],
    *,
    max_concurrent: int = 50,
    cloud_config: Any = None,
    data_source: str | None = None,
    backend: Any = None,
    target_crs: int | None = None,
    **filters: Any,
) -> GeoDataFrame
Load selected bands into a GeoDataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| geometries | bbox tuple, pa.Array, Shapely, WKB bytes, or GeoJSON dict | Area(s) of interest to load. | required |
| bands | list of str | Band codes to load. | required |
| max_concurrent | int | Maximum concurrent HTTP requests. | 50 |
| cloud_config | CloudConfig | Cloud configuration for URL rewriting. | None |
| data_source | str | Override the inferred data source. | None |
| backend | StorageBackend | Pluggable I/O backend. | None |
| target_crs | int | Reproject all records to this CRS before building the GeoDataFrame. | None |
| filters | kwargs | Additional filter keyword arguments. | {} |
Returns:
| Type | Description |
|---|---|
| GeoDataFrame | Band arrays in native COG dtype. Each row is a geometry-record pair with pixel data as columns. |