noether.data.zarr_store.statistics

Calculate per-field statistics of a converted Zarr store.

The Zarr-store counterpart of noether.data.tools.calculate_statistics. Instead of going through a dataset class, it streams every sample’s fields straight from the store (a local path or an fsspec URL such as oci://bucket@namespace/zarr_store) and accumulates running moments in a single pass. Linear and logscale moments are computed together via RunningStats, so one run yields everything a stats.yaml needs: {field}_mean/std/min/max, {field}_logscale_mean/std, and the global raw_pos_min/raw_pos_max position bounds.
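To make "running moments in a single pass" concrete, here is a minimal sketch using Welford's update on scalar values. It is not the library's RunningStats: the class `ScalarRunningStats`, the `logscale` helper, and the signed `log1p` transform are all assumptions chosen for illustration, and the real accumulator works per component in float64.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalarRunningStats:
    """Hypothetical stand-in for a running-moments accumulator (not noether's RunningStats)."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, value: float) -> None:
        # Welford's update: numerically stable, one value at a time.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def std(self) -> float:
        # Population standard deviation; the real module may use another convention.
        return math.sqrt(self.m2 / self.count) if self.count else float("nan")


def logscale(value: float) -> float:
    # Assumed signed-log transform; the transform noether applies is not stated here.
    return math.copysign(math.log1p(abs(value)), value)


linear, logged = ScalarRunningStats(), ScalarRunningStats()
for sample in ([1.0, 2.0], [3.0], [4.0, 5.0]):  # samples arrive one at a time
    for value in sample:
        # Both accumulators are fed in the same pass over the data.
        linear.update(value)
        logged.update(logscale(value))
```

Because each value is folded into the accumulators as it arrives, no sample needs to stay in memory after it has been processed, which is what makes streaming from a remote store practical.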

Usage:

OCIFS_IAM_TYPE=api_key uv run python -m noether.data.zarr_store.statistics \
    --store oci://emmi-drivaernet@frwnorq7ern2/zarr_store \
    --split-file train_design_ids.txt \
    --workers 8 --read-concurrency 4 --output-json drivaernet_stats.json

Functions

`read_split_ids`(store_root, split_file)	Read sample ids (one per line) from a split file.
`calculate_store_statistics`(store_root, *[, fields, ...])	Stream all samples of a Zarr store and accumulate per-field running statistics.
`statistics_to_dict`(running_stats)	Flatten running statistics into `stats.yaml`-style keys.
`print_statistics`(running_stats)	Print the accumulated statistics per field, plus the global position bounds.
`save_statistics_to_json`(running_stats, output_path)	Save the flattened statistics (see `statistics_to_dict()`) as JSON.
`main`()	CLI entry point for calculating Zarr store statistics.

Module Contents

noether.data.zarr_store.statistics.read_split_ids(store_root, split_file)

Read sample ids (one per line) from a split file.

split_file may be an absolute path/URL or a name relative to the store root (e.g. train_design_ids.txt next to manifest.json).

Parameters:

store_root (str | pathlib.Path)
split_file (str)

Return type:

list[str]
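The path resolution described above can be sketched for the local-filesystem case as follows. This is an illustration only: `read_split_ids_sketch` is a hypothetical name, URL handling through fsspec is left out, and skipping blank lines is an assumption rather than documented behavior.

```python
import tempfile
from pathlib import Path


def read_split_ids_sketch(store_root: str, split_file: str) -> list[str]:
    """Hypothetical local-only sketch; not the library's implementation."""
    path = Path(split_file)
    if not path.is_absolute():
        # A bare name is resolved relative to the store root.
        path = Path(store_root) / split_file
    # Assumption: blank lines and surrounding whitespace are ignored.
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


with tempfile.TemporaryDirectory() as root:
    # Made-up sample ids, written next to where manifest.json would live.
    (Path(root) / "train_design_ids.txt").write_text("design_0001\ndesign_0002\n\n")
    ids = read_split_ids_sketch(root, "train_design_ids.txt")
```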

noether.data.zarr_store.statistics.calculate_store_statistics(store_root, *, fields=None, exclude_fields=None, sample_ids=None, limit=None, max_workers=1, read_concurrency=1, progress=False)

Stream all samples of a Zarr store and accumulate per-field running statistics.

Parameters:

store_root (str | pathlib.Path) – Store root (local path or fsspec URL).
fields (set[str] | None) – Restrict to these canonical field names (default: every stored field).
exclude_fields (set[str] | None) – Field names to skip.
sample_ids (list[str] | None) – Restrict to these manifest sample ids (default: all samples).
limit (int | None) – Process at most this many samples (after sample_ids filtering).
max_workers (int) – Number of samples read concurrently (in threads); accumulation stays single-threaded.
read_concurrency (int) – Number of chunk-read threads per sample (see ZarrChunkReader).
progress (bool) – Show a tqdm progress bar.

Returns:

Mapping from canonical field name to its RunningStats (per-component mean/std/min/max and logscale moments, accumulated in float64).

Raises:

ValueError – If a requested field is not present in the store, or if no samples remain to process. Sample ids that are missing from the store do not raise: they are skipped with a warning, since split files may list samples that were skipped at conversion.

Return type:

dict[str, noether.data.stats.RunningStats]
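The division of labor described by max_workers (threads read, one thread accumulates) and the skip-with-warning behavior can be sketched as below. Everything here is hypothetical: `STORE` is an in-memory stand-in for a Zarr store, `load_sample` and `accumulate` are illustrative names, only a mean is accumulated, and applying `limit` after dropping missing ids is an assumption about ordering.

```python
import warnings
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "store": sample id -> {field name: values}.
STORE = {
    "a": {"pressure": [1.0, 2.0]},
    "b": {"pressure": [3.0]},
    "c": {"pressure": [4.0, 5.0]},
}


def load_sample(sample_id):
    # Stand-in for an I/O-bound read of one sample's fields.
    return STORE[sample_id]


def accumulate(sample_ids, limit=None, max_workers=1):
    present = []
    for sample_id in sample_ids:
        if sample_id in STORE:
            present.append(sample_id)
        else:
            # Missing ids are skipped with a warning instead of raising.
            warnings.warn(f"sample {sample_id!r} not in store, skipping")
    if limit is not None:
        present = present[:limit]
    if not present:
        raise ValueError("no samples remain to process")

    totals = {}  # field -> [count, sum]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Reads overlap across worker threads; pool.map yields results in
        # input order, and only the calling thread touches the accumulator.
        for sample in pool.map(load_sample, present):
            for field, values in sample.items():
                entry = totals.setdefault(field, [0, 0.0])
                entry[0] += len(values)
                entry[1] += sum(values)
    return {field: total / count for field, (count, total) in totals.items()}


with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the expected "missing" warning
    means = accumulate(["a", "missing", "b", "c"], max_workers=2)
```

Keeping accumulation on a single thread means the accumulator needs no locking, while the slow part (fetching chunks) still runs in parallel.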

noether.data.zarr_store.statistics.statistics_to_dict(running_stats)

Flatten running statistics into stats.yaml-style keys.

Emits {field}_mean/std/min/max/count and {field}_logscale_mean/std per field, plus global raw_pos_min/raw_pos_max scalars over all *_position fields (the bounds used by position normalization).

Parameters:

running_stats (dict[str, noether.data.stats.RunningStats])

Return type:

dict[str, list[float] | int]
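As a rough picture of the flattening, the sketch below turns nested per-field summaries into flat keys and derives global position bounds. The input dictionaries, the field names `surface_position` and `surface_pressure`, and the `flatten` helper are made up for illustration; the real function takes RunningStats objects, also emits logscale keys, and may encode the position bounds differently (they are plain floats here).

```python
# Hypothetical per-field summaries; the real input is a dict of RunningStats.
summaries = {
    "surface_position": {
        "mean": [0.0, 0.1, 0.2],
        "std": [1.0, 1.0, 1.0],
        "min": [-2.0, -1.0, 0.0],
        "max": [4.0, 1.0, 1.5],
        "count": 100,
    },
    "surface_pressure": {
        "mean": [-90.0],
        "std": [110.0],
        "min": [-800.0],
        "max": [400.0],
        "count": 100,
    },
}


def flatten(summaries):
    flat = {}
    for field, stats in summaries.items():
        for name, value in stats.items():
            # "surface_pressure" + "mean" -> "surface_pressure_mean"
            flat[f"{field}_{name}"] = value
    position_fields = [f for f in summaries if f.endswith("_position")]
    if position_fields:
        # Global bounds: smallest/largest component over all *_position fields.
        flat["raw_pos_min"] = min(min(summaries[f]["min"]) for f in position_fields)
        flat["raw_pos_max"] = max(max(summaries[f]["max"]) for f in position_fields)
    return flat


flat = flatten(summaries)
```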

noether.data.zarr_store.statistics.print_statistics(running_stats)

Print the accumulated statistics per field, plus the global position bounds.

Parameters:

running_stats (dict[str, noether.data.stats.RunningStats])

Return type:

None

noether.data.zarr_store.statistics.save_statistics_to_json(running_stats, output_path)

Save the flattened statistics (see statistics_to_dict()) as JSON.

Parameters:

running_stats (dict[str, noether.data.stats.RunningStats])
output_path (str | pathlib.Path)

Return type:

None
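Since the flattened statistics consist of lists of floats and integers, they serialize to JSON directly. The sketch below shows a plausible write-and-reload round trip with the standard library; the keys, values, and formatting options (`indent`, `sort_keys`) are assumptions, not the library's actual output.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical flattened statistics, shaped like stats.yaml-style keys.
flat_stats = {
    "surface_pressure_mean": [-90.0],
    "surface_pressure_count": 100,
    "raw_pos_min": -2.0,
}

with tempfile.TemporaryDirectory() as tmp:
    output_path = Path(tmp) / "stats.json"
    # Plain lists, floats and ints need no custom encoder.
    output_path.write_text(json.dumps(flat_stats, indent=2, sort_keys=True))
    reloaded = json.loads(output_path.read_text())
```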

noether.data.zarr_store.statistics.main()

CLI entry point for calculating Zarr store statistics.

Return type:

None