noether.data.zarr_store.statistics

Calculate per-field statistics of a converted Zarr store.

The Zarr-store counterpart of noether.data.tools.calculate_statistics. Instead of going through a dataset class, it streams every sample’s fields straight from the store (a local path or an fsspec URL such as oci://bucket@namespace/zarr_store) and accumulates running moments in a single pass. Linear and logscale moments are computed together via RunningStats, so one run yields everything a stats.yaml needs: {field}_mean/std/min/max, {field}_logscale_mean/std, and the global raw_pos_min/raw_pos_max position bounds.
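To illustrate what accumulating running moments in a single pass can look like, here is a minimal sketch based on Welford's update. MomentSketch and signed_log1p are hypothetical names made up for this example; they are not the RunningStats API, and the logscale transform shown is an assumption that may differ from the one the library applies.

```python
import math


class MomentSketch:
    """Hypothetical single-pass accumulator (not the real RunningStats)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def update(self, values):
        # Welford's update: each value is visited once and never stored.
        for x in values:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)
            self.min = min(self.min, x)
            self.max = max(self.max, x)

    @property
    def std(self):
        # Population standard deviation over everything seen so far.
        return math.sqrt(self.m2 / self.count) if self.count else float("nan")


def signed_log1p(x):
    # Assumed logscale transform for this sketch only.
    return math.copysign(math.log1p(abs(x)), x)


# Linear and logscale moments are updated side by side, batch by batch.
linear, logscale = MomentSketch(), MomentSketch()
for batch in ([1.0, 2.0], [3.0, 4.0]):
    linear.update(batch)
    logscale.update([signed_log1p(x) for x in batch])

print(linear.count, linear.mean, linear.std, linear.min, linear.max)
```

Because only the count, mean, squared-deviation sum, and bounds are kept, memory use stays constant no matter how many samples are streamed.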

Usage:

OCIFS_IAM_TYPE=api_key uv run python -m noether.data.zarr_store.statistics \
    --store oci://emmi-drivaernet@frwnorq7ern2/zarr_store \
    --split-file train_design_ids.txt \
    --workers 8 --read-concurrency 4 --output-json drivaernet_stats.json

Functions

read_split_ids(store_root, split_file)

Read sample ids (one per line) from a split file.

calculate_store_statistics(store_root, *[, fields, ...])

Stream all samples of a Zarr store and accumulate per-field running statistics.

statistics_to_dict(running_stats)

Flatten running statistics into stats.yaml-style keys.

print_statistics(running_stats)

Print the accumulated statistics per field, plus the global position bounds.

save_statistics_to_json(running_stats, output_path)

Save the flattened statistics (see statistics_to_dict()) as JSON.

main()

CLI entry point for calculating Zarr store statistics.

Module Contents

noether.data.zarr_store.statistics.read_split_ids(store_root, split_file)

Read sample ids (one per line) from a split file.

split_file may be an absolute path/URL or a name relative to the store root (e.g. train_design_ids.txt next to manifest.json).

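Parameters:
  • store_root – Store root (local path or fsspec URL).

  • split_file – Absolute path/URL of the split file, or a name relative to the store root.

Return type: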

list[str]
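The path resolution described above can be sketched for the local-filesystem case as follows. read_ids_sketch is a hypothetical helper written for this example, not the library function; it does not handle fsspec URLs, and skipping blank lines is an assumption.

```python
import tempfile
from pathlib import Path


def read_ids_sketch(store_root, split_file):
    """Hypothetical local-only stand-in for read_split_ids."""
    path = Path(split_file)
    if not path.is_absolute():
        # A bare name is looked up next to the store root.
        path = Path(store_root) / path
    with open(path, encoding="utf-8") as handle:
        # One id per line; blank lines are dropped (assumption).
        return [line.strip() for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as root:
    split_path = Path(root) / "train_design_ids.txt"
    split_path.write_text("design_0001\ndesign_0002\n\n", encoding="utf-8")

    # Relative name, resolved against the store root.
    ids = read_ids_sketch(root, "train_design_ids.txt")
    # Absolute path, used as given.
    ids_abs = read_ids_sketch("/ignored", str(split_path))

print(ids)
```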

noether.data.zarr_store.statistics.calculate_store_statistics(store_root, *, fields=None, exclude_fields=None, sample_ids=None, limit=None, max_workers=1, read_concurrency=1, progress=False)

Stream all samples of a Zarr store and accumulate per-field running statistics.

Parameters:
  • store_root (str | pathlib.Path) – Store root (local path or fsspec URL).

  • fields (set[str] | None) – Restrict to these canonical field names (default: every stored field).

  • exclude_fields (set[str] | None) – Field names to skip.

  • sample_ids (list[str] | None) – Restrict to these manifest sample ids (default: all samples).

  • limit (int | None) – Process at most this many samples (after sample_ids filtering).

  • max_workers (int) – Samples read concurrently (threads); accumulation stays single-threaded.

  • read_concurrency (int) – Per-sample chunk-read threads (see ZarrChunkReader).

  • progress (bool) – Show a tqdm progress bar.

Returns:

Mapping from canonical field name to its RunningStats (per-component mean/std/min/max and logscale moments, accumulated in float64).

Raises:

ValueError – If a requested field is not present in the store, or if no samples remain to process. Sample ids missing from the store do not raise; they are skipped with a warning, since split files may list samples that were skipped at conversion.

Return type:

dict[str, noether.data.stats.RunningStats]
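The split between concurrent reads and single-threaded accumulation (see max_workers above) can be sketched with a thread pool. This is an illustration of the pattern under stated assumptions, not the library's implementation: load_sample and accumulate are hypothetical, the field name is made up, and only a running mean is tracked to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def load_sample(sample_id):
    """Hypothetical reader returning {field_name: values} for one sample."""
    base = float(len(sample_id))
    return {"surface_pressure": [base, base + 1.0]}


def accumulate(sample_ids, max_workers=1, limit=None):
    # Apply the limit after any sample-id filtering has already happened.
    selected = list(sample_ids)[:limit] if limit is not None else list(sample_ids)
    if not selected:
        raise ValueError("no samples remain to process")

    totals = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Reads overlap in worker threads; results come back in input order
        # and are folded into the totals on the calling thread only.
        for fields in pool.map(load_sample, selected):
            for name, values in fields.items():
                count, total = totals.get(name, (0, 0.0))
                totals[name] = (count + len(values), total + sum(values))
    return {name: total / count for name, (count, total) in totals.items()}


means = accumulate(["a", "bb", "ccc"], max_workers=2)
print(means)
```

Keeping the accumulation on one thread avoids locking around the shared statistics, while the I/O-bound reads still benefit from the extra workers.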

noether.data.zarr_store.statistics.statistics_to_dict(running_stats)

Flatten running statistics into stats.yaml-style keys.

Emits {field}_mean/std/min/max/count and {field}_logscale_mean/std per field, plus global raw_pos_min/raw_pos_max scalars over all *_position fields (the bounds used by position normalization).

Parameters:

running_stats (dict[str, noether.data.stats.RunningStats])

Return type:

dict[str, list[float] | int]
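The key layout can be illustrated with plain dictionaries standing in for RunningStats objects. flatten_sketch and the input layout are hypothetical and exist only for this example; in particular, reducing the position bounds to single floats over all components is an assumption about how the global scalars are formed.

```python
def flatten_sketch(stats):
    """Hypothetical flattening of per-field stats into stats.yaml-style keys."""
    flat = {}
    pos_mins, pos_maxs = [], []
    for field, s in stats.items():
        for key in ("mean", "std", "min", "max", "count"):
            flat[f"{field}_{key}"] = s[key]
        flat[f"{field}_logscale_mean"] = s["logscale_mean"]
        flat[f"{field}_logscale_std"] = s["logscale_std"]
        if field.endswith("_position"):
            # Collect per-component bounds of every position field.
            pos_mins.append(min(s["min"]))
            pos_maxs.append(max(s["max"]))
    if pos_mins:
        # Global bounds across all *_position fields (assumed reduction).
        flat["raw_pos_min"] = min(pos_mins)
        flat["raw_pos_max"] = max(pos_maxs)
    return flat


example = {
    "surface_position": {
        "mean": [0.0, 0.5], "std": [1.0, 2.0],
        "min": [-3.0, -1.0], "max": [4.0, 2.0], "count": 10,
        "logscale_mean": [0.0, 0.1], "logscale_std": [0.5, 0.6],
    },
    "surface_pressure": {
        "mean": [1.5], "std": [0.2],
        "min": [-9.0], "max": [9.0], "count": 10,
        "logscale_mean": [0.3], "logscale_std": [0.1],
    },
}
flat = flatten_sketch(example)
print(sorted(flat))
```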

noether.data.zarr_store.statistics.print_statistics(running_stats)

Print the accumulated statistics per field, plus the global position bounds.

Parameters:

running_stats (dict[str, noether.data.stats.RunningStats])

Return type:

None

noether.data.zarr_store.statistics.save_statistics_to_json(running_stats, output_path)

Save the flattened statistics (see statistics_to_dict()) as JSON.

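Parameters:
  • running_stats – Running statistics to flatten and save (as accepted by statistics_to_dict()).

  • output_path – Destination of the JSON file.

Return type: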

None

noether.data.zarr_store.statistics.main()

CLI entry point for calculating Zarr store statistics.

Return type:

None