noether.data.zarr_store.statistics¶
Calculate per-field statistics of a converted Zarr store.
The Zarr-store counterpart of noether.data.tools.calculate_statistics: instead of
going through a dataset class, it streams every sample’s fields straight from the store
(local path or fsspec URL such as oci://bucket@namespace/zarr_store) and accumulates
running moments in a single pass. Linear and logscale moments are computed together via
RunningStats, so one run yields everything a stats.yaml
needs ({field}_mean/std/min/max plus {field}_logscale_mean/std and the global
raw_pos_min/raw_pos_max position bounds).
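The single-pass accumulation described above can be illustrated with Welford's algorithm. This is a minimal sketch, not the library's RunningStats: the class name RunningMoments, the scalar-only interface, the population (not sample) standard deviation, and the signed-log transform used for the logscale moments are all assumptions made for illustration.

```python
import math


class RunningMoments:
    """Hypothetical single-pass accumulator (Welford's algorithm).

    Illustrative only; this is not the library's RunningStats class.
    """

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def update(self, value: float) -> None:
        # Update mean and squared deviations without storing past values.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.min = min(self.min, value)
        self.max = max(self.max, value)

    @property
    def std(self) -> float:
        # Population standard deviation (an assumption of this sketch).
        return math.sqrt(self._m2 / self.count) if self.count else float("nan")


def signed_log(value: float) -> float:
    # Assumed logscale transform: sign(x) * log(1 + |x|).
    return math.copysign(math.log1p(abs(value)), value)


# Linear and logscale moments are updated together in the same pass.
linear, logscale = RunningMoments(), RunningMoments()
for value in [1.0, 2.0, 3.0, 4.0]:
    linear.update(value)
    logscale.update(signed_log(value))
```

Because each value is consumed once and then discarded, memory use stays constant regardless of how many samples the store holds.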
Usage:
OCIFS_IAM_TYPE=api_key uv run python -m noether.data.zarr_store.statistics \
--store oci://emmi-drivaernet@frwnorq7ern2/zarr_store \
--split-file train_design_ids.txt \
--workers 8 --read-concurrency 4 --output-json drivaernet_stats.json
Functions¶
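| read_split_ids() | Read sample ids (one per line) from a split file. |
| calculate_store_statistics() | Stream all samples of a Zarr store and accumulate per-field running statistics. |
| statistics_to_dict() | Flatten running statistics into stats.yaml-style keys. |
| print_statistics() | Print the accumulated statistics per field, plus the global position bounds. |
| save_statistics_to_json() | Save the flattened statistics (see statistics_to_dict()) as JSON. |
| main() | CLI entry point for calculating Zarr store statistics. |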
Module Contents¶
- noether.data.zarr_store.statistics.read_split_ids(store_root, split_file)¶
Read sample ids (one per line) from a split file.
split_file may be an absolute path/URL or a name relative to the store root (e.g. train_design_ids.txt next to manifest.json).
- Parameters:
store_root (str | pathlib.Path)
split_file (str)
- Return type:
- noether.data.zarr_store.statistics.calculate_store_statistics(store_root, *, fields=None, exclude_fields=None, sample_ids=None, limit=None, max_workers=1, read_concurrency=1, progress=False)¶
Stream all samples of a Zarr store and accumulate per-field running statistics.
- Parameters:
store_root (str | pathlib.Path) – Store root (local path or fsspec URL).
fields (set[str] | None) – Restrict to these canonical field names (default: every stored field).
sample_ids (list[str] | None) – Restrict to these manifest sample ids (default: all samples).
limit (int | None) – Process at most this many samples (after sample_ids filtering).
max_workers (int) – Samples read concurrently (threads); accumulation stays single-threaded.
read_concurrency (int) – Per-sample chunk-read threads (see ZarrChunkReader).
progress (bool) – Show a tqdm progress bar.
- Returns:
Mapping from canonical field name to its RunningStats (per-component mean/std/min/max and logscale moments, accumulated in float64).
- Raises:
ValueError – If a requested field is not present in the store, or no samples remain to process. Sample ids missing from the store are skipped with a warning (split files may list samples that were skipped at conversion).
- Return type:
dict[str, noether.data.stats.RunningStats]
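The concurrency model documented for max_workers (threaded reads, single-threaded accumulation) and the documented handling of missing sample ids can be sketched as follows. This is an illustration under stated assumptions, not the function's actual implementation: FAKE_STORE, load_sample, and accumulate are hypothetical stand-ins, and plain count/sum totals replace the real running statistics.

```python
import warnings
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a store: sample id -> field -> values.
FAKE_STORE = {
    "a": {"pressure": [1.0, 2.0]},
    "b": {"pressure": [3.0]},
}


def load_sample(sample_id: str) -> dict[str, list[float]]:
    # Hypothetical reader; a real one would fetch chunks from the store.
    return FAKE_STORE[sample_id]


def accumulate(sample_ids: list[str], max_workers: int = 1) -> dict[str, dict[str, float]]:
    # Ids missing from the store are skipped with a warning.
    present = []
    for sample_id in sample_ids:
        if sample_id in FAKE_STORE:
            present.append(sample_id)
        else:
            warnings.warn(f"sample {sample_id!r} not in store, skipping")
    if not present:
        raise ValueError("no samples remain to process")

    totals: dict[str, dict[str, float]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Reads run in worker threads; this loop is the only place that
        # mutates totals, so the accumulation needs no locking.
        for sample in pool.map(load_sample, present):
            for field, values in sample.items():
                entry = totals.setdefault(field, {"count": 0, "sum": 0.0})
                entry["count"] += len(values)
                entry["sum"] += sum(values)
    return totals


with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the expected "missing" warning
    result = accumulate(["a", "b", "missing"], max_workers=2)
```

Keeping the accumulation in the consuming loop means the worker threads only overlap I/O latency, which is typically the bottleneck when reading from remote object storage.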
- noether.data.zarr_store.statistics.statistics_to_dict(running_stats)¶
Flatten running statistics into stats.yaml-style keys.
Emits {field}_mean/std/min/max/count and {field}_logscale_mean/std per field, plus global raw_pos_min/raw_pos_max scalars over all *_position fields (the bounds used by position normalization).
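The key layout described above can be sketched with plain dictionaries. This is a hedged illustration rather than the function's real code: flatten_stats is a hypothetical helper, the input is a plain dict instead of RunningStats objects, and the field names surface_position and surface_pressure are invented for the example.

```python
def flatten_stats(stats: dict[str, dict[str, object]]) -> dict[str, object]:
    """Hypothetical flattening of per-field stats into stats.yaml-style keys."""
    flat: dict[str, object] = {}
    pos_mins: list[float] = []
    pos_maxs: list[float] = []
    for field, field_stats in stats.items():
        for key in ("mean", "std", "min", "max", "count",
                    "logscale_mean", "logscale_std"):
            flat[f"{field}_{key}"] = field_stats[key]
        if field.endswith("_position"):
            # Collect per-component bounds of every position field.
            pos_mins.extend(field_stats["min"])
            pos_maxs.extend(field_stats["max"])
    if pos_mins:
        # Global scalars: one minimum and one maximum over all components.
        flat["raw_pos_min"] = min(pos_mins)
        flat["raw_pos_max"] = max(pos_maxs)
    return flat


# Invented example values; per-component statistics are stored as lists.
example = {
    "surface_position": {
        "mean": [0.0, 0.5, 1.0], "std": [1.0, 1.0, 1.0],
        "min": [-1.0, 0.0, 0.2], "max": [4.0, 2.0, 1.5], "count": 10,
        "logscale_mean": [0.0, 0.4, 0.7], "logscale_std": [0.5, 0.5, 0.5],
    },
    "surface_pressure": {
        "mean": [3.0], "std": [2.0], "min": [-5.0], "max": [9.0], "count": 10,
        "logscale_mean": [1.1], "logscale_std": [0.8],
    },
}
flat = flatten_stats(example)
```

Note that surface_pressure contributes its own per-field keys but is ignored when computing the global position bounds, since its name does not end in _position.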
- noether.data.zarr_store.statistics.print_statistics(running_stats)¶
Print the accumulated statistics per field, plus the global position bounds.
- Parameters:
running_stats (dict[str, noether.data.stats.RunningStats])
- Return type:
None
- noether.data.zarr_store.statistics.save_statistics_to_json(running_stats, output_path)¶
Save the flattened statistics (see statistics_to_dict()) as JSON.
- Parameters:
running_stats (dict[str, noether.data.stats.RunningStats])
output_path (str | pathlib.Path)
- Return type:
None
- noether.data.zarr_store.statistics.main()¶
CLI entry point for calculating Zarr store statistics.
- Return type:
None