noether.data.zarr_store.reader¶
Chunk-sampling reader for the Zarr CFD store.
The reader turns “draw T random points from a sample” into “read a few random
chunks”. Because points were shuffled once at write time, any chunk is already a
uniform-random subset, so the reader only fetches ceil(T / chunk_points) chunks
per array instead of the whole sample. Reading a chunk from a sharded array is a
byte-range request, so the bytes transferred scale with T and not with the
sample size.
The per-sample chunk reads are independent, so they can be issued concurrently. Set
read_concurrency > 1 to fetch a sample’s chunks (across both arrays and both
domains) in a thread pool — this hides per-request latency on high-latency stores
(e.g. S3) and is a no-op cost on fast local stores, where the default of 1 keeps
reads serial.
Classes¶
Reads chunk-subsampled (or full) per-field tensors from a converted store. |
Module Contents¶
- class noether.data.zarr_store.reader.ZarrChunkReader(store_root, manifest=None, store_factory=_default_store_factory, read_concurrency=1)¶
Reads chunk-subsampled (or full) per-field tensors from a converted store.
- Parameters:
store_root (str | pathlib.Path) – Root of the converted Zarr store — a local path or an fsspec URL (
s3://,gs://,memory://, …) for object storage.manifest (noether.data.zarr_store.manifest.StoreManifest | None) – Pre-loaded manifest; loaded from
store_rootif omitted.store_factory (collections.abc.Callable[[str], zarr.abc.store.Store]) – Builds a Zarr store from a sample-group path/URL. Defaults to a read-only LocalStore/FsspecStore by path; override to wrap the store (e.g. for byte-accounting in benchmarks).
read_concurrency (int) – Max threads used to fetch a sample’s chunks in parallel.
1(default) reads serially — best for fast local stores. Use a value around the number of chunks per sample to hide latency on S3.
- store_root = ''¶
- manifest¶
- read_concurrency¶
- read_sample(sample_id, num_points=None, generator=None, fields=None)¶
Read (optionally chunk-subsampled) fields for one sample.
With
read_concurrency == 1each array is read with a single orthogonal selection (lowest overhead, best on local stores). Withread_concurrency > 1all chunk reads of the sample are issued concurrently in a thread pool (best latency hiding on S3). Both paths return identical results.- Parameters:
sample_id (str) – Sample id present in the manifest.
num_points (dict[str, int | None] | None) – Per-domain target counts, e.g.
{"surface": 3586, "volume": 4096}. ANonevalue (or missing domain) reads the full domain.generator (torch.Generator | None) – Torch RNG for chunk selection. Pass a seeded generator for deterministic evaluation;
Nonedraws a fresh subset each call.fields (set[str] | None) – Optional subset of canonical fields to read. Because every field is its own array, unrequested fields cost no I/O.
Nonereads all fields.
- Returns:
Mapping
canonical_field -> tensorwith scalar fields shaped(T,)and vector fields shaped(T, dim), all float32 and point-aligned per domain.- Return type:
- read_coords(sample_id, domain, num_points, generator=None)¶
Read an independent chunk-subsample of a domain’s point positions only.
This is a separate random draw from
read_sample()(it consumes the same generator, so it is uncorrelated with the field draw), reading just the domain’s position array. It backs the AB-UPTgeometry_positioninput — a random draw of the surface points distinct from the surface anchor points.- Parameters:
sample_id (str) – Sample id present in the manifest.
domain (str) – Domain whose positions to read (e.g.
"surface").num_points (int | None) – Number of points to sample (
Nonereads all positions).generator (torch.Generator | None) – Torch RNG for chunk selection.
- Returns:
(num_points, position_dim)float32 positions tensor.- Return type: