noether.data.zarr_store.writer
Writer that converts per-sample CFD tensors into a sharded, pre-shuffled Zarr store.
Each sample becomes an independent Zarr group (<store_root>/<sample_id>.zarr)
holding one array per field (surface/position, volume/velocity, …), so
fields can be read independently. Points are shuffled once at write time with a
deterministic, per-sample seed so that any contiguous chunk is already a uniform-random
subset of the sample — this lets the dataloader turn “sample N random points” into
“read a random chunk”. All arrays of a domain share the permutation and chunk grid, so
chunk c is point-aligned across fields.
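The write-time shuffle described above can be sketched with NumPy. This is a minimal illustration, not the writer's actual code: the helper `sample_permutation`, the way the seed is derived from `shuffle_seed` and the sample id, and the field arrays are all assumptions made for the example.

```python
import hashlib

import numpy as np


def sample_permutation(sample_id: str, n_points: int, shuffle_seed: int = 0) -> np.ndarray:
    """Deterministic permutation for one sample (the seed derivation is an assumption)."""
    # Hash the base seed and sample id into a stable 64-bit seed. Python's
    # built-in hash() is salted per process, so it is avoided here.
    digest = hashlib.sha256(f"{shuffle_seed}:{sample_id}".encode()).digest()
    seed = int.from_bytes(digest[:8], "little")
    return np.random.default_rng(seed).permutation(n_points)


n_points, chunk_points = 10_000, 4096
position = np.random.default_rng(1).normal(size=(n_points, 3)).astype("float32")
pressure = position.sum(axis=1, keepdims=True).astype("float32")

# One permutation per sample, applied to every field of the domain.
perm = sample_permutation("run_0001", n_points)
position_shuffled = position[perm]
pressure_shuffled = pressure[perm]

# Reading chunk c is a contiguous slice, yet it is a random subset of the
# sample, and its rows still line up across fields.
c = 1
rows = slice(c * chunk_points, (c + 1) * chunk_points)
chunk_position = position_shuffled[rows]
chunk_pressure = pressure_shuffled[rows]
```

Because the permutation depends only on the seed and the sample id, rewriting the same sample reproduces the same point order.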
Arrays are chunked along the point axis (chunk_points) with the channel axis left
unchunked. By default each array is packed into a single whole-array shard, compressed
per chunk with blosc+zstd, so the per-sample object count stays at one object per field.
Classes
ZarrStoreWriter | Convert CFD samples into the chunked/sharded Zarr format and track the manifest.
Module Contents
- class noether.data.zarr_store.writer.ZarrStoreWriter(store_root, filemap, dataset_name, shuffle_seed=0, chunk_points=4096, shard_points=None, coords_dtype='float32', values_dtype='float16', field_dtypes=None, compression_level=5)
Convert CFD samples into the chunked/sharded Zarr format and track the manifest.
- Parameters:
store_root (str | pathlib.Path) – Output location for the Zarr store. A local path or an fsspec URL (s3://, gs://, memory://, …) for object storage.
filemap (noether.data.schemas.FileMap) – Field-to-filename mapping describing which fields exist.
dataset_name (str) – Human-readable dataset name recorded in the manifest.
shuffle_seed (int) – Base seed for the per-sample point shuffle.
chunk_points (int) – Chunk size along the point axis. Pick close to the training subsample size to minimise read amplification.
shard_points (int | None) – Cap on the shard size along the point axis (rounded down to a whole number of chunks, minimum one chunk). None (default) packs each array into a single whole-array shard. Set this when per-field arrays grow large: shard bytes ≈ shard_points × dim × dtype_size, so e.g. a ~128 MB cap on a float32×3 position array is shard_points ≈ 11_000_000. Smaller shards bound the writer’s per-shard RAM and the blast radius of a corrupt object, at the cost of more objects per array.
coords_dtype (str) – Dtype for the positions array (keep float32).
values_dtype (str) – Dtype for the physical fields array (float16 halves bytes).
field_dtypes (dict[str, str] | None) – Per-field dtype overrides keyed by canonical name, e.g. {"volume_vorticity": "float32"} for fields whose values exceed the values_dtype range (float16 caps at ~6.6e4); overflowing casts are rejected at write time rather than silently stored as inf.
compression_level (int) – blosc/zstd compression level.
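The sizing rule for shard_points and the overflow check mentioned for field_dtypes can be worked through in a few lines. This is a sketch under stated assumptions: `shard_points_for_cap` and `checked_cast` are hypothetical helpers written for this example, not part of the writer's API, and the writer's own overflow check may be implemented differently.

```python
import numpy as np


def shard_points_for_cap(cap_bytes: int, dim: int, dtype: str, chunk_points: int) -> int:
    """Largest shard_points under cap_bytes, as a whole number of chunks."""
    bytes_per_point = dim * np.dtype(dtype).itemsize
    raw_points = cap_bytes // bytes_per_point
    # Round down to a multiple of chunk_points, but never below one chunk.
    return max(raw_points // chunk_points, 1) * chunk_points


def checked_cast(values: np.ndarray, dtype: str) -> np.ndarray:
    """Cast to dtype, raising instead of storing inf when values overflow."""
    target = np.dtype(dtype)
    if np.issubdtype(target, np.floating):
        limit = np.finfo(target).max  # 65504 for float16
        finite = values[np.isfinite(values)]
        # Conservative: anything above the largest finite value is rejected,
        # including values that would still round down to it.
        if finite.size and np.abs(finite).max() > limit:
            raise ValueError(f"values exceed the {target.name} range (max {limit})")
    return values.astype(target)


# A 128 MiB cap on a float32 x 3 position array with 4096-point chunks.
shard_points = shard_points_for_cap(128 * 1024**2, dim=3, dtype="float32", chunk_points=4096)

# Small values fit in float16; a field reaching 1e5 would need a float32 override.
pressure_f16 = checked_cast(np.array([1.0, 250.5, -3.0e4]), "float16")
```

A field that fails this kind of check is a candidate for a field_dtypes override rather than a global change to values_dtype.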
- store_root = ''
- filemap
- chunk_points = 4096
- shard_points = None
- coords_dtype = 'float32'
- values_dtype = 'float16'
- field_dtypes = None
- compression_level = 5
- layouts
- manifest
- write_group(sample_id, field_arrays)
Write one sample’s Zarr group and return its manifest entry (no manifest mutation).
Each sample is written to its own store, so this method is safe to call concurrently from multiple threads; the caller is responsible for recording the returned entry in the manifest.
- Parameters:
- Returns:
The SampleEntry describing the written group.
- Raises:
ValueError – If a domain’s fields disagree on point count.
- Return type:
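The division of labour described for write_group (threads write groups, the caller records entries) can be sketched as follows. `StubWriter` is a hypothetical stand-in defined only so the example runs on its own. Its dict-shaped manifest and entries, the `make_sample` helper, and the `<domain>_<field>` naming used to group fields are assumptions, not the real ZarrStoreWriter internals.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


class StubWriter:
    """Hypothetical stand-in that mimics the documented write_group contract."""

    def __init__(self) -> None:
        self.manifest: dict = {}  # assumed shape: sample_id -> entry

    def write_group(self, sample_id: str, field_arrays: dict) -> dict:
        # Group point counts by domain; "<domain>_<field>" naming is assumed.
        counts: dict = {}
        for name, array in field_arrays.items():
            counts.setdefault(name.split("_", 1)[0], set()).add(array.shape[0])
        for domain, seen in counts.items():
            if len(seen) > 1:
                raise ValueError(f"{sample_id}: {domain} fields disagree on point count")
        # A real writer would write the Zarr group here. The manifest is not touched.
        return {"sample_id": sample_id,
                "n_points": {d: next(iter(s)) for d, s in counts.items()}}


def make_sample(n_surface: int, n_volume: int) -> dict:
    """Build placeholder field arrays for one sample."""
    return {
        "surface_position": np.zeros((n_surface, 3), dtype="float32"),
        "surface_pressure": np.zeros((n_surface, 1), dtype="float32"),
        "volume_position": np.zeros((n_volume, 3), dtype="float32"),
        "volume_velocity": np.zeros((n_volume, 3), dtype="float32"),
    }


samples = {f"run_{i:04d}": make_sample(100 + i, 1000 + i) for i in range(8)}
writer = StubWriter()

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {sid: pool.submit(writer.write_group, sid, arrays)
               for sid, arrays in samples.items()}
    # Only the calling thread mutates the manifest.
    for sid, future in futures.items():
        writer.manifest[sid] = future.result()
```

Keeping manifest updates in the calling thread means no lock is needed around the manifest, which is the point of write_group returning the entry instead of recording it.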
- write_sample(sample_id, field_arrays)
Write one sample and record it in the manifest (sequential convenience).
- to_init_kwargs()
Constructor kwargs to rebuild an identical writer (e.g. in a worker process).
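The rebuild-in-a-worker pattern can be illustrated with a stand-in class. `StubWriter` and `worker_init` below are hypothetical and carry only a few of the real constructor arguments. The sketch assumes the returned kwargs are plain picklable values, which is what passing them to a worker process would require.

```python
import pickle


class StubWriter:
    """Hypothetical stand-in with a subset of the documented constructor arguments."""

    def __init__(self, store_root, dataset_name, shuffle_seed=0, chunk_points=4096):
        self.store_root = store_root
        self.dataset_name = dataset_name
        self.shuffle_seed = shuffle_seed
        self.chunk_points = chunk_points

    def to_init_kwargs(self) -> dict:
        # Plain values only, so the dict can cross a process boundary.
        return {
            "store_root": self.store_root,
            "dataset_name": self.dataset_name,
            "shuffle_seed": self.shuffle_seed,
            "chunk_points": self.chunk_points,
        }


def worker_init(payload: bytes) -> StubWriter:
    """What a worker could run on start-up: rebuild its own writer."""
    return StubWriter(**pickle.loads(payload))


parent = StubWriter("memory://store", "demo", shuffle_seed=7)
payload = pickle.dumps(parent.to_init_kwargs())
rebuilt = worker_init(payload)
```

Shipping the kwargs rather than the writer object avoids pickling open store handles or other process-local state.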