noether.data.zarr_store.convert
Convert a per-field .pt CFD dataset into the chunked/sharded Zarr store.
Works for any FileMap-based AeroDataset
(ShapeNet-Car, DrivAerML, AhmedML, DrivAerNet, …) from two kinds of sources:
- --root: a local/NFS dataset root. The dataset class itself supplies the FileMap, the per-split sample lists and the directory layout (via _load).
- --source-url: any fsspec location (oci://bucket@namespace/prefix, s3://bucket/prefix, a plain directory, …). Samples are discovered by listing the prefix and grouping .pt files by directory, then streamed without a local copy.
Samples are converted in a ProcessPoolExecutor so the
GIL-bound parts (torch.load deserialization, numpy shuffles) parallelize fully;
each worker process builds its own source handle and writer, and the manifest is
assembled in the parent from the returned entries (bit-identical to a sequential run —
per-sample shuffles are seeded by sample id).
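The determinism claim above can be illustrated with a minimal sketch. It is not the module's implementation: the helper names (sample_seed, shuffled_indices, convert_all) and the SHA-256 seed derivation are assumptions made for illustration, and the standard-library random module stands in for the numpy shuffles. The point it shows is that when each sample's seed depends only on its id and a base seed, the result does not depend on which worker handles the sample or in what order.

```python
import hashlib
import random
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def sample_seed(sample_id: str, base_seed: int = 0) -> int:
    """Derive a stable seed from the sample id (assumed scheme, for illustration).

    The built-in hash() is randomized per process for strings, so a
    cryptographic digest is used to get the same value in every worker.
    """
    digest = hashlib.sha256(f"{base_seed}:{sample_id}".encode()).digest()
    return int.from_bytes(digest[:8], "little")


def shuffled_indices(sample_id: str, num_points: int, base_seed: int = 0):
    """Return (sample_id, point permutation) using a per-sample RNG."""
    order = list(range(num_points))
    random.Random(sample_seed(sample_id, base_seed)).shuffle(order)
    return sample_id, order


def convert_all(sample_ids, num_points: int, max_workers: int = 1) -> dict:
    """Process samples sequentially or in worker processes; the parent collects the results."""
    work = partial(shuffled_indices, num_points=num_points)
    if max_workers <= 1:
        return dict(map(work, sample_ids))
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(work, sample_ids))
```

Calling convert_all with the same ids in a different order, or with a different max_workers, yields the same permutation for every sample id.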
Usage:
uv run python -m noether.data.zarr_store.convert --dataset-kind noether.data.datasets.cfd.DrivAerNetDataset --source-url oci://emmi-drivaernet@<namespace>/subsampled_volume10x --output /data/drivaernet/zarr_store --chunk-points 16384 --workers 16
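Attributes
logger |
FsspecTask |
Functions
build_dataset | Instantiate a dataset kind for one split (raw: no normalizers/wrappers).
filemap_for_dataset_kind | Resolve a dataset kind's FileMap without instantiating the dataset.
discover_fsspec_samples | List source_url and group its .pt files into per-sample conversion tasks.
convert_aero_dataset | Write every sample of a FileMap-based dataset into writer (does not save the manifest).
convert_fsspec_source | Stream-convert a remote (or local) .pt tree reachable through fsspec.
convert | Convert a FileMap dataset into a single Zarr store.
main |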
Module Contents
- noether.data.zarr_store.convert.logger
- noether.data.zarr_store.convert.FsspecTask
- noether.data.zarr_store.convert.build_dataset(kind, root, split)
Instantiate a dataset kind for one split (raw: no normalizers/wrappers).
- Parameters:
- Return type:
- noether.data.zarr_store.convert.filemap_for_dataset_kind(kind)
Resolve a dataset kind's FileMap without instantiating the dataset.
Looks for, in order: a FILEMAP class attribute, a filemap default in __init__ anywhere in the MRO, and finally a FileMap instance in the dataset's module globals.
- Parameters:
kind (str)
- Return type:
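The three-step lookup order described for filemap_for_dataset_kind can be sketched as follows. This is a hedged illustration, not the library's code: DemoFileMap is a hypothetical stand-in for noether.data.schemas.FileMap, resolve_filemap_sketch takes a class rather than a dotted path, and the two example dataset classes are invented.

```python
import inspect
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoFileMap:
    """Hypothetical stand-in for the real FileMap schema."""
    surface_pressure: str = "surface_pressure.pt"


def resolve_filemap_sketch(cls):
    # 1. A FILEMAP class attribute.
    candidate = getattr(cls, "FILEMAP", None)
    if isinstance(candidate, DemoFileMap):
        return candidate
    # 2. A default for the `filemap` parameter of __init__, anywhere in the MRO.
    for klass in cls.__mro__:
        init = klass.__dict__.get("__init__")
        if init is None:
            continue
        try:
            params = inspect.signature(init).parameters
        except (TypeError, ValueError):
            continue  # some builtins expose no introspectable signature
        param = params.get("filemap")
        if param is not None and isinstance(param.default, DemoFileMap):
            return param.default
    # 3. A FileMap instance among the globals of the module defining the class.
    module = sys.modules.get(cls.__module__)
    for value in list(vars(module).values()) if module else []:
        if isinstance(value, DemoFileMap):
            return value
    raise LookupError(f"no file map found for {cls.__name__}")


class WithClassAttribute:  # invented example: step 1 applies
    FILEMAP = DemoFileMap(surface_pressure="pressure.pt")


class WithInitDefault:  # invented example: step 2 applies
    def __init__(self, root, filemap=DemoFileMap(surface_pressure="p.pt")):
        self.root, self.filemap = root, filemap


class Child(WithInitDefault):  # inherits __init__, found via the MRO walk
    pass
```

Neither example class is instantiated during resolution, which is the property the docstring emphasizes.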
- noether.data.zarr_store.convert.discover_fsspec_samples(source_url, filemap, limit=None)
List source_url and group its .pt files into per-sample conversion tasks.
A sample is any directory (prefix) containing every field of filemap; incomplete samples are skipped with a warning. Sample ids are the directory paths relative to the prefix (e.g. "E_S_WWC_WM_005" or "param1/<hash>"), matching the ids the dataset-driven path produces.
- Parameters:
source_url (str)
filemap (noether.data.schemas.FileMap)
limit (int | None)
- Return type:
list[FsspecTask]
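The grouping rule described above can be sketched on a plain list of paths, independent of any filesystem. The function name group_pt_files, its dict return value, and the example file names are assumptions for illustration; the real function lists the source through fsspec and returns FsspecTask objects whose structure is not documented here.

```python
import logging
import posixpath
from collections import defaultdict

logger = logging.getLogger(__name__)


def group_pt_files(paths, prefix, required_files, limit=None):
    """Group .pt paths by directory and keep directories that hold every required file.

    Returns {sample_id: {filename: full_path}}, where sample_id is the
    directory path relative to `prefix`.
    """
    prefix = prefix.rstrip("/")
    files_by_dir = defaultdict(dict)
    for path in paths:
        directory, filename = posixpath.split(path)
        inside = directory == prefix or directory.startswith(prefix + "/")
        if inside and filename.endswith(".pt"):
            files_by_dir[directory][filename] = path

    tasks = {}
    for directory in sorted(files_by_dir):
        if limit is not None and len(tasks) >= limit:
            break
        found = files_by_dir[directory]
        missing = sorted(set(required_files) - set(found))
        if missing:
            logger.warning("Skipping incomplete sample %s (missing %s)", directory, missing)
            continue
        sample_id = directory[len(prefix):].lstrip("/")
        tasks[sample_id] = {name: found[name] for name in required_files}
    return tasks
```

With a listing that contains complete directories E_S_WWC_WM_005 and param1/abc plus one directory missing a field, only the two complete ones are returned, keyed by their relative paths.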
- noether.data.zarr_store.convert.convert_aero_dataset(dataset, writer, *, max_workers=1, limit=None)
Write every sample of a FileMap-based dataset into writer (does not save the manifest).
- Parameters:
dataset (noether.data.datasets.cfd.dataset.AeroDataset) – A FileMap-based AeroDataset instance (one split).
writer (noether.data.zarr_store.writer.ZarrStoreWriter) – Target store writer; reused across splits to build one store.
max_workers (int) – Worker processes converting samples concurrently; the dataset and a writer clone are set up once per process.
limit (int | None) – If set, convert at most this many samples (debugging).
- Returns:
The same writer (call ZarrStoreWriter.save_manifest() once when done).
- Raises:
ValueError – If the dataset does not expose a filemap.
- Return type:
- noether.data.zarr_store.convert.convert_fsspec_source(source_url, filemap, writer, *, max_workers=1, limit=None)
Stream-convert a remote (or local) .pt tree reachable through fsspec.
Samples are discovered once in the parent by listing source_url; each worker process opens its own filesystem handle and reads sample files directly from the source (no local staging copy).
- Parameters:
source_url (str) – fsspec location of the .pt tree (oci://, s3://, a path, …).
filemap (noether.data.schemas.FileMap) – Field-to-filename mapping of the dataset (see filemap_for_dataset_kind()).
writer (noether.data.zarr_store.writer.ZarrStoreWriter) – Target store writer.
max_workers (int) – Worker processes converting samples concurrently.
limit (int | None) – If set, convert at most this many samples (debugging/benchmarks).
- Returns:
The same writer (call ZarrStoreWriter.save_manifest() once when done).
- Raises:
RuntimeError – If no complete samples are found under source_url.
- Return type:
- noether.data.zarr_store.convert.convert(dataset_kind, store_root, *, root=None, source_url=None, splits=None, dataset_name=None, chunk_points=4096, shard_points=None, shuffle_seed=0, values_dtype='float16', field_dtypes=None, limit=None, max_workers=1)
Convert a FileMap dataset into a single Zarr store.
Exactly one of root (local dataset-driven, per split) or source_url (fsspec streaming, all samples under the prefix) must be given.
- Parameters:
dataset_kind (str) – Fully-qualified dataset class path (supplies the FileMap).
store_root (str) – Output Zarr store: a local path or fsspec URL (s3://, gs://, …).
root (str | None) – Local dataset root (dataset-driven source).
source_url (str | None) – fsspec source prefix (streaming source; splits is ignored).
splits (list[str] | None) – Splits to convert for the dataset-driven source (default train/test/val); unsupported or empty splits are skipped with a warning.
dataset_name (str | None) – Manifest name (defaults to the dataset class name).
chunk_points (int) – Chunk size along the point axis (choose a value close to the training subsample size).
shard_points (int | None) – Cap on the shard size along the point axis (None = one shard per array).
shuffle_seed (int) – Base seed for the per-sample point shuffle.
values_dtype (str) – Dtype for the physical field arrays.
field_dtypes (dict[str, str] | None) – Per-field dtype overrides, e.g. {"volume_vorticity": "float32"} for fields whose values exceed the float16 range.
limit (int | None) – If set, convert at most this many samples per split / source.
max_workers (int) – Worker processes converting samples concurrently.
- Returns:
The writer with the saved manifest.
- Raises:
ValueError – If not exactly one of root / source_url is given.
RuntimeError – If nothing could be converted.
- Return type:
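The field_dtypes parameter exists because float16 cannot represent magnitudes above 65504, the largest finite IEEE 754 half-precision value. As a hedged sketch of how one might build such an override mapping, the helper below (float16_overrides, a hypothetical name that is not part of this module) takes per-field maximum absolute values, which are assumed to have been measured beforehand, and returns a wider dtype for every field that would overflow.

```python
FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def float16_overrides(field_max_abs, wide_dtype="float32"):
    """Return {field: wide_dtype} for fields whose magnitude exceeds the float16 range.

    `field_max_abs` maps field names to their largest absolute value; how
    those statistics are obtained is outside the scope of this sketch.
    """
    return {
        name: wide_dtype
        for name, max_abs in field_max_abs.items()
        if max_abs > FLOAT16_MAX
    }
```

The returned dictionary has the shape documented for field_dtypes, so it could be passed to convert() alongside the default values_dtype='float16'.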
- noether.data.zarr_store.convert.main()
- Return type:
None