Datasets

Object instantiation

Objects in the Noether Framework are instantiated from configs via a factory pattern. Each object's config contains a kind field, the class path of the class to instantiate. The remaining fields are passed to the constructor as the config object produced by Pydantic schema validation. For example, kind: noether.modeling.models.AeroTransformer indicates which model class to instantiate.
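
A minimal sketch of how such a kind-based factory could work, assuming the class path is resolvable with importlib (the real factory additionally validates the config against a Pydantic schema before instantiation):

```python
import importlib


def instantiate_from_config(config: dict):
    """Resolve the `kind` class path and construct the object.

    Simplified sketch: the real factory validates `config` against a
    Pydantic schema first (assumption about the framework internals).
    """
    module_path, class_name = config["kind"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), class_name)
    # Every field except `kind` becomes a constructor keyword argument.
    kwargs = {k: v for k, v in config.items() if k != "kind"}
    return cls(**kwargs)


# Example with a standard-library class path instead of a Noether one.
obj = instantiate_from_config({"kind": "collections.Counter"})
```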

The Dataset

The Dataset class serves as the bridge between raw (or preprocessed) data stored on disk and the multi-stage pipeline that transforms individual samples into batches for model training (discussed in The Multi-Stage Pipeline). It defines how to load and access individual data tensors for each sample.

The Dataset enables:

  • Loading individual data samples from disk

  • Providing tensor-level data access through modular methods

  • Applying per-tensor normalization and transformations

  • Supporting flexible data loading for different model requirements

This walkthrough uses the pre-implemented ShapeNetCarDataset from the Noether package.

Dataset class hierarchy:

torch.utils.data.Dataset (PyTorch base)
    └── noether.data.Dataset (Noether base with getitem_* pattern)
          └── noether.data.datasets.cfd.AeroDataset (CFD aerodynamics API)
                └── ShapeNetCarDataset (ShapeNet-Car implementation)

The AeroDataset provides a general API for CFD aerodynamics datasets (AhmedML, DrivAerML, DrivAerNet++, ShapeNet-Car, etc.), ensuring consistent interfaces across different aerodynamics datasets.

For a concise guide on building your own dataset, see How to Implement a Custom Dataset.

The getitem_* pattern: Modular data loading

Traditional PyTorch datasets use a single __getitem__ method to load all data for a sample. This approach has several limitations:

  • Becomes complex when different models need different inputs from the same dataset

  • Difficult to selectively load subsets of data

  • Hard to maintain when adding new data fields

  • Forces loading unused data for some experiments

The Noether Framework uses a modular getitem_* pattern where each data tensor has its own dedicated loading method. This enables:

  • Modularity: Each method loads one specific tensor

  • Flexibility: Selectively load only required tensors via configuration

  • Maintainability: Easy to add new data fields without modifying existing code

  • Clarity: Self-documenting through method names (e.g., getitem_surface_pressure)

Example implementation:

def _load(self, idx: int, filename: str) -> torch.Tensor:
    """
    Loads a tensor from a file within a specific sample directory.

    Args:
        idx: Index of the sample to load.
        filename: Name of the file to load from the sample directory.

    Returns:
        The loaded tensor.
    """
    # Use modulo to handle dataset repetitions
    idx = idx % len(self.uris)
    sample_uri = self.uris[idx] / filename
    return torch.load(sample_uri, weights_only=True)

def getitem_surface_position(self, idx: int) -> torch.Tensor:
    """Retrieves surface position coordinates (num_surface_points, 3)."""
    return self._load(idx=idx, filename="surface_points.pt")

def getitem_surface_pressure(self, idx: int) -> torch.Tensor:
    """Retrieves surface pressure values (num_surface_points, 1)."""
    return self._load(idx=idx, filename="surface_pressure.pt").unsqueeze(1)

Design pattern:

  • Helper methods (e.g., _load) keep code DRY and handle common operations

  • Descriptive names make it clear what each method loads

  • Consistent signature: All getitem_* methods take idx and return a tensor

  • Tensor-level operations: Shape transformations (e.g., unsqueeze) applied immediately
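
The dispatch behind this pattern can be sketched as follows. This is a hypothetical, simplified base class (the real noether.data.Dataset also handles exclusions, pipelines, and normalizers); plain lists stand in for tensors so the example is self-contained:

```python
class ModularDataset:
    """Sketch of a base class assembling a sample from getitem_* methods."""

    def __getitem__(self, idx):
        sample = {}
        # Discover every getitem_* method defined on the concrete subclass.
        for name in dir(self):
            if name.startswith("getitem_"):
                sample[name[len("getitem_"):]] = getattr(self, name)(idx)
        return sample


class ToyDataset(ModularDataset):
    def getitem_surface_position(self, idx):
        return [[0.0, 0.0, 0.0]]  # stand-in for an (N_surf, 3) tensor

    def getitem_surface_pressure(self, idx):
        return [[0.0]]  # stand-in for an (N_surf, 1) tensor


sample = ToyDataset()[0]
# sample has the keys "surface_position" and "surface_pressure"
```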

ShapeNet-Car dataset structure

The ShapeNet-Car dataset contains CFD simulation data for 889 car geometries, with each data point consisting of preprocessed PyTorch tensors stored on disk.

Note

To download and preprocess the data, see the ShapeNet-Car dataset README.

Available data tensors:

Each simulation provides the following fields through corresponding getitem_* methods:

| Tensor | Method | Shape | Description |
|---|---|---|---|
| Surface Position | getitem_surface_position | (N_surf, 3) | 3D coordinates of surface mesh points |
| Surface Pressure | getitem_surface_pressure | (N_surf, 1) | Pressure values at surface points |
| Surface Normals | getitem_surface_normals | (N_surf, 3) | Normal vectors at surface points |
| Volume Position | getitem_volume_position | (N_vol, 3) | 3D coordinates of volume mesh points |
| Volume Velocity | getitem_volume_velocity | (N_vol, 3) | Velocity vectors at volume points |
| Volume Normals | getitem_volume_normals | (N_vol, 3) | Normal vectors (pointing to nearest surface) |
| Volume SDF | getitem_volume_sdf | (N_vol, 1) | Signed distance field to nearest surface |

Dataset configuration

Datasets in Noether are instantiated by the DatasetFactory, which uses configuration files to create dataset instances with appropriate settings.

Basic dataset configuration structure:

The configs/datasets/shapenet_dataset.yaml file defines dataset configurations for different splits:

train:
  root: ${dataset_root}
  kind: ${dataset_kind}
  split: train
  pipeline: ${pipeline}
  dataset_normalizers: ${dataset_normalizers}
  excluded_properties: ${excluded_properties}
test:
  root: ${dataset_root}
  kind: ${dataset_kind}
  split: test
  pipeline: ${pipeline}
  dataset_normalizers: ${dataset_normalizers}
  excluded_properties: ${excluded_properties}
test_repeat:
  root: ${dataset_root}
  kind: ${dataset_kind}
  split: test
  pipeline: ${pipeline}
  dataset_normalizers: ${dataset_normalizers}
  excluded_properties: ${excluded_properties}
  dataset_wrappers:
    - kind: noether.data.base.wrappers.RepeatWrapper
      repetitions: 10

Configuration parameters:

  • root: Path to the dataset directory on disk

  • kind: Full class path to the dataset class (e.g., noether.data.datasets.cfd.ShapeNetCarDataset)

  • split: Data split identifier (train, test, val, etc.) used by the dataset to select appropriate samples

  • pipeline: Reference to the multi-stage pipeline configuration

  • dataset_normalizers: Reference to tensor normalization configurations

  • excluded_properties: List of getitem_* methods to skip during data loading

The test_repeat section demonstrates multiple dataset configurations for different evaluation scenarios.

Dataset wrappers:

The RepeatWrapper loops over the dataset multiple times (10x in this example) to reduce variance during evaluation. Other useful wrappers include:

  • SubsetWrapper: Select specific indices from the dataset

  • ShuffleWrapper: Randomize sample order
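
A repeat wrapper of this kind can be sketched in a few lines. This is the assumed behavior of noether.data.base.wrappers.RepeatWrapper, not its actual implementation:

```python
class RepeatWrapper:
    """Sketch of a dataset wrapper that repeats the wrapped dataset."""

    def __init__(self, dataset, repetitions: int):
        self.dataset = dataset
        self.repetitions = repetitions

    def __len__(self):
        # The wrapper, not the dataset, reports the inflated length.
        return len(self.dataset) * self.repetitions

    def __getitem__(self, idx):
        # Map the repeated index back into the wrapped dataset's range.
        return self.dataset[idx % len(self.dataset)]


# A plain list stands in for a dataset here.
wrapped = RepeatWrapper(["a", "b", "c"], repetitions=10)
```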

This flexibility allows you to:

  • Use different pipelines for train vs. test datasets

  • Create multiple evaluation sets with different sampling strategies

  • Apply different normalizations to different splits

Selective data loading with excluded_properties

By default, all getitem_* methods are called when loading a sample. However, different models often require different input tensors. The excluded_properties configuration allows selective loading:

# Example: Exclude normal vectors for a model that doesn't use them
excluded_properties:
  - surface_normals
  - volume_normals

A point-based Transformer might only need positions, surface pressure, and volume velocity:

# Load only essential tensors
excluded_properties:
  - surface_normals
  - volume_normals
  - volume_sdf

A more complex model can use all available features:

# Load everything
excluded_properties: []

This pattern enables using the same dataset class for different models without modifying code.
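
The filtering itself amounts to stripping the getitem_ prefix and checking membership. A small illustrative helper (not framework API) showing the assumed mechanics:

```python
def active_properties(getitem_methods, excluded_properties):
    """Keep only the getitem_* methods whose property name is not excluded."""
    excluded = set(excluded_properties)
    return [
        name for name in getitem_methods
        if name[len("getitem_"):] not in excluded
    ]


methods = [
    "getitem_surface_position",
    "getitem_surface_pressure",
    "getitem_surface_normals",
]
# Excluding "surface_normals" leaves positions and pressure.
remaining = active_properties(methods, ["surface_normals"])
```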

Essential dataset methods

Beyond the getitem_* methods, dataset classes implement standard PyTorch dataset methods:

__len__ method:

Defines the total number of samples for one epoch:

def __len__(self) -> int:
    """Returns the total size of the dataset."""
    return len(self.uris)

Note that this base length does not itself account for repetitions: wrappers such as RepeatWrapper inflate the reported length (useful for oversampling small datasets during training), and the modulo in _load keeps the repeated indices valid.

Most other methods follow standard PyTorch Dataset patterns. If you’re unfamiliar with PyTorch datasets, review the official PyTorch dataset tutorial.

Tensor normalization with decorators

In the Noether Framework, most of the normalization happens at the tensor level immediately after loading, using a decorator pattern for clean, declarative code.

The @with_normalizers decorator:

Apply normalization to any getitem_* method by adding a decorator:

@with_normalizers
def getitem_surface_position(self, idx: int) -> torch.Tensor:
    """Retrieves surface positions (num_surface_points, 3)"""
    return self._load(idx=idx, filename=self.filemap.surface_position)

By default, the decorator infers the normalizer key from the method name (stripping the getitem_ prefix). You can pass an explicit key when the normalizer name differs from the method name — for example, @with_normalizers("volume_sdf") on the getitem_surface_sdf method to reuse the volume SDF normalizer.

How it works:

  1. The decorator identifies which normalizer(s) to apply using the key (derived from the method name, or explicitly provided)

  2. Looks up the normalizer configuration in the dataset’s dataset_normalizers config

  3. Applies the normalization transformation to the loaded tensor

  4. Returns the normalized tensor
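
The steps above can be sketched as a decorator that supports both bare and keyed usage. This is assumed behavior, not the Noether implementation; dataset_normalizers is modeled as a plain dict of callables and lists stand in for tensors:

```python
import functools


def with_normalizers(arg=None):
    """Sketch of a key-inferring normalizer decorator.

    Works bare (@with_normalizers) or with an explicit key
    (@with_normalizers("volume_sdf")).
    """

    def decorate(method, key=None):
        # Infer the normalizer key from the method name if not given.
        key = key or method.__name__[len("getitem_"):]

        @functools.wraps(method)
        def wrapper(self, idx):
            tensor = method(self, idx)
            normalizer = self.dataset_normalizers.get(key)
            return normalizer(tensor) if normalizer else tensor

        return wrapper

    if callable(arg):  # used bare, without parentheses
        return decorate(arg)
    return lambda method: decorate(method, key=arg)


class ToyDataset:
    # Maps normalizer keys to callables (lists stand in for tensors).
    dataset_normalizers = {"surface_pressure": lambda t: [x / 2 for x in t]}

    @with_normalizers
    def getitem_surface_pressure(self, idx):
        return [2.0, 4.0]


normalized = ToyDataset().getitem_surface_pressure(0)  # [1.0, 2.0]
```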

Configuring normalizers:

All normalizers are defined in noether.data.preprocessors.normalizers. The primary normalizer is FieldNormalizer, which supports different strategies (mean_std, position, etc.):

surface_pressure:
  kind: noether.data.preprocessors.normalizers.FieldNormalizer
  strategy: mean_std
volume_velocity:
  kind: noether.data.preprocessors.normalizers.FieldNormalizer
  strategy: mean_std
volume_sdf:
  kind: noether.data.preprocessors.normalizers.FieldNormalizer
  strategy: mean_std
surface_position:
  kind: noether.data.preprocessors.normalizers.FieldNormalizer
  strategy: position
  stat_keys:
    min: raw_pos_min
    max: raw_pos_max
  scale: 1000
volume_position:
  kind: noether.data.preprocessors.normalizers.FieldNormalizer
  strategy: position
  stat_keys:
    min: raw_pos_min
    max: raw_pos_max
  scale: 1000

Each key maps to a normalizer configuration. The strategy field selects the normalization method — for example, mean_std for standard mean/std normalization, or position for coordinate normalization with min/max scaling. All normalizers must be invertible so that data can be denormalized for evaluation.
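
The invertibility requirement for the mean_std strategy can be illustrated with a toy pair of transforms. This is a sketch, not the FieldNormalizer API; plain floats stand in for tensors:

```python
class MeanStdNormalizer:
    """Illustrative invertible mean/std normalization."""

    def __init__(self, mean: float, std: float):
        self.mean = mean
        self.std = std

    def normalize(self, values):
        return [(v - self.mean) / self.std for v in values]

    def denormalize(self, values):
        # Exact inverse, so predictions can be evaluated in physical units.
        return [v * self.std + self.mean for v in values]


norm = MeanStdNormalizer(mean=100.0, std=10.0)
normalized = norm.normalize([90.0, 110.0])   # [-1.0, 1.0]
restored = norm.denormalize(normalized)      # [90.0, 110.0]
```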

Computing dataset statistics

To use normalizers with the mean_std strategy, you need to compute statistics from your training data.

Step 1: Compute statistics

Run the statistics calculation tool:

noether-dataset-stats \
  --dataset_kind=noether.data.datasets.cfd.ShapeNetCarDataset \
  --root=/path/to/shapenet_car/ \
  --split=train \
  --exclude_attributes=volume_velocity,volume_pressure,volume_vorticity,surface_normals,surface_friction

Parameters explained:

  • --dataset_kind: Full class path to your dataset

  • --root: Path to dataset directory

  • --split: Which split to compute statistics from (typically train)

  • --exclude_attributes: Properties to skip (either unavailable or not used)

Note

We exclude certain properties because they’re not available in ShapeNet-Car, even though the general AeroDataset interface defines getitem_* methods for them.

The statistics need to be manually added to a YAML file in configs/dataset_statistics/.
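
Conceptually, the tool aggregates per-field statistics over all training samples. A toy sketch of that computation (the actual keys and layout of the generated YAML are assumptions; check the tool's output):

```python
import math


def compute_field_stats(samples):
    """Compute mean and std of one field across a list of samples,
    where each sample is a list of scalar values."""
    n = sum(len(s) for s in samples)
    mean = sum(v for s in samples for v in s) / n
    var = sum((v - mean) ** 2 for s in samples for v in s) / n
    return {"mean": mean, "std": math.sqrt(var)}


# Toy "surface_pressure" values from two training samples.
stats = compute_field_stats([[1.0, 3.0], [5.0, 7.0]])
```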

Noether dataset zoo

The Noether Framework includes pre-implemented datasets for CFD aerodynamics. For a complete listing, see Noether Dataset Zoo.

| Dataset | Class Path | Data Processing README |
|---|---|---|
| ShapeNet-Car | ShapeNetCarDataset | ShapeNet-Car README |
| AhmedML | AhmedMLDataset | AhmedML README |
| DrivAerML | DrivAerMLDataset | DrivAerML README |
| DrivAerNet++ | DrivAerNetDataset | DrivAerNet++ README |
| Wing Dataset | EmmiWingDataset | Wing README |

All datasets share the AeroDataset interface, ensuring consistent access patterns and easy switching between datasets.

Creating custom datasets

To implement a custom dataset:

  1. Inherit from noether.data.Dataset (or noether.data.datasets.cfd.AeroDataset)

  2. Implement required getitem_* methods for your data fields

  3. Override __init__ to discover and filter your data samples

  4. Add @with_normalizers decorators where normalization is needed

  5. Create a corresponding Pydantic schema in your schemas/datasets/ directory

  6. Configure the normalizers
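
The steps above can be sketched as a minimal skeleton. The class and file names here are hypothetical, and the real base class is noether.data.Dataset (with @with_normalizers coming from the Noether package):

```python
import pathlib


class MyCustomDataset:
    """Skeleton of a custom dataset following the steps above."""

    def __init__(self, root: str, split: str):
        # Step 3: discover and filter the samples for this split
        # (assumes one subdirectory per sample under root/<split>/).
        self.uris = sorted(pathlib.Path(root).glob(f"{split}/*"))

    def __len__(self):
        return len(self.uris)

    # Step 2: one getitem_* method per data field; add the
    # @with_normalizers decorator here where needed (step 4).
    def getitem_surface_position(self, idx):
        ...  # e.g. torch.load(self.uris[idx] / "surface_points.pt")
```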

See the scaffold template (src/noether/scaffold/template_files/datasets/) for a minimal dataset implementation example, or run noether-init to generate a ready-to-use project (see Scaffolding a New Project). For a step-by-step guide, see How to Implement a Custom Dataset.