noether.data.base¶
Submodules¶
Classes¶

- Dataset: Noether dataset implementation, which is a wrapper around torch.utils.data.Dataset that can hold a dataset_config_provider.
- Subset: Wrapper around arbitrary noether.data.Dataset instances to only use a subset of the samples, similar to torch.utils.data.Subset.
- DatasetWrapper: Wrapper around arbitrary noether.data.Dataset instances to generically change something about the dataset.

Functions¶

- with_normalizers: Decorator to apply a normalizer to the output of a getitem_* function of the implemented Dataset class.
Package Contents¶
- class noether.data.base.Dataset(dataset_config)¶
Bases: torch.utils.data.Dataset

Noether dataset implementation, which is a wrapper around torch.utils.data.Dataset that can hold a dataset_config_provider. A dataset should map a key (i.e., an index) to its corresponding data. Each subclass should implement individual getitem_* methods, where * is the name of an item in the dataset. Each getitem_* method loads an individual tensor/data sample from disk. For example, if your dataset consists of images and targets/labels (stored as tensors), a getitem_image(idx) and a getitem_target(idx) method should be implemented in the dataset subclass. The __getitem__ method of this class loops over all the individual getitem_* methods implemented by the child class and returns their results. Optionally, it is possible to configure which getitem methods are called.
Example: Car aerodynamics dataset

    class CarAeroDynamicsDataset(Dataset):
        def __init__(self, dataset_config, dataset_normalizers, **kwargs):
            super().__init__(dataset_config=dataset_config, **kwargs)
            self.path = dataset_config.path

        def __len__(self):
            return 100  # Example length

        def getitem_surface_pressure(self, idx):
            # Load surface pressure tensor
            return torch.load(f"{self.path}/surface_pressure_tensor/{idx}.pt")

        def getitem_surface_geometry(self, idx):
            # Load surface geometry tensor
            return torch.load(f"{self.path}/surface_geometry_tensor/{idx}.pt")

    dataset = CarAeroDynamicsDataset("path/to/dataset")
    sample0 = dataset[0]
    surface_pressure_0 = sample0["surface_pressure"]
    surface_geometry_0 = sample0["surface_geometry"]
In many cases, the data returned by a getitem method should be normalized. To apply normalization, add the with_normalizers decorator to the getitem method. For example:
    @with_normalizers("surface_pressure")
    def getitem_surface_pressure(self, idx):
        # Load surface pressure tensor
        return torch.load(f"{self.path}/surface_pressure_tensor/{idx}.pt")
"surface_pressure" is the key in the self.normalizers dictionary; this key maps to a preprocessor that should implement the correct data normalization.
Example configuration for dataset normalizers:
    # dummy example configuration for a car aerodynamics dataset
    kind: noether.data.datasets.CarAeroDynamicsDataset
    pipeline: # configure the data pipeline to collate individual samples into batches
    dataset_normalizers:
      surface_pressure:
        - kind: noether.data.preprocessors.normalizers.MeanStdNormalization
          mean: [1., 2., 3.]
          std: [0.1, 0.2, 0.3]
- Parameters:
dataset_config (noether.core.schemas.dataset.DatasetBaseConfig) – Configuration for the dataset. See DatasetBaseConfig for available options, including dataset normalizers.
- logger¶
- config¶
- normalizers: dict[str, noether.data.preprocessors.ComposePreProcess]¶
- compute_statistics = False¶
- fetch_statistics()¶
Load and cache dataset statistics from the dataset’s STATS_FILE.
By default, looks for a STATS_FILE class attribute on the dataset class (or its ancestors). The file should be a YAML file mapping stat names to scalar or list values.
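For illustration, a stats file of the kind fetch_statistics() expects might look as follows; the file name and the stat keys are hypothetical, not part of the noether API:

```yaml
# contents of a hypothetical STATS_FILE (e.g. stats.yaml);
# stat names map to scalar or list values
num_samples: 100
surface_pressure_mean: [1., 2., 3.]
surface_pressure_std: [0.1, 0.2, 0.3]
```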
- property pipeline: noether.data.pipeline.Collator | None¶
Returns the pipeline for the dataset.
- Return type:
noether.data.pipeline.Collator | None
- pre_getitem(idx)¶
Optional hook called once before the individual getitem_* methods.
Override this to load shared data (e.g. an HDF5 file that contains multiple fields) and return it as a dictionary. The returned dict is forwarded as keyword arguments to every getitem_* call for the same sample, so each getter can pull its field without re-opening the file.
The default implementation returns an empty dict.
- post_getitem(idx, pre)¶
Optional hook called once after all getitem_* methods have run.
Override this to perform per-sample cleanup (e.g. closing a file handle that was opened in pre_getitem()). The pre argument is the value originally returned by pre_getitem(), so that the cleanup logic can access the same resources.
The default implementation does nothing.
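A minimal, self-contained sketch of this hook contract (the HDF5 file is stood in for by a plain Python object, and the dispatch performed by the real Dataset.__getitem__ is imitated by hand; all names here are illustrative):

```python
class FakeH5File:
    """Stand-in for an HDF5 file holding several fields per sample."""
    def __init__(self, idx):
        self.closed = False
        self.fields = {"surface_pressure": idx * 1.0, "surface_geometry": idx * 2.0}

    def close(self):
        self.closed = True

class SharedFileDataset:
    def pre_getitem(self, idx):
        # Open the shared resource once per sample ...
        f = FakeH5File(idx)
        # ... and hand it to every getitem_* call as a keyword argument.
        return {"h5": f}

    def getitem_surface_pressure(self, idx, h5):
        return h5.fields["surface_pressure"]

    def getitem_surface_geometry(self, idx, h5):
        return h5.fields["surface_geometry"]

    def post_getitem(self, idx, pre):
        # `pre` is exactly the dict returned by pre_getitem, so the
        # resource opened there can be released here.
        pre["h5"].close()

    def __getitem__(self, idx):
        # Simplified imitation of the dispatch described above.
        pre = self.pre_getitem(idx)
        sample = {
            "surface_pressure": self.getitem_surface_pressure(idx, **pre),
            "surface_geometry": self.getitem_surface_geometry(idx, **pre),
        }
        self.post_getitem(idx, pre)
        return sample

ds = SharedFileDataset()
print(ds[3])  # {'surface_pressure': 3.0, 'surface_geometry': 6.0}
```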
- get_all_getitem_names()¶
Returns all names of getitem functions that are implemented. E.g., an image classification dataset has getitem_x and getitem_class, so the result will be ["x", "class"].
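The lookup can be imitated with plain reflection; this is a hypothetical re-implementation for illustration, not the actual noether code:

```python
# Hypothetical re-implementation of the described name lookup.
class ImageClassificationDataset:
    def getitem_x(self, idx):
        ...

    def getitem_class(self, idx):
        ...

def get_all_getitem_names(dataset):
    # Collect every implemented getitem_* method and strip the prefix.
    prefix = "getitem_"
    return [name[len(prefix):] for name in dir(dataset) if name.startswith(prefix)]

print(get_all_getitem_names(ImageClassificationDataset()))  # ['class', 'x']
```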
- denormalize(key, data)¶
Denormalize data using the appropriate normalizer.
This method finds the specific normalizer for the given key and uses it to denormalize, instead of calling pipeline.denormalize which would process the entire pipeline.
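A sketch of that per-key behavior, with a hand-rolled mean/std normalizer standing in for the registered preprocessor (names and signatures here are illustrative, not the noether implementation):

```python
class MeanStdNormalizer:
    """Toy normalizer with an inverse, mimicking a registered preprocessor."""
    def __init__(self, mean, std):
        self.mean, self.std = mean, std

    def __call__(self, x):
        return (x - self.mean) / self.std

    def denormalize(self, x):
        return x * self.std + self.mean

normalizers = {"surface_pressure": MeanStdNormalizer(mean=2.0, std=0.5)}

def denormalize(key, data):
    # Look up only the normalizer registered for `key`, instead of
    # running the whole pipeline in reverse.
    return normalizers[key].denormalize(data)

normalized = normalizers["surface_pressure"](3.0)   # (3.0 - 2.0) / 0.5 = 2.0
print(denormalize("surface_pressure", normalized))  # 3.0
```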
- noether.data.base.with_normalizers(_func_or_key=None)¶
Decorator to apply a normalizer to the output of a getitem_* function of the implemented Dataset class.
This decorator will look for a normalizer registered under the specified key and apply it to the output of the decorated function. If no key is provided, the key is automatically inferred from the function name by removing the ‘getitem_’ prefix.
Example usage:
    # Inferred key: "surface_pressure"
    @with_normalizers
    def getitem_surface_pressure(self, idx):
        return torch.load(f"{self.path}/surface_pressure/{idx}.pt")

    # Explicit key: "pressure"
    @with_normalizers("pressure")
    def getitem_surface_pressure(self, idx):
        return torch.load(f"{self.path}/surface_pressure/{idx}.pt")
- Parameters:
_func_or_key (str | Any | None) – The normalizer key (str) or the function being decorated. If used as @with_normalizers (no arguments), this will be the decorated function. If used as @with_normalizers("key"), this will be the string key.
- Returns:
The decorated function with normalization applied.
- Raises:
ValueError – If the normalizer key cannot be resolved from the function name.
AttributeError – If the class instance does not have a ‘normalizers’ attribute.
KeyError – If the requested normalizer key is not found in the ‘normalizers’ dictionary.
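A hypothetical re-implementation of these semantics, to make the two calling conventions and the error behavior concrete (the real noether.data.base.with_normalizers may differ in detail):

```python
import functools

def with_normalizers(_func_or_key=None):
    def make_wrapper(func, key):
        @functools.wraps(func)
        def wrapper(self, idx, **kwargs):
            # KeyError if `key` is missing, AttributeError if the instance
            # has no `normalizers` attribute -- as documented above.
            normalizer = self.normalizers[key]
            return normalizer(func(self, idx, **kwargs))
        return wrapper

    if callable(_func_or_key):
        # Used as @with_normalizers: infer the key from the function name.
        name = _func_or_key.__name__
        if not name.startswith("getitem_"):
            raise ValueError(f"cannot infer normalizer key from {name!r}")
        return make_wrapper(_func_or_key, name[len("getitem_"):])
    # Used as @with_normalizers("key"): _func_or_key is the explicit key.
    return lambda func: make_wrapper(func, _func_or_key)

class Demo:
    normalizers = {"surface_pressure": lambda x: x / 10.0}

    @with_normalizers
    def getitem_surface_pressure(self, idx):
        return float(idx)

print(Demo().getitem_surface_pressure(5))  # 0.5
```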
- class noether.data.base.Subset(dataset, indices)¶
Bases: noether.data.base.wrapper.DatasetWrapper

Wrapper around arbitrary noether.data.Dataset instances to only use a subset of the samples, similar to torch.utils.data.Subset, but with support for individual getitem_* methods instead of the __getitem__ method.
Example:
    from noether.data import Subset, Dataset

    len(dataset)  # 10
    subset = Subset(dataset=dataset, indices=[0, 2, 5, 7])
    len(subset)  # 4
    subset[3]  # returns dataset[7]
Initializes the Subset wrapper.
- Parameters:
dataset (noether.data.base.dataset.Dataset) – The base dataset to be wrapped
indices (collections.abc.Sequence[int] | numpy.typing.NDArray[numpy.integer]) – valid indices of the wrapped dataset (list, tuple, or numpy array)
- indices¶
- class noether.data.base.DatasetWrapper(dataset)¶
Wrapper around arbitrary noether.data.Dataset instances to generically change something about the dataset. For example:
- Create a subset of the dataset (noether.data.Subset)
- Define which properties/items to load from the dataset, i.e., which getitem_* methods to call (noether.data.ModeWrapper)
What exactly is changed depends on the specific implementation of the DatasetWrapper child class.
- Parameters:
dataset (noether.data.base.dataset.Dataset | DatasetWrapper) – base dataset to be wrapped
- dataset¶
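As a sketch of the wrapper pattern, a hypothetical DatasetWrapper subclass that repeats each sample n times could look like this (the base class here is a simplified stand-in for noether's, and RepeatWrapper is not a real noether class):

```python
class DatasetWrapper:
    """Simplified stand-in for noether.data.base.DatasetWrapper."""
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        return self.dataset[idx]

class RepeatWrapper(DatasetWrapper):
    """Present every sample of the wrapped dataset n times in a row."""
    def __init__(self, dataset, n):
        super().__init__(dataset)
        self.n = n

    def __len__(self):
        return len(self.dataset) * self.n

    def __getitem__(self, idx):
        # Map the enlarged index space back onto the base dataset.
        return self.dataset[idx // self.n]

base = list(range(3))  # stand-in for a noether Dataset
wrapped = RepeatWrapper(base, n=2)
print(len(wrapped))                    # 6
print([wrapped[i] for i in range(6)])  # [0, 0, 1, 1, 2, 2]
```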