How to Implement a Custom Dataset

Below we provide a minimal (dummy code) example of how to create a custom dataset by extending the base Dataset class. Every tensor that belongs to a data sample must have its own getitem_* method with a unique suffix. By default, every getitem_* method is called when fetching a data sample, unless excluded in the configuration file via excluded_properties. To apply data normalization, decorate the relevant getitem_* method with @with_normalizers; the key passed to the decorator must match the key of the configured normalizer.

from noether.data import Dataset, with_normalizers
from noether.core.schemas.dataset import DatasetBaseConfig
import torch
import os

class MyCustomDatasetConfig(DatasetBaseConfig):
    kind: str = "path.to.MyCustomDataset"
    # Add any custom configuration fields here
    data_paths: dict[int, str]

class MyCustomDataset(Dataset):
    def __init__(self, config: MyCustomDatasetConfig):
        super().__init__(config)
        self.data_paths = config.data_paths
        self.root = config.root

    def __len__(self):
        # Return the length of your dataset
        return len(self.data_paths)

    @with_normalizers("tensor_x")
    def getitem_tensor_x(self, idx: int) -> torch.Tensor:
        # Load and return the input tensor for this sample
        return torch.load(os.path.join(self.root, self.data_paths[idx]), weights_only=True)

    @with_normalizers("tensor_y")
    def getitem_tensor_y(self, idx: int) -> torch.Tensor:
        # Load and return the target tensor for this sample
        # (this dummy code reuses the same file as tensor_x)
        return torch.load(os.path.join(self.root, self.data_paths[idx]), weights_only=True)

The dataset is then registered in the configuration file, for example:

datasets:
    custom_dataset:
        kind: path.to.MyCustomDataset
        root: /path/to/data
        data_paths:
            0: sample_0.pt
            1: sample_1.pt
            2: sample_2.pt
            # Add more data paths as needed
        excluded_properties: []  # Optionally exclude certain getitem_* methods

Finally, the normalizers referenced by the @with_normalizers decorators are configured under matching keys (tensor_x, tensor_y):

tensor_x:
    - kind: noether.data.preprocessors.normalizers.MeanStdNormalization
      mean: 0.0
      std: 1.0
tensor_y:
    - kind: noether.data.preprocessors.normalizers.MeanStdNormalization
      mean: 1.0
      std: 2.0
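
For intuition, the decorator mechanism can be sketched roughly as follows. This is only a conceptual sketch, not the actual noether implementation; in particular, the self.normalizers mapping and the normalize() method are assumptions:

import functools
import torch

def with_normalizers(key: str):
    # Conceptual sketch: after the wrapped getitem_* method returns
    # a tensor, apply every normalizer configured under `key`, in order.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, idx: int) -> torch.Tensor:
            tensor = fn(self, idx)
            # Assumption: the base Dataset exposes the configured
            # normalizers as a mapping from key to a list of objects
            # with a normalize() method.
            for normalizer in self.normalizers.get(key, []):
                tensor = normalizer.normalize(tensor)
            return tensor
        return wrapper
    return decorator

For MeanStdNormalization, normalize() presumably applies the standard (x - mean) / std transform, so with the configuration above tensor_y would be transformed as (y - 1.0) / 2.0.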

Let’s say we run a training pipeline with the above dataset configuration and a batch size of 4. Each batch is then a list of 4 dictionaries, one per sample, with the following structure:

[
    {
        "tensor_x": <tensor_x_sample_0>,
        "tensor_y": <tensor_y_sample_0>,
    },
    # ... (3 more samples)
]
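
For completeness, here is a hypothetical end-to-end usage sketch. It assumes the config object can be constructed directly in Python and that the dataset works with torch.utils.data.DataLoader; both are assumptions, since noether pipelines are normally driven by the configuration file:

from torch.utils.data import DataLoader

# Hypothetical direct construction; in practice the configuration
# file above is parsed into this config object by the pipeline.
config = MyCustomDatasetConfig(
    root="/path/to/data",
    data_paths={0: "sample_0.pt", 1: "sample_1.pt", 2: "sample_2.pt"},
)
dataset = MyCustomDataset(config)

# A single sample is a dict with one entry per getitem_* method.
sample = dataset[0]
print(sample["tensor_x"].shape, sample["tensor_y"].shape)

# collate_fn=list keeps each batch as a list of sample dicts,
# matching the structure shown above.
loader = DataLoader(dataset, batch_size=4, collate_fn=list)
batch = next(iter(loader))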