noether.core.schemas.trainers

Attributes

Classes

CheckpointConfig

BaseTrainerConfig

Internal base class for all registry-based configs.

WeightedLossTrainerConfig

Config for a generic trainer that computes weighted loss per output field.

Module Contents

class noether.core.schemas.trainers.CheckpointConfig(/, **data)

Bases: pydantic.BaseModel

Parameters:

data (Any)

epoch: int | None = None
update: int | None = None
sample: int | None = None
noether.core.schemas.trainers.TCallbackConfig

Type variable for callback configs, bound to noether.core.schemas.callbacks.CallBackBaseConfig. Parameterizes BaseTrainerConfig.
class noether.core.schemas.trainers.BaseTrainerConfig[TCallbackConfig: noether.core.schemas.callbacks.CallBackBaseConfig](/, **data)

Bases: noether.core.schemas.lib._RegistryBase

Internal base class for all registry-based configs.

Provides auto-registration via __init_subclass__. Not meant to be used directly - use specific config base classes instead.

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

data (Any)

kind: str
max_epochs: int | None = None

The maximum number of epochs to train for. Mutually exclusive with max_updates and max_samples. If set to 0, training will be skipped and all callbacks will be invoked once (useful for evaluation-only runs).

max_updates: int | None = None

The maximum number of updates to train for. Mutually exclusive with max_epochs and max_samples. If set to 0, training will be skipped and all callbacks will be invoked once (useful for evaluation-only runs).

max_samples: int | None = None

The maximum number of samples to train for. Mutually exclusive with max_epochs and max_updates. If set to 0, training will be skipped and all callbacks will be invoked once (useful for evaluation-only runs).
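For example, an evaluation-only run can be configured by setting one criterion to zero. This is a hedged sketch using the WeightedLossTrainerConfig documented later in this module; the field values (and whether effective_batch_size is required here) are illustrative, not prescribed:

```python
from noether.core.schemas.trainers import WeightedLossTrainerConfig

# Evaluation-only run: training is skipped, all callbacks are invoked once.
# Illustrative values; "kind" must name a registered trainer.
config = WeightedLossTrainerConfig(
    kind="noether.training.trainers.WeightedLossTrainer",
    field_weights={"surface_pressure": 1.0},
    loss_fn="l1",
    max_epochs=0,           # 0 => skip training, run callbacks once
    effective_batch_size=8,
    precision="bf16",
)
```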

start_at_epoch: int | None = None

The epoch to start training at. This means that the trainer will skip all epochs before this epoch. Learning rate and other schedulers will be stepped accordingly. Useful for resuming training from a specific epoch.

add_default_callbacks: bool | None = None

Whether to add default callbacks. Default callbacks log things like simple dataset statistics or the current value of the learning rate if it is scheduled.

add_trainer_callbacks: bool | None = None

Whether to add trainer specific callbacks (e.g., a callback to log the training accuracy for a classification task).

effective_batch_size: int = None

The effective batch size ("global batch size") used for optimization: the number of samples processed before each optimizer step. In multi-GPU setups, the per-device ("local") batch size is effective_batch_size divided by the number of devices. If gradient accumulation is used, the forward-pass batch size is further derived by dividing by the number of gradient accumulation steps.

precision: Literal['float32', 'fp32', 'float16', 'fp16', 'bfloat16', 'bf16'] = None

The precision to use for training (e.g., “float32”). Mixed precision training (e.g., “float16” or “bfloat16”) can be used to speed up training and reduce memory usage on supported hardware (e.g., NVIDIA GPUs).

callbacks: list[Annotated[TCallbackConfig, Discriminated(CallBackBaseConfig)]] | None = None

The callbacks to use for training.

initializer: noether.core.schemas.initializers.InitializerConfig | None = None

The initializer to use for training. Mainly used for resuming training via ResumeInitializer.

log_every_n_epochs: int | None = None

The interval, in epochs, at which to log.

log_every_n_updates: int | None = None

The interval, in updates, at which to log.

log_every_n_samples: int | None = None

The interval, in samples, at which to log.

track_every_n_epochs: int | None = None

The interval, in epochs, at which to track metrics.

track_every_n_updates: int | None = None

The interval, in updates, at which to track metrics.

track_every_n_samples: int | None = None

The interval, in samples, at which to track metrics.

max_batch_size: int | None = None

The maximum batch size to use for model forward pass in training. If the effective_batch_size is larger than max_batch_size, gradient accumulation will be used to simulate the larger batch size. For example, if effective_batch_size=8 and max_batch_size=2, 4 gradient accumulation steps will be taken before each optimizer step.

skip_nan_loss: bool = None

Whether to skip NaN losses. These can occasionally occur by unlucky coincidence. If true, NaN losses are skipped without terminating training, until 100 consecutive NaN losses have occurred.

skip_nan_loss_max_count: int = None

The maximum number of consecutive NaN losses to skip before training is terminated.

disable_gradient_accumulation: bool = None

Whether to disable gradient accumulation. Gradient accumulation is sometimes used to simulate larger batch sizes, but can lead to worse generalization.

save_on_sigint: bool = None

Whether to save a checkpoint on SIGINT (Ctrl+C). SIGTERM always triggers a checkpoint save. When False (default), Ctrl+C will stop training immediately without saving.

use_torch_compile: bool = None

Whether to use torch.compile to compile the model for faster training.

find_unused_params: bool = None

Sets the find_unused_parameters flag of DistributedDataParallel.

static_graph: bool = None

Sets the static_graph flag of DistributedDataParallel.

forward_properties: list[str] | None = []

Keys of the input batch dict whose values are passed as inputs to the model during the forward pass.

target_properties: list[str] | None = []

Keys of the input batch dict whose values are used as targets during the forward pass.

model_config

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

dataloader_prefetch_factor: int | None = None

The prefetch_factor to use for the training dataloader. This controls how many batches are prefetched by each worker process in the dataloader. Increasing this can speed up training if data loading is a bottleneck, but also increases memory usage.

validate_callback_frequency()

Ensures that exactly one frequency (‘every_n_*’) is specified and that ‘batch_size’ is present if ‘every_n_samples’ is used.

Return type:

BaseTrainerConfig

validate_max_training_criteria()

Ensures that exactly one of max_epochs, max_updates, or max_samples is specified.

Return type:

BaseTrainerConfig

class noether.core.schemas.trainers.WeightedLossTrainerConfig(/, **data)

Bases: BaseTrainerConfig

Config for a generic trainer that computes weighted loss per output field.

field_weights maps output field names to their loss weights. Keys must match model output dict keys. Target keys in the batch are expected to follow the <field_name>_target convention.

Built-in loss example:

WeightedLossTrainerConfig(
    kind="noether.training.trainers.WeightedLossTrainer",
    field_weights={"surface_pressure": 1.0, "volume_velocity": 1.0},
    loss_fn="l1",
)

Custom loss function from a downstream project:

WeightedLossTrainerConfig(
    kind="noether.training.trainers.WeightedLossTrainer",
    field_weights={"surface_pressure": 1.0},
    loss_fn="my_project.losses.weighted_huber",
)

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

data (Any)

field_weights: dict[str, float] = None

Mapping from output field names to their loss weights (see the class description).

loss_fn: str = None

The loss function to use: a built-in name (e.g., "l1") or a dotted import path to a custom loss function.
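The field_weights and <field_name>_target conventions can be illustrated with a minimal pure-Python sketch. This is not the trainer's code; it only demonstrates how weighted per-field L1 loss would combine model outputs with targets looked up under the naming convention:

```python
def weighted_l1_loss(outputs: dict[str, list[float]],
                     batch: dict[str, list[float]],
                     field_weights: dict[str, float]) -> float:
    """Weighted mean-absolute-error over output fields.

    Targets are looked up under the "<field_name>_target" convention.
    Illustrative sketch, not the library's trainer implementation.
    """
    total = 0.0
    for field, weight in field_weights.items():
        pred = outputs[field]
        # target key follows the <field_name>_target convention
        target = batch[f"{field}_target"]
        l1 = sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
        total += weight * l1
    return total
```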