noether.training.trainers

Classes

BaseTrainer

Base class for all trainers that use SGD-based optimizers.

TrainerResult

Result dataclass returned by a trainer's train_step.

Package Contents

class noether.training.trainers.BaseTrainer(config, data_container, device, tracker, path_provider, main_sampler_kwargs=None, metric_property_provider=None)

Base class for all trainers that use SGD-based optimizers.

This class implements the main training loop and provides utility functions for logging, checkpointing, and callbacks. In your downstream trainer you must implement the loss_compute method, which calculates the loss from the model output and the targets. Optionally, you can override the train_step method to implement a custom training step (e.g., for multi-loss training or custom backward logic). If you only need a custom loss calculation and want to keep the rest of the training loop, overriding loss_compute alone is sufficient. For example:

class MyTrainer(BaseTrainer):
    def __init__(self, trainer_config: BaseTrainerConfig, **kwargs):
        super().__init__(trainer_config, **kwargs)

    def loss_compute(
        self, forward_output: dict[str, torch.Tensor], targets: dict[str, torch.Tensor]
    ) -> LossResult:
        # compute the loss based on the model output and the targets
        loss = ...  # e.g. a weighted sum of sub-losses, returned as a LossResult
        return loss
Attributes:
logger
config
data_container
path_provider
main_sampler_kwargs = None
device: torch.device
end_checkpoint
precision
updates_per_epoch
skip_nan_loss_counter = 0
initializer: noether.core.initializers.InitializerBase | None = None
tracker
metric_property_provider = None
update_counter
log_writer
checkpoint_writer
callbacks: list[noether.core.callbacks.CallbackBase] = []
forward_properties
target_properties
batch_keys
get_user_callbacks(model, evaluation=False)

Get the user-configured callbacks.

Parameters:

  • model (noether.core.models.ModelBase)

  • evaluation (bool)

Return type:

list[noether.core.callbacks.CallbackBase]

get_all_callbacks(model)

Get all callbacks including default/trainer callbacks.

Parameters:

model (noether.core.models.ModelBase)

Return type:

list[noether.core.callbacks.CallbackBase]

get_trainer_callbacks(callback_default_args)

Get trainer-specific callbacks. This may optionally be overridden by derived classes.

Parameters:

callback_default_args (dict[str, Any])

Return type:

list[noether.core.callbacks.CallbackBase]

get_default_callback_intervals()

Get default intervals at which callbacks are called.

Return type:

dict[str, Any]

get_default_callbacks(default_kwargs)
Parameters:

default_kwargs (dict[str, Any])

Return type:

list[noether.core.callbacks.CallbackBase]

state_dict()

Get the state dict of the trainer.

Return type:

dict[str, Any]

load_state_dict(state_dict)

Load the state dict of the trainer.

Parameters:

state_dict (dict[str, Any])

Return type:

None
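For illustration, the save/restore contract between these two methods can be sketched in pure Python; TinyTrainer and its attributes below are hypothetical stand-ins, not the actual noether implementation:

```python
from typing import Any


class TinyTrainer:
    """Minimal stand-in illustrating the state_dict/load_state_dict contract."""

    def __init__(self) -> None:
        self.epoch = 0
        self.skip_nan_loss_counter = 0

    def state_dict(self) -> dict[str, Any]:
        # Capture everything needed to resume training where it left off.
        return {
            "epoch": self.epoch,
            "skip_nan_loss_counter": self.skip_nan_loss_counter,
        }

    def load_state_dict(self, state_dict: dict[str, Any]) -> None:
        # Restore the captured state onto a freshly constructed trainer.
        self.epoch = state_dict["epoch"]
        self.skip_nan_loss_counter = state_dict["skip_nan_loss_counter"]


# Round trip: snapshot one trainer, restore into a fresh instance.
trainer = TinyTrainer()
trainer.epoch = 7
checkpoint = trainer.state_dict()

resumed = TinyTrainer()
resumed.load_state_dict(checkpoint)
```

In practice the returned dict would typically be persisted together with the model and optimizer state, e.g. via the checkpoint_writer.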

apply_resume_initializer(model)

Apply the resume initializer to the model.

Parameters:

model (noether.core.models.ModelBase)

Return type:

None

get_data_loader(iterator_callbacks, batch_size, evaluation=False)

Get the data loader for training.

Parameters:

  • iterator_callbacks

  • batch_size

  • evaluation

Return type:

torch.utils.data.DataLoader

abstractmethod loss_compute(forward_output, targets)

Each trainer that extends this class needs to implement a custom loss computation using the targets and the model output.

Parameters:
  • forward_output (dict[str, torch.Tensor]) – Output of the model after the forward pass.

  • targets (dict[str, torch.Tensor]) – Dict with target tensors needed to compute the loss for this trainer.

Returns:

A LossResult with the (weighted) sub-losses to log, or a tuple (losses, additional_outputs), where additional_outputs is a dict with additional information about the model forward pass that is passed to the track_after_accumulation_step method of the callbacks (e.g., the logits and targets used to compute a training accuracy in a callback).

Return type:

noether.training.trainers.types.LossResult | tuple[noether.training.trainers.types.LossResult, dict[str, torch.Tensor]]

Note: If a tuple is returned, the second element will be passed as additional_outputs in the TrainerResult returned by the train_step method.
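To make the tuple-returning form concrete, here is a hedged sketch in which plain floats and dicts stand in for torch.Tensor and LossResult, and the key names ("prediction", "mse", "total") are invented for illustration, not part of the noether API:

```python
def loss_compute(forward_output, targets):
    # Mean-squared error between prediction and target; plain floats
    # stand in for torch.Tensor so the sketch stays self-contained.
    preds = forward_output["prediction"]
    tgts = targets["prediction"]
    mse = sum((p - t) ** 2 for p, t in zip(preds, tgts)) / len(preds)

    losses = {"total": mse, "mse": mse}  # stand-in for a LossResult
    # Extra values handed to the callbacks' track_after_accumulation_step,
    # e.g. to compute a training metric outside the loss computation.
    additional_outputs = {"prediction": preds, "target": tgts}
    return losses, additional_outputs


losses, extra = loss_compute(
    {"prediction": [1.0, 2.0]}, {"prediction": [1.0, 4.0]}
)
```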

train_step(batch, model)

Overriding this function is optional. By default, the model's train_step is called and is expected to return a TrainerResult. Trainers can override this method to implement custom training logic.

Parameters:

  • batch

  • model

Returns:

TrainerResult dataclass with the loss for backpropagation, (optionally) individual losses if multiple losses are used, and (optionally) additional information about the model forward pass that is passed to the callbacks (e.g., the logits and targets to calculate a training accuracy in a callback).

Return type:

noether.training.trainers.types.TrainerResult
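As a sketch of a custom override for a two-loss setup: TrainerResult is reproduced below as a simplified stand-in with floats instead of torch.Tensor, and the loss names and the 0.1 weight are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TrainerResult:
    """Simplified stand-in for noether's TrainerResult dataclass."""

    total_loss: float
    losses_to_log: dict[str, float] | None = None
    additional_outputs: dict[str, float] | None = None


class MultiLossTrainer:
    """Hypothetical trainer combining two weighted losses in train_step."""

    def train_step(self, batch, model):
        forward_output = model(batch)  # forward pass
        recon = forward_output["recon_loss"]
        reg = forward_output["reg_loss"]
        # The weighted sum drives backpropagation; the parts are logged.
        total = recon + 0.1 * reg
        return TrainerResult(
            total_loss=total,
            losses_to_log={"recon": recon, "reg": reg},
        )


# A lambda stands in for a model whose forward pass returns both losses.
result = MultiLossTrainer().train_step(
    batch=None, model=lambda b: {"recon_loss": 1.0, "reg_loss": 2.0}
)
```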

wrap_model(model)

Wrap the model for training and return the DDP-wrapped and compiled model.

Parameters:

model (noether.core.models.ModelBase)

Return type:

torch.nn.Module

wrap_ddp(model)

Wrap the model with DistributedDataParallel in multi-GPU settings.

Parameters:

model (noether.core.models.ModelBase)

Return type:

noether.core.models.ModelBase | torch.nn.parallel.DistributedDataParallel

wrap_compile(ddp_model)

Wrap the model with torch.compile.

Parameters:

ddp_model (noether.core.models.ModelBase | torch.nn.parallel.DistributedDataParallel)

Return type:

torch.nn.Module

train(model)

Train the model.

Parameters:

model (noether.core.models.ModelBase)

Return type:

None

static drop_metadata(data)

update(batch, dist_model, model, accumulation_steps_total, accumulation_step, retain_graph=False)

Perform forward and backward pass.

Parameters:

  • batch

  • dist_model

  • model

  • accumulation_steps_total

  • accumulation_step

  • retain_graph

Return type:

tuple[dict[str, torch.Tensor], dict[str, torch.Tensor] | None, dict[str, noether.core.utils.common.stopwatch.Stopwatch]]

call_before_training(callbacks)

Hook that is called before training starts.

Parameters:

callbacks (list[noether.core.callbacks.CallbackBase])

Return type:

None

call_after_training(callbacks)

Hook that is called after training ends.

Parameters:

callbacks (list[noether.core.callbacks.CallbackBase])

Return type:

None

eval(model)

Run evaluation by executing all configured callbacks.

Parameters:

model (noether.core.models.ModelBase)

Return type:

None

property total_training_updates: int
Return type:

int

class noether.training.trainers.TrainerResult

Result of a train_step: the loss used for backpropagation, optional individual losses to log, and optional additional outputs passed to the callbacks.
total_loss: torch.Tensor
losses_to_log: dict[str, torch.Tensor] | None = None
additional_outputs: dict[str, torch.Tensor] | None = None