noether.core.writers

Classes

CheckpointWriter

Class to easily write checkpoints in a structured way to the disk.

LogWriter

Writes logs into a local file and (optionally) to an online web interface (identified via the tracker).

Package Contents

class noether.core.writers.CheckpointWriter(path_provider, update_counter)

Class to easily write checkpoints in a structured way to the disk.

Each model is stored in its own file, and weights and optimizer state are additionally written to separate files. This allows model states to be stored flexibly without producing files that are never needed after training. For example, resuming a run requires both the model weights and the optimizer states. However, storing optimizer states for every checkpoint is expensive, since optimizer states are commonly 2x as large as the weights alone.
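The storage argument can be made concrete with a rough calculation. The sketch below assumes an Adam-style optimizer that keeps two extra buffers per parameter (a common source of the 2x figure) and 4-byte float32 values. The helper name and the model size are illustrative and not part of noether.

```python
def checkpoint_bytes(num_params: int, with_optim: bool, bytes_per_value: int = 4,
                     optim_buffers_per_param: int = 2) -> int:
    """Rough size of one checkpoint (hypothetical helper, not part of noether)."""
    weights = num_params * bytes_per_value
    # Assumption: the optimizer keeps a fixed number of buffers per parameter.
    optim = weights * optim_buffers_per_param if with_optim else 0
    return weights + optim


num_params = 100_000_000  # illustrative model size
num_checkpoints = 10      # illustrative number of stored checkpoints

# Weights only for every checkpoint.
weights_only = num_checkpoints * checkpoint_bytes(num_params, with_optim=False)
# Weights and optimizer state for every checkpoint.
everything = num_checkpoints * checkpoint_bytes(num_params, with_optim=True)
# Weights for every checkpoint, optimizer state only for the latest one.
optim_size = checkpoint_bytes(num_params, True) - checkpoint_bytes(num_params, False)
mixed = weights_only + optim_size

for label, size in [("weights only", weights_only), ("everything", everything), ("mixed", mixed)]:
    print(f"{label}: {size / 1e9:.1f} GB")
```

Under these assumptions, keeping the optimizer state only for the latest checkpoint costs a small fraction of what storing it for every checkpoint would.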

To illustrate the flexibility, consider training an autoencoder model where the goal is a good encoder that is then used for another task. This model is implemented via a class Autoencoder that inherits from CompositeModel and contains two submodels, an encoder and a decoder (both of which inherit from Model). During training, we want to store the following files to the disk:

  • The encoder weights after every 10 epochs to evaluate performance at various training lengths.

  • The latest weights and optimizer states of encoder and decoder to allow resuming a run if it crashes.

The CheckpointWriter provides functionality to store the following files:

  • ab_upt_cp=E10_… model.th: encoder weights after 10 epochs

  • ab_upt_cp=E20_… model.th: encoder weights after 20 epochs

  • ab_upt_cp=E30_… model.th: encoder weights after 30 epochs

  • ab_upt_cp=last_model.th: latest encoder weights

  • ab_upt_cp=last_optim.th: latest encoder optimizer state

Each model checkpoint is populated with metadata. Each checkpoint is a dictionary containing the keys:

  • “state_dict”: Weights of the model.

  • “model_config”: The model configuration used to instantiate the model, as a serialized dict of the pydantic model config.

  • “checkpoint_tag”: The name of the checkpoint, e.g., E10_U200_S800 for a progress-based checkpoint or “latest” for a string-based checkpoint.

  • “training_iteration”: Detailed information about the training iteration as a dict with keys ‘epoch’, ‘update’, and ‘sample’. For example, the “latest” tag alone does not reveal which epoch the checkpoint is from, so the “training_iteration” field of that checkpoint contains “E13_U…_S…”.

  • “run_id”: The ID of the run from which it was created.
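A minimal sketch of what such a checkpoint dictionary might look like, and how the documented keys could be checked after loading. All values, the run ID, and the helper name `missing_checkpoint_keys` are invented for illustration; in practice the dictionary would come from loading a `.th` file.

```python
from typing import Any

# Keys listed in the CheckpointWriter documentation.
EXPECTED_KEYS = ("state_dict", "model_config", "checkpoint_tag", "training_iteration", "run_id")


def missing_checkpoint_keys(checkpoint: dict[str, Any]) -> list[str]:
    """Return the documented keys that are absent (hypothetical helper)."""
    return [key for key in EXPECTED_KEYS if key not in checkpoint]


# Illustrative stand-in for a loaded checkpoint; every value here is made up.
example_checkpoint = {
    "state_dict": {"layer.weight": [0.1, 0.2]},
    "model_config": {"kind": "encoder"},
    "checkpoint_tag": "latest",
    "training_iteration": {"epoch": 13, "update": 260, "sample": 1040},
    "run_id": "example-run",
}

print(missing_checkpoint_keys(example_checkpoint))
```

A check like this can be useful before resuming a run, since a "latest" checkpoint is only interpretable together with its "training_iteration" entry.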

Parameters:
  • logger

  • path_provider

  • update_counter

save_model_checkpoint(model_name, checkpoint_tag, state_dict, model_config=None, model_info=None, **extra)

Save a checkpoint to disk.

The output name of the checkpoint is constructed as {model_name}{model_info}_cp={checkpoint}_model.th (where model_info is an optional string that is empty if not provided). For example, if model_name is “autoencoder.encoder” and checkpoint is “E10_U200_S800”, the output name will be “autoencoder.encoder_cp=E10_U200_S800_model.th”. If the stored model differs from the one currently being trained, for example an EMA model, additional info can be included in the output name. For example, if model_name is “autoencoder.encoder”, model_info is “ema”, and checkpoint is “E10_U200_S800”, the output name will be “autoencoder.encoder_ema_cp=E10_U200_S800_model.th”.
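The naming scheme can be sketched as a small standalone function. This is not the library's code: `checkpoint_filename` is a hypothetical helper, and it assumes that model_info is joined with a leading underscore, which is what the “ema” example suggests.

```python
from typing import Optional


def checkpoint_filename(model_name: str, checkpoint_tag: str, model_info: Optional[str] = None) -> str:
    """Build a checkpoint file name following the documented pattern (hypothetical helper)."""
    # Assumption: model_info is joined with a leading underscore, as in the "ema" example.
    info = f"_{model_info}" if model_info else ""
    return f"{model_name}{info}_cp={checkpoint_tag}_model.th"


print(checkpoint_filename("autoencoder.encoder", "E10_U200_S800"))
print(checkpoint_filename("autoencoder.encoder", "E10_U200_S800", model_info="ema"))
```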

Parameters:
  • model_name (str) – Name of the model.

  • checkpoint_tag (str) – Checkpoint tag, for example “latest” or “E10_U200_S800”.

  • state_dict (dict[str, Any]) – Model state dict to save.

  • model_config (noether.core.schemas.models.ModelBaseConfig | None) – Model configuration. Defaults to None.

  • model_info (str | None) – Additional info to include in the output name. Defaults to None.

  • **extra

Raises:

RuntimeError – in case of an unexpected error while parsing model_config.

Return type:

None

save(model, checkpoint_tag, trainer=None, save_weights=True, save_optim=True, save_latest_weights=False, save_latest_optim=False, model_names_to_save=None, save_frozen_weights=True, model_info=None)

Saves a model to the disk.

Parameters:
  • model (noether.core.models.ModelBase) – Model to save.

  • checkpoint_tag (str) – Checkpoint tag, for example “latest” or “E10_U200_S800”.

  • trainer (noether.training.trainers.BaseTrainer | None) – If defined, also stores the state_dict of the trainer (and callbacks).

  • save_weights (bool) – If true, stores model weights.

  • save_optim (bool) – If true, stores optimizer states.

  • save_latest_weights (bool) – If true, also stores the weights with the checkpoint identifier “latest”. This file will be repeatedly overwritten throughout a training procedure to save storage.

  • save_latest_optim (bool) – If true, also stores the optimizer states with the checkpoint identifier “latest”. This file will be repeatedly overwritten throughout a training procedure to save storage.

  • model_names_to_save (list[str] | None) – If defined, only store some of the submodels of a CompositeModel.

  • save_frozen_weights (bool) – If true, also stores the weights of frozen models.

  • model_info (str | None)

Return type:

None
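As a usage sketch, the autoencoder scenario from the class description could map onto two save calls: one for the periodic encoder weights and one for the resumable “latest” state. Only keyword arguments from the signature above are used. Addressing the submodel as "encoder" is an assumption, and RecordingWriter is a stand-in so the example runs without noether; a real CheckpointWriter instance would take its place.

```python
def save_autoencoder_checkpoints(writer, model, checkpoint_tag: str) -> None:
    """Issue the two save calls for the autoencoder scenario (illustrative sketch)."""
    # Periodic checkpoint: encoder weights only, no optimizer state.
    # Assumption: the submodel is addressed by the name "encoder".
    writer.save(
        model=model,
        checkpoint_tag=checkpoint_tag,
        save_weights=True,
        save_optim=False,
        model_names_to_save=["encoder"],
    )
    # Resumable state: only the repeatedly overwritten "latest" files.
    writer.save(
        model=model,
        checkpoint_tag=checkpoint_tag,
        save_weights=False,
        save_optim=False,
        save_latest_weights=True,
        save_latest_optim=True,
    )


class RecordingWriter:
    """Stand-in that records calls so the sketch runs without noether."""

    def __init__(self) -> None:
        self.calls = []

    def save(self, model, checkpoint_tag, **kwargs) -> None:
        self.calls.append((checkpoint_tag, kwargs))


writer = RecordingWriter()
save_autoencoder_checkpoints(writer, model=object(), checkpoint_tag="E10_U200_S800")
for tag, kwargs in writer.calls:
    print(tag, kwargs)
```

Splitting the work into two calls keeps the expensive optimizer state out of the periodic checkpoints while still leaving one resumable copy on disk.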

class noether.core.writers.LogWriter(path_provider, update_counter, tracker)

Writes logs into a local file and (optionally) to an online web interface (identified via the tracker). All logs to be written for a given training update are cached and written to the tracker all at once. For writing to disk, the logs of the full training process are cached and only written once training has finished, because writing repeatedly to disk takes a long time.
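The cache-then-write pattern described above can be illustrated with a toy class. This is a sketch of the idea only, not the noether implementation: the tracker is omitted, the class name ToyLogWriter is invented, and writing the entries as JSON is an assumption about the on-disk format.

```python
import json
import os
import tempfile
from typing import Any, Optional


class ToyLogWriter:
    """Toy illustration of the cache-then-write pattern; not the noether implementation."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.log_entries: list[dict[str, Any]] = []
        self.log_cache: Optional[dict[str, Any]] = None

    def add_scalar(self, key: str, value: float) -> None:
        # Values are only cached here; nothing is written yet.
        if self.log_cache is None:
            self.log_cache = {}
        self.log_cache[key] = float(value)

    def flush(self) -> None:
        # Called once per update: turn the cache into one log entry.
        if self.log_cache is None:
            return  # no metrics were logged for this update
        self.log_entries.append(self.log_cache)
        self.log_cache = None

    def finish(self) -> None:
        # Single disk write at the end of training (JSON is an assumed format).
        with open(self.path, "w") as f:
            json.dump(self.log_entries, f)


log_path = os.path.join(tempfile.mkdtemp(), "log_entries.json")
toy = ToyLogWriter(log_path)
toy.add_scalar("loss/train", 0.9)
toy.add_scalar("lr", 0.001)
toy.flush()  # one entry containing both metrics
toy.flush()  # nothing cached, so this does nothing
toy.add_scalar("loss/train", 0.7)
toy.flush()
toy.finish()
print(len(toy.log_entries))
```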

Parameters:
  • logger

  • path_provider

  • update_counter

  • tracker

log_entries: list[dict[str, Any]] = []
log_cache: dict[str, Any] | None = None
non_scalar_keys: set[str]
get_all_metric_values(key)

Retrieves all values of a metric from all log entries. Used mainly for integration tests.

Parameters:

key (str) – Identifier of the metric.

Returns:

The values of the metric over the course of training. Log entries that do not contain the key are skipped.

Return type:

list[float]
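The skip-if-missing behaviour can be sketched on a plain list of log entries. The function below is a hypothetical standalone equivalent for illustration, not the method itself, and the metric names are made up.

```python
from typing import Any


def all_metric_values(log_entries: list[dict[str, Any]], key: str) -> list[float]:
    """Collect a metric across entries, skipping entries without it (hypothetical helper)."""
    return [entry[key] for entry in log_entries if key in entry]


entries = [
    {"loss/train": 0.9, "lr": 0.001},
    {"loss/train": 0.7},
    {"accuracy/val": 0.5},  # no "loss/train" here, so this entry is skipped
]
print(all_metric_values(entries, "loss/train"))  # [0.9, 0.7]
```

In an integration test, the returned list could then be compared against expected values, for example to assert that a loss decreases.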

finish()

Stores all logs from the training to the disk.

Return type:

None

flush()

Composes a log entry with all metrics that were calculated for the current training state. This method is called after every update; if no metrics were logged, it does nothing.

Return type:

None

add_scalar(key, value, logger=None, format_str=None)

Adds a scalar value to the log.

Parameters:
  • key – Metric identifier.

  • value – Scalar tensor or float with the value that should be logged.

  • logger – If defined, will log the value to stdout.

  • format_str – If defined, will alter the log to stdout to be in the provided format.

Return type:

None

add_nonscalar(key, value)

Adds a non-scalar value to the log.

Parameters:
  • key (str) – Metric identifier.

  • value (Any) – Non-scalar value that should be logged (e.g., wandb.Image, wandb.Histogram, …).

Return type:

None