noether.core.writers¶
Submodules¶
Classes¶
- CheckpointWriter: Class to easily write checkpoints in a structured way to the disk.
- LogWriter: Writes logs into a local file and (optionally) to an online web interface (identified via the tracker).
Package Contents¶
- class noether.core.writers.CheckpointWriter(path_provider, update_counter)¶
Class to easily write checkpoints in a structured way to the disk.
Each Model will be stored in a separate file, with weights and optimizer state additionally kept in separate files. This allows flexible storage of model states without producing files that are never needed after training. For example, to resume a run, one needs the model weights and optimizer states. However, storing optimizer states for all checkpoints is expensive, as optimizer states are commonly about twice as large as the weights alone.
To illustrate the flexibility, consider the use case of training an autoencoder model where the goal is to train a good encoder that is then used for another task. This model is implemented via a class Autoencoder that inherits from CompositeModel and contains two submodels, an encoder and a decoder (both of which inherit from Model). During training, we want to store the following files to the disk:
- The encoder weights after every 10 epochs, to evaluate performance at various training lengths.
- The latest weights and optimizer states of encoder and decoder, to allow resuming a run if it crashes.
The CheckpointWriter provides functionality to store the following files:
- ab_upt_cp=E10_…_model.th: encoder weights after 10 epochs
- ab_upt_cp=E20_…_model.th: encoder weights after 20 epochs
- ab_upt_cp=E30_…_model.th: encoder weights after 30 epochs
- ab_upt_cp=last_model.th: latest encoder weights
- ab_upt_cp=last_optim.th: latest encoder optimizer state
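The checkpointing strategy above can be sketched as plain Python (no noether imports). File names follow the `_cp={tag}_` pattern from this documentation; the model name "encoder" and the simplified tags are invented for illustration.

```python
# Self-contained sketch of the checkpointing strategy described above.
written = []

def save_file(name: str, tag: str, kind: str) -> None:
    # kind is "model" for weights or "optim" for optimizer state
    written.append(f"{name}_cp={tag}_{kind}.th")

for epoch in (10, 20, 30):
    save_file("encoder", f"E{epoch}", "model")  # periodic encoder snapshots
save_file("encoder", "last", "model")           # latest weights, overwritten each time
save_file("encoder", "last", "optim")           # latest optimizer state, overwritten each time
```

Only the "last" files are overwritten during training; the per-epoch snapshots accumulate, and no optimizer state is kept for them.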
Each model checkpoint is populated with metadata. Each checkpoint will be a dictionary containing the keys:
- “state_dict”: Weights of the model.
- “model_config”: The model configuration used to instantiate the model; a serialized dict of the pydantic model config.
- “checkpoint_tag”: The name of the checkpoint, e.g., E10_U200_S800 for a progress-based checkpoint or “latest” for a string-based checkpoint.
- “training_iteration”: Detailed information about the training iteration, as a dict with keys ‘epoch’, ‘update’, and ‘sample’. For example, the “latest” checkpoint tag does not reveal from which epoch the checkpoint was created, so its “training_iteration” field records the exact progress (e.g., “E13_U…_S…”).
- “run_id”: The ID of the run from which the checkpoint was created.
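A hypothetical example of this checkpoint dictionary layout, with key names taken from the documentation above; all concrete values are invented, and "state_dict" would hold torch tensors in practice:

```python
# Invented illustration of the checkpoint metadata schema described above.
checkpoint = {
    "state_dict": {"linear.weight": [[0.1, 0.2]]},      # model weights
    "model_config": {"input_dim": 2, "latent_dim": 1},  # serialized pydantic config
    "checkpoint_tag": "E10_U200_S800",                  # progress- or string-based tag
    "training_iteration": {"epoch": 10, "update": 200, "sample": 800},
    "run_id": "a1b2c3d4",                               # invented run identifier
}
```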
- Parameters:
path_provider (noether.core.providers.PathProvider)
update_counter (noether.core.utils.training.UpdateCounter)
- logger¶
- path_provider¶
- update_counter¶
- save_model_checkpoint(model_name, checkpoint_tag, state_dict, model_config=None, model_info=None, **extra)¶
Save a checkpoint to disk.
The output name of the checkpoint will be constructed as {model_name}{model_info}_cp={checkpoint_tag}_model.th (where model_info is an optional string that is empty if not provided). For example, if model_name is “autoencoder.encoder” and checkpoint_tag is “E10_U200_S800”, the output name will be “autoencoder.encoder_cp=E10_U200_S800_model.th”. However, if we store a different model than the one currently being trained, for example an EMA model, we can also include additional info in the output name. For example, if model_name is “autoencoder.encoder”, model_info is “ema”, and checkpoint_tag is “E10_U200_S800”, the output name will be “autoencoder.encoder_ema_cp=E10_U200_S800_model.th”.
- Parameters:
model_name (str) – Name of the model.
state_dict (dict) – State dict containing the weights of the model to store.
checkpoint_tag (str) – Checkpoint tag, for example “latest” or “E10_U200_S800”.
model_config (noether.core.schemas.models.ModelBaseConfig | None) – Model configuration. Defaults to None.
model_info (str | None) – Additional info to include in the output name. Defaults to None.
**extra – Additional entries to store in the checkpoint dictionary.
- Raises:
RuntimeError – in case of an unexpected error while parsing model_config.
- Return type:
None
- save(model, checkpoint_tag, trainer=None, save_weights=True, save_optim=True, save_latest_weights=False, save_latest_optim=False, model_names_to_save=None, save_frozen_weights=True, model_info=None)¶
Saves a model to the disk.
- Parameters:
model (noether.core.models.ModelBase) – Model to save.
checkpoint_tag (str) – Checkpoint tag, for example “latest” or “E10_U200_S800”.
trainer (noether.training.trainers.BaseTrainer | None) – If defined, also stores the state_dict of the trainer (and callbacks).
save_weights (bool) – If true, stores model weights.
save_optim (bool) – If true, stores optimizer states.
save_latest_weights (bool) – If true, also stores the weights with the checkpoint identifier “latest”. This file will be repeatedly overwritten throughout a training procedure to save storage.
save_latest_optim (bool) – If true, also stores the optimizer states with the checkpoint identifier “latest”. This file will be repeatedly overwritten throughout a training procedure to save storage.
model_names_to_save (list[str] | None) – If defined, only store some of the submodels of a CompositeModel.
save_frozen_weights (bool) – If true, also stores the weights of frozen models.
model_info (str | None)
- Return type:
None
- class noether.core.writers.LogWriter(path_provider, update_counter, tracker)¶
Writes logs into a local file and (optionally) to an online web interface (identified via the tracker). All logs that should be written for a certain update in the training are cached and written to the tracker all at once. For writing the logs to disk, everything for the full training process is cached, and only after training finishes are the logs written to disk (writing to disk repeatedly takes a long time).
- Parameters:
path_provider (noether.core.providers.PathProvider) – Provides the path to store all logs to the disk after training.
update_counter (noether.core.utils.training.UpdateCounter) – Provides the current training progress and adds the current epoch/update/sample to every log entry. This allows, e.g., changing the x-axis to “epoch” in online visualization tools.
tracker (noether.core.trackers.BaseTracker) – Provides an interface for logging to an online experiment tracking platform.
- logger¶
- path_provider¶
- update_counter¶
- tracker¶
- get_all_metric_values(key)¶
Retrieves all values of a metric from all log entries. Used mainly for integration tests.
- finish()¶
Stores all logs from the training to the disk.
- Return type:
None
- flush()¶
Composes a log entry with all metrics that were calculated for the current training state. This method is called after every update and if no metrics were logged, it simply does nothing.
- Return type:
None
- add_scalar(key, value, logger=None, format_str=None)¶
Adds a scalar value to the log.
- Parameters:
key (str) – Metric identifier.
value (torch.Tensor | numpy.generic | float) – Scalar tensor or float with the value that should be logged.
logger (logging.Logger | None) – If defined, will log the value to stdout.
format_str (str | None) – If defined, will alter the log to stdout to be in the provided format.
- Return type:
None
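The interplay of add_scalar, flush, and finish described above can be sketched with a toy stand-in. The class below is invented for illustration (it returns JSON instead of writing files) and only mirrors the documented caching behavior: metrics are cached per update, flushed as one entry, and written out in full only when training finishes.

```python
import json

class MiniLogWriter:
    # Toy stand-in for LogWriter's caching behavior; names and mechanics are invented.
    def __init__(self):
        self._pending = {}   # metrics logged since the last flush
        self._history = []   # all flushed entries, kept in memory until finish()

    def add_scalar(self, key, value):
        self._pending[key] = float(value)

    def flush(self, update):
        # Called after every update; does nothing if no metrics were logged.
        if not self._pending:
            return
        self._history.append({"update": update, **self._pending})
        self._pending = {}

    def finish(self):
        # Serialize the full training log in one go (a single disk write in practice).
        return json.dumps(self._history)
```

Caching the full history and serializing once in finish() avoids the repeated small disk writes the documentation warns about.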