How to Continue Training from a Checkpoint¶
Noether's CheckpointWriter stores model checkpoints using a structured file naming convention:
{model_name}_{model_info}_cp={checkpoint}_model.th
This naming scheme consists of three main components:
- Model Name (model_name): A unique identifier for your model (e.g., transformer, upt, ab-upt). This must be configured in your model config and is used to identify which model the checkpoint belongs to.
- Checkpoint Identifier (checkpoint): Indicates when in training the model was saved. Common values include:
  - latest: the most recent periodic checkpoint
  - best.{metric_name}: the checkpoint with the best value for a specific metric (e.g., best.accuracy)
  - E10_U100_S200: a specific training point (epoch 10, update 100, sample 200)
- Model Info (model_info) (optional): Additional metadata to distinguish special checkpoint variants. For example:
  - ema_factor=0.9999: Exponential Moving Average (EMA) weights
If model_info is omitted, the filename becomes:
{model_name}_cp={checkpoint}_model.th
Model info is defined by the user (for example, in a custom callback) and can be anything. Make sure the model info is informative and consistent.
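The naming scheme above can be sketched as a small helper. This is only an illustration of the convention, not part of Noether: the function name format_checkpoint_filename is hypothetical.

```python
from typing import Optional


def format_checkpoint_filename(
    model_name: str, checkpoint: str, model_info: Optional[str] = None
) -> str:
    """Build a filename of the form {model_name}_{model_info}_cp={checkpoint}_model.th.

    Hypothetical helper for illustration; it only mirrors the documented pattern.
    """
    parts = [model_name]
    if model_info:
        # model_info is optional; when omitted, the segment is dropped entirely.
        parts.append(model_info)
    parts.append(f"cp={checkpoint}")
    return "_".join(parts) + "_model.th"


if __name__ == "__main__":
    print(format_checkpoint_filename("upt", "latest"))
    print(format_checkpoint_filename("upt", "E10_U100_S200", "ema_factor=0.9999"))
```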
Checkpoint Storage Location¶
Checkpoints are stored in the following directory structure:
<output_dir>/<run_id>/<stage_name>/checkpoints/{model_name}_{model_info}_cp={checkpoint}_model.th
Where:
- output_dir: Your configured output directory
- run_id: The unique identifier for the training run
- stage_name: The name of the training stage (if using multi-stage training)
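Before resuming or initializing from a run, it can help to see which checkpoints actually exist on disk. The following is a minimal sketch that scans a checkpoints directory and extracts the checkpoint identifiers from the filenames. The helper list_checkpoint_tags is hypothetical and relies only on the filename pattern described above, not on any Noether API.

```python
import glob
from pathlib import Path
from typing import List

SUFFIX = "_model.th"


def list_checkpoint_tags(checkpoint_dir: str, model_name: str) -> List[str]:
    """Return the sorted, de-duplicated checkpoint identifiers found for model_name.

    Hypothetical helper: it matches files by prefix, so a model named "upt"
    would also match files of a model named "upt_large".
    """
    pattern = f"{glob.escape(model_name)}_*cp=*{SUFFIX}"
    tags = set()
    for path in Path(checkpoint_dir).glob(pattern):
        stem = path.name[: -len(SUFFIX)]
        # Everything after the last "_cp=" is the checkpoint identifier.
        _, _, tag = stem.rpartition("_cp=")
        tags.add(tag)
    return sorted(tags)
```

Files that differ only in model_info (for example, EMA variants) share a checkpoint identifier, which is why the result is de-duplicated.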
Resuming Training from a Previous Run¶
To resume training from a checkpoint (e.g., after a crash or interruption), add the following to your root configuration file:
resume_run_id: <id of the previous run>
resume_stage_name: <stage name from the previous run>
stage_name: continue_training # Optional: defaults to resumed stage name if not specified
resume_checkpoint: <checkpoint> # Optional: specify a particular checkpoint to resume from
If resume_checkpoint is not specified, training resumes from the latest checkpoint of the specified stage.
To resume from a particular checkpoint, set resume_checkpoint to a single counter: the epoch (e.g., E10), the update (e.g., U100), or the sample (e.g., S200).
Do not use the fully specified checkpoint identifier here (e.g., E10_U100_S200).
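A quick pre-flight check can catch the fully specified form before a job is launched. The sketch below only recognizes the three single-counter forms described above. The function name is hypothetical, and the check says nothing about other values Noether may accept for resume_checkpoint.

```python
import re

# One letter (E = epoch, U = update, S = sample) followed by digits, nothing else.
_SINGLE_COUNTER = re.compile(r"[EUS]\d+")


def is_single_counter_checkpoint(value: str) -> bool:
    """Return True for values such as E10, U100, or S200.

    Hypothetical helper: a combined identifier such as E10_U100_S200
    returns False, matching the guidance above.
    """
    return _SINGLE_COUNTER.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["E10", "U100", "S200", "E10_U100_S200"]:
        print(candidate, is_single_counter_checkpoint(candidate))
```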
Initializing Model Weights from a Previous Run¶
To initialize only the model weights (without resuming the full training state), use the PreviousRunInitializer
in your model configuration:
model:
  name: my_model
  kind: path.to.MyModel
  initializers:
    - kind: noether.core.initializers.PreviousRunInitializer
      run_id: <previous run_id>
      model_name: <model_name>
      stage_name: <stage_name> # Optional: leave empty if no stage
      checkpoint_tag: <checkpoint> # e.g., latest, best.accuracy, E10_U100_S200
      model_info: <model_info> # Optional: e.g., ema_factor=0.9999
This approach is useful for:
- Transfer learning from a pretrained model
- Fine-tuning on a new dataset
- Starting a new training run with pretrained weights
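To see how the initializer fields relate to a file on disk, the sketch below maps them onto the storage layout described earlier. It assumes the initializer resolves checkpoints using that layout, which this page does not state explicitly. The function expected_checkpoint_path is hypothetical, and the case of an empty stage_name is deliberately not covered because the resulting directory layout is not documented here.

```python
from pathlib import Path
from typing import Mapping


def expected_checkpoint_path(output_dir: str, initializer: Mapping[str, str]) -> Path:
    """Return the path a PreviousRunInitializer config is assumed to point at.

    Hypothetical helper: it expects run_id, stage_name, model_name, and
    checkpoint_tag to be set; model_info is optional.
    """
    parts = [initializer["model_name"]]
    if initializer.get("model_info"):
        parts.append(initializer["model_info"])
    parts.append(f"cp={initializer['checkpoint_tag']}")
    filename = "_".join(parts) + "_model.th"
    return (
        Path(output_dir)
        / initializer["run_id"]
        / initializer["stage_name"]
        / "checkpoints"
        / filename
    )
```

Calling .exists() on the returned path before starting a run gives an early warning if the run_id, stage name, or checkpoint tag is mistyped.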
Use Cases¶
- Full Training Resumption: Use resume_run_id and resume_stage_name when you need to continue training exactly where it left off, preserving the optimizer state, callback states (if any), and the trainer's progress and state.
- Weight Initialization Only: Use PreviousRunInitializer when you want to start a new training run from pretrained weights, without restoring the optimizer or trainer state.