How to Run Evaluation on Trained Models
The evaluation pipeline allows you to run inference and compute metrics on a previously trained model. This is useful for evaluating the best checkpoints on test sets, generating visualizations, or performing late-stage analysis without starting a full training run.
Overview
The evaluation process uses the noether-eval CLI command, which:
1. Loads the original training configuration from the run directory (provided as input_dir).
2. Merges it with an inference-specific configuration.
3. Initializes the model using weights from a specified checkpoint.
4. Executes the trainer’s eval() method, which runs all configured callbacks.
The outputs are stored in a new directory separate from the training outputs, preserving the original run data.
The CLI: noether-eval
The command is installed as part of the noether package. It uses Hydra for configuration management and supports merging multiple YAML files or command-line overrides.
Required Arguments
Configuration can be provided via the --hp flag or directly as command-line overrides. The following argument is required:
input_dir: The absolute or relative path to the directory of the training run you wish to evaluate. This directory must contain an `hp_resolved.yaml` file.
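For orientation, a run directory passed as input_dir might look roughly like the sketch below; the checkpoint directory and file names are illustrative and depend on your training setup, but `hp_resolved.yaml` must be present:

```text
outputs/2026-01-10/10-00-00/
├── hp_resolved.yaml      # resolved training configuration (required)
└── checkpoints/          # saved model weights (illustrative layout and names)
    ├── latest.ckpt
    └── best_accuracy.ckpt
```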
Examples
Basic evaluation using the latest checkpoint. This runs the same evaluation callbacks as configured during training:
```bash
noether-eval +input_dir=outputs/2026-01-10/10-00-00
```
Evaluation on a specific checkpoint, here combined with an inference-specific configuration supplied via --hp:
```bash
noether-eval +input_dir=outputs/2026-01-10/10-00-00 resume_checkpoint=best_accuracy --hp configs/inference/visualization.yaml
```
Run evaluation with modified callbacks, for example to calculate offline losses on the test set:
```yaml
# configs/inference/custom_eval_callbacks.yaml
trainer:
  callbacks:
    - kind: noether.training.callbacks.OfflineLossCallback
      dataset_key: test
      name: OfflineLossCallback
```
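Assuming the override file above is saved at configs/inference/custom_eval_callbacks.yaml, it can be applied with the --hp flag, mirroring the earlier examples:

```bash
noether-eval +input_dir=outputs/2026-01-10/10-00-00 --hp configs/inference/custom_eval_callbacks.yaml
```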
Configuration Merging
`noether-eval` performs a deep merge of configurations:
1. Base: The stored config from the input_dir (looked up from `hp_resolved.yaml`).
2. Override: The configuration provided via --hp or direct CLI arguments.
This allows you to easily switch datasets, modify callback parameters, or change evaluation settings while keeping the model architecture and other training-time settings intact.
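As a minimal sketch, an override file only needs to list the keys you want to change; everything else is taken from the stored training configuration. The file name and key names below are hypothetical and must match whatever your resolved training config actually defines:

```yaml
# configs/inference/test_set.yaml -- hypothetical override file
# Only the keys listed here replace values from hp_resolved.yaml;
# all other settings (model architecture, etc.) remain unchanged.
data:
  eval_split: test   # hypothetical key name: point evaluation at the test split
```

Such a file would be passed like any other inference configuration, via --hp configs/inference/test_set.yaml.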
Inference Runner
The InferenceRunner (found in src/noether/inference/runners/inference_runner.py) is responsible for setting up the environment similarly to the training runner but with key differences:
- Weight Loading: It uses the `PreviousRunInitializer`, which only loads model weights and skips optimizer/scheduler states.
- Eval Mode: It calls `trainer.eval(model)` instead of `trainer.train(model)`.
Trainer Evaluation Mode
Evaluation mode simply executes the configured callbacks on the saved model weights. This allows the same callback implementations to be reused for both training and evaluation, ensuring consistent metric computation and visualization generation.
By default, the same callbacks used during training are also run during evaluation. You can customize this by passing a modified configuration to noether-eval, for example to add extra visualization callbacks or to change metric logging behavior specifically for evaluation runs.