How to Run Evaluation on Trained Models

The evaluation pipeline allows you to run inference and compute metrics on a previously trained model. This is useful for evaluating the best checkpoints on test sets, generating visualizations, or performing late-stage analysis without starting a full training run.

Overview

The evaluation process uses the noether-eval CLI command, which:

1. Loads the original training configuration from the run directory (provided as input_dir).
2. Merges it with an inference-specific configuration.
3. Initializes the model using weights from a specified checkpoint.
4. Executes the configured trainer’s eval() method, which runs all configured callbacks.

The outputs are stored in a new directory separate from the training outputs, preserving the original run data.

The CLI: noether-eval

The command is installed with the noether package. It uses Hydra for configuration management and supports merging multiple YAML files as well as command-line overrides.
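Schematically, putting together the call shapes used in the examples below (bracketed parts are optional; which additional key=value overrides make sense depends on your own configuration):

noether-eval +input_dir=<run_dir> [resume_checkpoint=<checkpoint_name>] [--hp <override.yaml>] [key=value ...]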

Required Arguments

Configuration can be provided via the --hp flag or directly as CLI overrides. The following argument is required:

  • input_dir: The absolute or relative path to the directory of the training run you wish to evaluate. This directory must contain an hp_resolved.yaml file (a sketch of a typical layout is shown below).
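For orientation, a training run directory might look roughly like this; the checkpoints subdirectory is an assumed layout shown only for illustration, and only hp_resolved.yaml is strictly required:

outputs/2026-01-10/10-00-00/
├── hp_resolved.yaml      # resolved training configuration, read by noether-eval
├── checkpoints/          # assumed location of saved weights, e.g. best_accuracy
└── ...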

Examples

Basic evaluation using the latest checkpoint. This runs the same evaluation callbacks as configured during training:

noether-eval +input_dir=outputs/2026-01-10/10-00-00

Evaluation of a specific checkpoint, combined with an additional inference config:

noether-eval +input_dir=outputs/2026-01-10/10-00-00 resume_checkpoint=best_accuracy --hp configs/inference/visualization.yaml

Run evaluation with modified callbacks, for example to calculate offline losses on the test set:


# configs/inference/custom_eval_callbacks.yaml
trainer:
  callbacks:
    - kind: noether.training.callbacks.OfflineLossCallback
      dataset_key: test
      name: OfflineLossCallback
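Assuming the file above is saved at configs/inference/custom_eval_callbacks.yaml, the evaluation can then be started with:

noether-eval +input_dir=outputs/2026-01-10/10-00-00 --hp configs/inference/custom_eval_callbacks.yaml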

Configuration Merging

`noether-eval` performs a deep merge of configurations:

1. Base: the stored config from the input_dir (read from hp_resolved.yaml).
2. Override: the configuration provided via --hp or direct CLI arguments.

This allows you to easily switch datasets, modify callback parameters, or change evaluation settings while keeping the model architecture and other training-time settings intact.
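As a rough illustration of the merge behavior (the config keys below are made up for the example and are not taken from the noether defaults):

# Base: fragment stored in hp_resolved.yaml
model:
  hidden_dim: 256
data:
  dataset_key: val

# Override: fragment passed via --hp or as a CLI argument
data:
  dataset_key: test

# Merged result used for evaluation
model:
  hidden_dim: 256      # kept from the base config
data:
  dataset_key: test    # replaced by the override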

Inference Runner

The InferenceRunner (found in src/noether/inference/runners/inference_runner.py) sets up the environment in much the same way as the training runner, with two key differences:

  • Weight Loading: It uses the PreviousRunInitializer, which loads only the model weights and skips the optimizer/scheduler states.

  • Eval Mode: It calls trainer.eval(model) instead of trainer.train(model).

Trainer Evaluation Mode

Evaluation mode does nothing more than execute the configured callbacks on the model restored from the saved weights.

This way we can reuse the same callback implementations for both training and evaluation, ensuring consistency in metrics computation and visualization generation.

By default, the same callbacks used during training are also run during evaluation. You can customize this by modifying the configuration passed to noether-eval to include different or additional callbacks as needed, for example extra visualization callbacks or different metric logging behavior specifically for evaluation runs.