How to Run Inference and Evaluation on Trained Models
noether-eval is the post-training callback runner: point it at a finished
training run, pick a checkpoint, and it re-executes the configured callbacks
against those weights. Whether that means evaluation (computing metrics on
a held-out split) or inference (writing predictions to disk for downstream
analysis) depends entirely on which callbacks are configured — the runner
itself doesn’t care, and most projects use it for both.
What this is good for
Evaluating on a held-out split: load the best checkpoint and re-run metric callbacks against the test set without re-training.
Generating predictions for downstream analysis: configure a callback with save_predictions=True (e.g. AeroMetricsCallback in recipes/aero_cfd/) and write per-sample model outputs to disk for VTK export, force-coefficient computation, or reporting.
Visualizing model behavior: re-run plotting/visualization callbacks on demand, against any checkpoint.
Re-running with different callbacks: keep the trained model and pipeline fixed, and swap the callback list for a custom YAML for one-off analysis.
Despite the binary name, treat noether-eval as the generic
inference/evaluation entry point — it has no eval-only logic baked in.
For interactive work (notebooks, prototyping, debugging) where you don’t need callbacks, logging, or reproducibility, see Loading a run in Python (notebooks, prototyping) below — that path skips Hydra and the trainer entirely.
Quick start
Point run_dir at the training run output directory (the folder that
contains hp_resolved.yaml and a checkpoints/ subfolder):
noether-eval run_dir=outputs/2026-01-10_abc12
That’s the whole minimum invocation. By default it loads the latest checkpoint, reuses every callback that was active during training, and writes outputs alongside the original training run.
What run_dir should point at
A training run is laid out as output_path/run_id[/stage_name]. run_dir
is the deepest directory in that chain — the one that contains
hp_resolved.yaml:
outputs/
2026-01-10_abc12/ ← run_id
train/ ← stage_name (optional)
hp_resolved.yaml ← run_dir = outputs/2026-01-10_abc12/train
checkpoints/
tracker/
If the run was trained without a stage_name, the run_id directory
itself is the run_dir.
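The layout rule above can be checked mechanically. The following is a minimal sketch that scans a directory tree for folders containing hp_resolved.yaml; the helper name find_run_dirs is hypothetical and not part of noether.

```python
from pathlib import Path


def find_run_dirs(root):
    """Return every directory at or below `root` that looks like a run_dir.

    A directory qualifies when it directly contains hp_resolved.yaml,
    the marker file this guide uses to identify a run_dir.
    """
    root = Path(root)
    return sorted(
        marker.parent
        for marker in root.rglob("hp_resolved.yaml")
        if marker.is_file()
    )


if __name__ == "__main__":
    # Print candidate values for run_dir= under ./outputs (path is an example).
    for run_dir in find_run_dirs("outputs"):
        print(run_dir)
```

For a run trained with a stage_name, the printed path ends in the stage folder; without one, it ends in the run_id folder.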
Overriding the training config
noether-eval is intentionally flexible: the resolved training config is
re-used as the base, and any key in it can be overridden on the command
line — no Hydra + prefix needed, exactly the same syntax as
noether-train. A handful of common patterns follow.
Pick a different checkpoint
By default noether-eval loads the latest checkpoint. To pick another:
# The best checkpoint, by metric. BestCheckpointCallback writes files like
# `<model>_cp=best_model.<metric_with_slashes_flattened_to_dots>_model.th`,
# so the tag is `best_model.<metric>`. Look up the configured `metric_key`
# in the run's hp_resolved.yaml — for `metric_key: loss/test/total`:
noether-eval run_dir=outputs/2026-01-10_abc12 resume_checkpoint=best_model.loss.test.total
# An exponential-moving-average snapshot
noether-eval run_dir=outputs/2026-01-10_abc12 resume_checkpoint=latest_ema=0.9999
# A specific epoch / update / step
noether-eval run_dir=outputs/2026-01-10_abc12 resume_checkpoint=E100_U5000_S5000
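The slash-to-dot rule for best-checkpoint tags is easy to get wrong by hand, so here is a small sketch of it. Both helper names are hypothetical. The filename parser assumes that every checkpoint follows the `<model>_cp=<tag>_model.th` pattern described for BestCheckpointCallback, which this guide does not confirm for other checkpoint kinds.

```python
def best_checkpoint_tag(metric_key):
    """Turn a metric key such as 'loss/test/total' into a resume_checkpoint tag."""
    return "best_model." + metric_key.replace("/", ".")


def tag_from_filename(filename):
    """Extract <tag> from '<model>_cp=<tag>_model.th'; return None if it does not match.

    Assumption: the naming pattern applies to the file being inspected.
    """
    _, sep, rest = filename.partition("_cp=")
    suffix = "_model.th"
    if not sep or not rest.endswith(suffix):
        return None
    return rest[: -len(suffix)]


if __name__ == "__main__":
    print(best_checkpoint_tag("loss/test/total"))
    print(tag_from_filename("ab_upt_cp=latest_model.th"))
```

Running tag_from_filename over the files in checkpoints/ gives a quick list of values to try for resume_checkpoint.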
Send outputs somewhere else
The run writes outputs to output_path/run_id/stage_name — by default the
same folder as the training run. Override output_path, stage_name (or
both) to redirect:
# Sibling stage next to the training run
# → outputs/2026-01-10_abc12/eval/
noether-eval run_dir=outputs/2026-01-10_abc12/train stage_name=eval
# Different output root
# → /scratch/eval_runs/2026-01-10_abc12/train/
noether-eval run_dir=outputs/2026-01-10_abc12/train output_path=/scratch/eval_runs
# Both — different root and stage
noether-eval run_dir=outputs/2026-01-10_abc12/train \
output_path=/scratch/eval_runs stage_name=eval
In every case noether-eval still loads the checkpoint from run_dir —
only the output location changes.
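The output location follows directly from the output_path/run_id/stage_name rule. A tiny sketch of that composition follows; eval_output_dir is a hypothetical helper that only mirrors the rule as stated here and does not call into noether.

```python
from pathlib import PurePosixPath


def eval_output_dir(output_path, run_id, stage_name=None):
    """Compose output_path/run_id[/stage_name] as described in this guide."""
    parts = [output_path, run_id]
    if stage_name:
        parts.append(stage_name)
    return PurePosixPath(*parts)


if __name__ == "__main__":
    # Sibling stage next to the training run.
    print(eval_output_dir("outputs", "2026-01-10_abc12", "eval"))
    # Different output root, training stage name kept.
    print(eval_output_dir("/scratch/eval_runs", "2026-01-10_abc12", "train"))
```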
Run on different hardware
You don’t have to re-run on the same accelerator as training. Override
accelerator (and optionally devices):
# Apple Silicon
noether-eval run_dir=outputs/2026-01-10_abc12 accelerator=mps
# CPU-only spot check
noether-eval run_dir=outputs/2026-01-10_abc12 accelerator=cpu
# A specific subset of GPUs
noether-eval run_dir=outputs/2026-01-10_abc12 devices="0,1"
Switch experiment tracking
Replace the training-time tracker with a disabled or local one for a quick re-run, or send results to a different W&B project:
noether-eval run_dir=outputs/2026-01-10_abc12 tracker=disabled
noether-eval run_dir=outputs/2026-01-10_abc12 tracker.project=eval-only
Tweak any other training-time key
The same dot-path syntax works for nested keys:
# Re-evaluate at a smaller batch size
noether-eval run_dir=outputs/2026-01-10_abc12 trainer.effective_batch_size=1
# Point at a different dataset root (e.g. moved data)
noether-eval run_dir=outputs/2026-01-10_abc12 dataset_root=/new/data/path
Most callbacks honor these overrides because they read the same trainer config keys they used at training time.
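When the same evaluation has to be repeated over several runs or checkpoints, it can help to assemble the command line in a script. The sketch below builds the argument list using only the key=value syntax shown above; build_eval_command is a hypothetical helper, and it assumes noether-eval is on PATH if you actually execute the result.

```python
import shlex


def build_eval_command(run_dir, overrides=None):
    """Build a noether-eval argv list from a run_dir and a dict of overrides.

    Keys are dot-paths such as 'trainer.effective_batch_size'.
    """
    argv = ["noether-eval", f"run_dir={run_dir}"]
    for key, value in (overrides or {}).items():
        argv.append(f"{key}={value}")
    return argv


if __name__ == "__main__":
    cmd = build_eval_command(
        "outputs/2026-01-10_abc12",
        {"accelerator": "cpu", "trainer.effective_batch_size": 1},
    )
    # Print a copy-pasteable command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Passing the list to subprocess.run avoids shell quoting problems with values such as devices="0,1".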
Customizing the callbacks
By default noether-eval reuses the callbacks from training. To add
post-training-only callbacks — whether for extra metrics, prediction saving,
or visualization — write a small YAML and pass it via --hp.
Evaluation example — add an offline test-set loss:
# configs/eval_extra.yaml
trainer:
callbacks:
- kind: noether.training.callbacks.OfflineLossCallback
dataset_key: test
every_n_epochs: 1
Inference example — save denormalized predictions to disk for downstream analysis (VTK export, force coefficient computation, plotting):
# configs/save_predictions.yaml
trainer:
callbacks:
- kind: aero_cfd.callbacks.AeroMetricsCallback
dataset_key: test
every_n_epochs: 1
forward_properties: ${model.forward_properties}
save_predictions: true
predictions_path: ./predictions
Run with either:
noether-eval run_dir=outputs/2026-01-10_abc12 --hp configs/eval_extra.yaml
noether-eval run_dir=outputs/2026-01-10_abc12 --hp configs/save_predictions.yaml
When --hp is supplied, that file becomes the Hydra base config. Treat it as
an escape hatch for composing your own inference/eval pipeline. The
run_dir argument and CLI overrides above still work the same way.
Flipping a flag on an existing callback (no YAML): when the change is
just enabling a feature already supported by a configured callback (e.g.
turning on save_predictions for an AeroMetricsCallback that ran in
metrics-only mode during training), use a dotted index override. Hydra’s ++
prefix is required because the keys aren’t in the loaded YAML (training left
them at defaults):
# Index 4 = position of AeroMetricsCallback in trainer.callbacks
noether-eval run_dir=outputs/<run_id>/train \
++trainer.callbacks.4.save_predictions=true \
++trainer.callbacks.4.predictions_path=/path/to/preds
The list index is the callback’s position in the trainer.callbacks list of
the source run’s hp_resolved.yaml. Open the file to confirm it.
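Counting list entries by eye is error-prone, so the index lookup can be scripted. This sketch assumes each entry in trainer.callbacks carries a `kind` key, as in the YAML examples above; both helper names are hypothetical, and load_callbacks needs PyYAML installed.

```python
def callback_indices(callbacks, kind_suffix):
    """Return positions in trainer.callbacks whose `kind` ends with kind_suffix."""
    return [
        index
        for index, callback in enumerate(callbacks)
        if str(callback.get("kind", "")).endswith(kind_suffix)
    ]


def load_callbacks(hp_resolved_path):
    """Read the trainer.callbacks list from an hp_resolved.yaml file."""
    import yaml  # third-party (PyYAML); imported lazily so the lookup itself has no dependency

    with open(hp_resolved_path) as handle:
        config = yaml.safe_load(handle)
    return config["trainer"]["callbacks"]


if __name__ == "__main__":
    # Inline stand-in for load_callbacks("<run_dir>/hp_resolved.yaml").
    callbacks = [
        {"kind": "noether.training.callbacks.OfflineLossCallback"},
        {"kind": "aero_cfd.callbacks.AeroMetricsCallback", "dataset_key": "test"},
    ]
    print(callback_indices(callbacks, "AeroMetricsCallback"))
```

If the lookup returns more than one index, the same callback class is configured several times and you have to pick the entry by its other keys (for example dataset_key).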
How it works
noether-eval is a thin post-training callback runner. Under the hood it:
Reads hp_resolved.yaml from run_dir, which captures the full, resolved training configuration.
Wires that config in as the Hydra base, so every training-time key is a valid override target on the command line.
Forces resume_run_id / resume_stage_name to point at the training run, defaults resume_checkpoint to latest, and applies any user overrides on top.
Hands the resolved config to InferenceRunner, which sets up the trainer/model/tracker the same way training does but uses PreviousRunInitializer to load only the model weights (no optimizer/scheduler state), then calls trainer.eval(model) instead of trainer.train(model).
trainer.eval() simply iterates the configured callbacks against the
restored weights — there is no separate eval loop, and nothing in the runner
distinguishes “evaluation” from “inference”. Whichever callbacks are
configured (metric computation, prediction saving, visualization) decide what
the run actually produces. Callbacks that aren’t meaningful here (e.g.
checkpoint saving) are no-ops; callbacks can branch on
interval_type == "eval" inside _periodic_callback() if they need to
behave differently outside of training. See
noether.core.callbacks.periodic.PeriodicCallback and
noether.core.callbacks.periodic.PeriodicIteratorCallback for the
callback protocol.
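To make the branching idea concrete, here is a toy, dependency-free sketch. It does not use the real noether classes: ToyPeriodicCallback, toy_eval, the keyword-only signature of _periodic_callback, and the "epoch" value are all assumptions for illustration. Check PeriodicCallback for the actual hook signature.

```python
class ToyPeriodicCallback:
    """Stand-in for a periodic callback; NOT the real noether base class."""

    def __init__(self):
        self.log = []

    def _periodic_callback(self, *, interval_type, **kwargs):
        # Hypothetical signature: the real hook may take different arguments.
        if interval_type == "eval":
            # Post-training run: e.g. write predictions, skip training-only bookkeeping.
            self.log.append("eval: post-training behavior")
        else:
            self.log.append(f"{interval_type}: training-time behavior")


def toy_eval(callbacks):
    """Mimic the described trainer.eval(): one pass over the callbacks, no training loop."""
    for callback in callbacks:
        callback._periodic_callback(interval_type="eval")


if __name__ == "__main__":
    callback = ToyPeriodicCallback()
    toy_eval([callback])
    print(callback.log)
```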
Recipe code on PYTHONPATH
If the training run referenced classes from a recipe (e.g.
aero_cfd.pipeline.AeroMultistagePipeline), run noether-eval from a
working directory where those imports resolve — typically the recipe folder
itself, or with PYTHONPATH set:
cd recipes/aero_cfd
noether-eval run_dir=/path/to/outputs/2026-01-10_abc12 tracker=disabled
Each run directory also contains a code.tar.gz snapshot of the codebase at
training time, useful when the source tree has drifted.
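Before relying on the snapshot, it is worth checking what it contains. The sketch below lists the archive members with the standard-library tarfile module; list_code_snapshot is a hypothetical helper, and the internal layout of code.tar.gz is not specified in this guide.

```python
import tarfile
from pathlib import Path


def list_code_snapshot(run_dir):
    """Return the member names stored in <run_dir>/code.tar.gz."""
    archive = Path(run_dir) / "code.tar.gz"
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()


if __name__ == "__main__":
    import sys

    # Usage: python list_snapshot.py /path/to/outputs/2026-01-10_abc12
    if len(sys.argv) > 1:
        for name in list_code_snapshot(sys.argv[1]):
            print(name)
```

If the recipe modules you need are present, extracting the archive to a scratch directory and adding that directory to PYTHONPATH is one way to make the training-time imports resolve.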
Loading a run in Python (notebooks, prototyping)
When you want to poke at a trained model in a notebook — inspect predictions
on a single sample, prototype a new visualization, debug a head-scratcher —
noether-eval is overkill: it stands up Hydra, the trainer, the tracker,
and the callback loop just to give you a model and a dataset.
The noether.inference package exposes a single Run class
for that case. The model, normalizers, and (optionally) dataset are built
on demand — no Hydra, no trainer.
Run has two construction modes, picked by which constructor you call:
Run(run_dir): open a full training output directory (hp_resolved.yaml + checkpoints/). Everything below is available.
Run.from_checkpoint(path): open just a single ..._model.th file. Every checkpoint written by noether embeds the model config and the per-field normalizer specs + statistics, which is enough for model() and normalizers() on their own. dataset(), config, and statistics raise in this mode.
Run-dir mode — full access to config + dataset:
from noether.inference import Run
run = Run("/path/to/outputs/2026-01-10_abc12")
# Optional: patch the config before building artifacts —
# typically to point dataset paths at this machine's data.
for ds_cfg in run.config.datasets.values():
ds_cfg.root = "/local/path/to/data"
dataset = run.dataset("test")
model = run.model(checkpoint="latest", device="cuda")
# checkpoint examples: "latest", "best_model.<metric>", "E10", "latest_ema=0.9999"
Checkpoint-only mode — just the model + normalizers, from a single
.th file:
from noether.inference import Run
run = Run.from_checkpoint("/path/to/.../checkpoints/ab_upt_cp=latest_model.th")
model = run.model(device="cuda")
norms = run.normalizers()
Run exposes three lazy methods, all independent: you don’t have to
call them in order, and you don’t have to call them all. Pick whichever
fits your use case:
run.model(...): the trained model with checkpoint weights loaded. Works in both modes; works on any tensor dict you can construct.
run.normalizers(split): the field normalizers (e.g. for converting model predictions back to physical units). Works in both modes; the split argument is ignored in checkpoint-only mode (the normalizer payload embedded in the checkpoint is global). Built without instantiating the dataset; the data files do not need to be present.
run.dataset(split): the dataset, with the same collator the trainer wired. Run-dir mode only, since this is the one that needs the original data files on disk. Accessing run.config and run.statistics also requires run-dir mode.
That separation matters in particular for the bring-your-own-data flow
— applying a trained model to a CAD mesh, a custom point cloud, or any
data that isn’t packaged as a noether Dataset. Checkpoint-only mode is
the natural entry point: no run directory, no hp_resolved.yaml, no
stats file:
import torch
from noether.inference import Run
run = Run.from_checkpoint("/path/to/.../checkpoints/ab_upt_cp=latest_model.th")
model = run.model(device="cuda")
norms = run.normalizers()
# You build the input dict yourself, matching the model's forward signature.
with torch.inference_mode():
pred = model(**my_inputs)
# Same normalizers the training data used — denormalize the prediction.
pressure_phys = norms["surface_pressure"].inverse(pred["surface_pressure"])
This is not a substitute for noether-eval: there are no metrics,
no callbacks, no run output directory, and no reproducibility guarantees.
Use it for interactive work; use noether-eval for everything else.
A worked example — load a trained AB-UPT / DrivAerML run via both
Run.from_checkpoint (checkpoint-only flow, end-to-end with raw
tensors) and Run(run_dir) (full Dataset/Pipeline flow), and plot
predictions vs. ground truth — lives at
notebooks/ab_upt_drivaerml_inference.ipynb.
A note on --help and the binary name
noether-eval is built on Hydra rather than Typer, so noether-eval --help
prints Hydra’s generic help text rather than a curated list of arguments —
this guide is the practical reference. The binary is also still spelled
noether-eval for brevity even though the underlying machinery
(InferenceRunner, the
noether.inference module) reflects its broader inference + evaluation
scope.