Running the Experiments¶
For a general guide on launching training jobs, see How to launch a SLURM job from the command line.
Important
All commands in this section must be run from inside the recipe folder (recipes/aero_cfd/).
Running SLURM jobs¶
The Noether Framework provides the noether-train-submit-job CLI for submitting SLURM jobs.
It reads SLURM parameters from a slurm config group in your experiment configuration.
This recipe defines its SLURM defaults in configs/slurm/slurm_config.yaml:
nodes: 1
cpus_per_task: 28
partition: compute
gpus_per_node: 1
ntasks_per_node: 1
mem: 64GB
output: /home/%u/logs/shapenet_car/%x_%j.out
nice: 0
job_name: shapenet_experiment
chdir:
env_path: .venv/bin/activate
For a detailed guide on configuring and using noether-train-submit-job, see
How to launch a SLURM job from the command line.
Alternatively, this recipe includes hand-written job scripts. To run all the models for ShapeNet-Car:
sbatch jobs/train_shapenet.job
The same applies to jobs/train_ahmedml.job and jobs/train_drivaerml.job, which can be found in the
jobs/ directory.
We also provide config files to run the experiments for DrivAerNet++ (train_drivaernet.yaml) and the Emmi-Wing (train_wing.yaml); however, those experiments are not part of this walkthrough.
Warning
This assumes you have access to a SLURM-based system. If not, please review the job files to see the commands used to run the experiments.
Job arrays:
In the jobs/experiments/ folder, we define job arrays: text files in which each row specifies
one experiment to run. You can add extra rows with different seeds or experiment variants to
these *.txt files as needed.
The flag #SBATCH --array=... defines how to run the job array:
- #SBATCH --array=1-10: Runs rows 1 to 10 from ./jobs/experiments/shapenet_experiments.txt
- #SBATCH --array=1,5,9: Runs rows 1, 5, and 9
- #SBATCH --array=1-10%5: Runs rows 1 to 10, but with a maximum of 5 jobs running simultaneously. When one of the 5 jobs finishes, the next job in the array will start. This is especially useful for large job arrays when you don't want to occupy the entire cluster.
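Inside a job script, the usual pattern is to use the SLURM_ARRAY_TASK_ID environment variable (set by SLURM for each array task) to select the matching row of the experiments file. The sketch below illustrates this pattern; the provided job scripts may differ in detail:

```shell
# Illustrative sketch: select the row of the experiments file that
# corresponds to this array task. (The actual job scripts in jobs/
# may implement this differently.)
EXPERIMENTS_FILE="jobs/experiments/shapenet_experiments.txt"
# SLURM sets SLURM_ARRAY_TASK_ID per array task; default to 1 for local testing.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"
# Extract the TASK_ID-th line of the file.
CMD=$(sed -n "${TASK_ID}p" "$EXPERIMENTS_FILE" 2>/dev/null || true)
echo "Selected command: $CMD"
# eval "$CMD"  # a real job script would execute the selected row
```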
Running a single experiment¶
To run a single experiment, execute the following command:
uv run noether-train \
--hp configs/train_shapenet.yaml \
+experiment/shapenet=transformer tracker=disabled +seed=1
Important
Please set the dataset_root in either the config files or via the command line override.
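For example, a config-file override might look like the following (the path is illustrative, and we assume dataset_root is a top-level key in the training config):

```yaml
# In configs/train_shapenet.yaml (path below is a placeholder):
dataset_root: /data/shapenet_car
```

The same value can alternatively be passed as a command-line override, e.g. dataset_root=/data/shapenet_car appended to the noether-train invocation.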
Running multi-GPU experiments¶
When running outside of SLURM, use uv run noether-train as shown above. This will spawn one
process for every GPU that is available on the system and visible via CUDA_VISIBLE_DEVICES.
You can also select specific devices by adding, for example, devices="0,1,2,4" to the root
config.
Important
If you train on more than 1 GPU, ensure that effective_batch_size is at least equal to
the number of GPUs used. Multi-node training is currently not supported.
Example of a multi-GPU SLURM job:
srun --nodes=1 --partition=compute --gpus-per-node=2 --mem=64GB \
--ntasks-per-node=2 --kill-on-bad-exit=1 --cpus-per-task=28 \
uv run noether-train \
--hp configs/train_shapenet.yaml \
+experiment/shapenet=transformer tracker=disabled \
trainer.effective_batch_size=2
Running inference¶
To run evaluation callbacks on trained models, use the noether-eval CLI tool.
For detailed instructions on running inference with trained models, refer to How to Run Evaluation on Trained Models.
Resuming training after interruption¶
To resume training after an error or interruption, simply add resume_run_id: <RUN_ID>
(and resume_stage_name if a stage_name was used in the previous run) to the training
configuration (either in the YAML file or via the CLI). Training will continue from the last
saved epoch checkpoint.
Example:
uv run noether-train \
--hp configs/train_shapenet.yaml \
+experiment/shapenet=transformer \
resume_run_id=<run_id> resume_stage_name=<stage_name>
Optionally, you can change the stage_name to make it clear that checkpoints stored for this
run are from a continued training run.
Initializing model weights¶
To initialize a model with weights from a previous training run, add an initializer configuration to the model config:
model:
# ... model configuration
initializers:
- kind: noether.core.initializers.PreviousRunInitializer
run_id: <run_id>
model_name: ab_upt
checkpoint_tag: latest # Options: 'latest', 'best', or specific checkpoint
Required parameters:
- run_id: The run identifier from the previous training run
- model_name: The name of the model to load weights from
- checkpoint_tag: Which checkpoint to use (latest, best, or a specific epoch number)
Optional parameters:
- model_info: Additional checkpoint metadata (e.g., ema=0.9999 for exponential moving average weights, or specific loss metric identifiers for best checkpoints). Leave empty for standard checkpoints.
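As a sketch, loading EMA weights from a previous run might look like the following; the exact format accepted by model_info may differ, so treat this as an assumption rather than the definitive syntax:

```yaml
model:
  # ... model configuration
  initializers:
    - kind: noether.core.initializers.PreviousRunInitializer
      run_id: <run_id>
      model_name: ab_upt
      checkpoint_tag: latest
      # Assumed format: select the exponential-moving-average weights
      # mentioned above; verify the expected shape of model_info.
      model_info: ema=0.9999
```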
WandB tracker¶
We implemented a Weights and Biases (WandB) tracker to log metrics during training and evaluation:
kind: noether.core.trackers.WandBTracker
entity: null # ADD YOUR_WANDB_ENTITY_HERE
project: null # ADD YOUR_WANDB_PROJECT_HERE
Simply add your own WandB entity and project to start logging.
For more details on experiment tracking, see Experiment Tracking.
Extra utilities and tips¶
- Output path: The output path is undefined by default and must be configured. In this walkthrough, we set it to ./outputs. The Noether Framework will use the generated run_id to store the checkpoints for each training run in subfolders.
- Physics features: You can set physics_features to true for the multi-stage AeroMultistagePipeline. This only works for ShapeNet-Car and will add the SDF and normal vectors to the coordinate inputs. However, we never properly utilized these features in our experiments, and they are not implemented for other datasets.
- Code snapshots: By default, a snapshot of the codebase is stored as part of the checkpoints for reproducibility.
Batch size considerations: Almost all experiments we ran for the AB-UPT paper use a batch size of 1. However, the data loading pipeline is implemented to work with batches larger than 1 (including with physics features). Note that we never thoroughly validated these results or checked for potential training/data loading instabilities with larger batch sizes.
- Effective batch size and gradient accumulation: The effective_batch_size parameter defines the total number of samples processed before performing an optimizer step (also known as the "global batch size"). In multi-GPU setups, the local batch size per device is calculated as effective_batch_size / number of GPUs. When gradient accumulation is enabled, the batch size is further divided by the number of accumulation steps. To enable gradient accumulation, set the max_batch_size parameter. For example, with max_batch_size=2 and effective_batch_size=8, the framework will perform 4 gradient accumulation steps before updating the model weights.
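The batch-size arithmetic described above can be checked by hand. The snippet below works through the example values (the framework performs this calculation internally; this is purely illustrative):

```shell
# Worked example of the effective-batch-size arithmetic described above.
EFFECTIVE_BATCH_SIZE=8   # global batch size per optimizer step
NUM_GPUS=1               # single-GPU setup in this example
MAX_BATCH_SIZE=2         # largest batch a single forward pass handles

# Local batch size per device:
LOCAL_BATCH_SIZE=$((EFFECTIVE_BATCH_SIZE / NUM_GPUS))   # 8

# Gradient accumulation steps before each optimizer update:
ACCUM_STEPS=$((LOCAL_BATCH_SIZE / MAX_BATCH_SIZE))      # 4

echo "local batch size: $LOCAL_BATCH_SIZE, accumulation steps: $ACCUM_STEPS"
```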