How to launch a SLURM job from the command line¶
The noether-train-submit-job command validates a Hydra training config and
submits it to SLURM via submitit.
You pass the config path, optionally with Hydra-style overrides; the script
validates the result against your schema before any job is sent.
Single-job submission¶
noether-train-submit-job /path/to/recipes/aero_cfd/configs/train_shapenet.yaml \
+experiment/shapenet=transformer \
+seed=1 \
tracker=disabled \
dataset_root=/path/to/datasets/shapenet_car/
Sweeping with --multirun (SLURM array jobs)¶
Add --multirun (or -m) and use Hydra’s comma sweep syntax to launch a
single SLURM array job whose tasks span the cross-product of overrides. Every
combination is validated upfront on the login node — submission is aborted on
the first invalid combo, before SLURM sees anything.
noether-train-submit-job --hp configs/train_shapenet.yaml -m \
+seed=1,2,3 trainer.lr=1e-3,1e-4
The example above produces a 6-task SLURM array. Cap the concurrency with
slurm.slurm_array_parallelism.
Sweeping over fields under slurm: is not supported — the SLURM
allocation must be identical across all array tasks. The submitter rejects
configs whose slurm section differs between sweep combinations.
Previewing without submitting¶
noether-train-submit-job --hp configs/train_shapenet.yaml --dry-run
Prints the resolved submitit parameters and per-task command(s) without contacting SLURM.
The slurm section of the config¶
The slurm section is required. Field names mirror the keyword arguments
accepted by submitit.AutoExecutor.update_parameters(). An example:
slurm:
folder: /home/%u/logs/shapenet_car # %u = username; also default output_path
name: shapenet_experiment
slurm_partition: compute
nodes: 1
tasks_per_node: 1
cpus_per_task: 28
gpus_per_node: 1
mem_gb: 64 # gigabytes (float)
timeout_min: 720 # wall-clock minutes (int)
slurm_setup: # commands run before the main task
- source .venv/bin/activate
slurm_additional_parameters: # escape hatch for any sbatch flag
nice: 0
Common fields:
folder— directory where submitit writes the job script, pickled task, and stdout/stderr logs. Supports%u(current username) interpolation. Also serves as the defaultoutput_pathfor training runs whenoutput_pathis not set in the config. SLURM job-time patterns (%j,%A, etc.) are not supported.name— job name (--job-name).slurm_partition— partition to submit to.nodes/tasks_per_node/cpus_per_task/gpus_per_node— resource allocation per the obvioussbatchflags.gpus_per_nodeaccepts"a100:4"-style specs.mem_gb— memory per node in gigabytes (float).timeout_min— wall-clock limit in minutes (int).slurm_array_parallelism— cap concurrent tasks for--multirunarrays.slurm_setup— list of shell commands run inside the job before the main command (replaces the previousenv_pathfield).slurm_additional_parameters— dict for any other sbatch directive (nice,reservation,chdir,account,constraint, …). Keys are passed as--key=valuetosbatch.
Important
Job log files are owned by submitit. The --output and --error
sbatch directives are intentionally not exposed as first-class fields.
Submitit writes the job’s stdout/stderr to:
<slurm.folder>/<job_id>_log.out<slurm.folder>/<job_id>_log.err
For array jobs, <job_id> becomes <array_master_id>_<task_idx> (e.g.
42_0_log.out). To direct logs to a per-experiment location, point
slurm.folder at it. If you absolutely must override SLURM’s output
paths yourself, pass output/error via slurm_additional_parameters
— but be aware this disables submitit’s log-tailing helpers such as
job.stdout() and job.stderr().