noether.core.schemas.slurm

Classes

SlurmConfig

Configuration for SLURM job submission via submitit.

Module Contents

class noether.core.schemas.slurm.SlurmConfig(/, **data)

Bases: pydantic.BaseModel

Configuration for SLURM job submission via submitit.

Field names mirror the keyword arguments accepted by submitit.AutoExecutor.update_parameters(). Most fields are optional and default to None, meaning the cluster default is used. The exceptions are folder (default 'submitit_logs') and timeout_min (default 0).

Note

Job stdout/stderr is owned by submitit and written to <folder>/<job_id>_log.out / <folder>/<job_id>_log.err. Use the folder field to control where these files land. SLURM --output/--error directives are intentionally not exposed; pass them via slurm_additional_parameters if you really need to override submitit’s defaults (this disables job.stdout() helpers).

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

data (Any)

model_config

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

folder: str = 'submitit_logs'

Directory where submitit writes the job script, pickled task, and stdout/stderr logs. Per-job files are named <job_id>_log.out etc. inside this directory. This is also used as the default output_path for training runs (see ConfigSchema.output_path).

Supports %u (current username) interpolation, e.g. /home/%u/logs/experiment. SLURM job-time patterns like %j are not supported because submitit needs the directory to exist before submission.
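The %u interpolation described above can be pictured with a small sketch. The helper name expand_folder and its rejection of %j are assumptions made for illustration; how noether actually performs the substitution is not shown in this reference.

```python
from __future__ import annotations

import getpass


def expand_folder(folder: str, user: str | None = None) -> str:
    """Replace every %u in ``folder`` with a username (hypothetical helper)."""
    # Assumption: job-time patterns such as %j are rejected up front, because
    # the directory has to exist before the job is submitted.
    if "%j" in folder:
        raise ValueError("job-time patterns like %j are not supported in folder")
    if user is None:
        # Fall back to the name of the user running the process.
        user = getpass.getuser()
    return folder.replace("%u", user)


expanded = expand_folder("/home/%u/logs/experiment", user="alice")
print(expanded)
```

Passing user explicitly keeps the sketch deterministic; leaving it out resolves the name from the current environment.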

name: str | None = None

Job name (SLURM --job-name).

nodes: int | None = None

Number of nodes to allocate.

tasks_per_node: int | None = None

Number of tasks per allocated node.

cpus_per_task: int | None = None

Number of CPUs per task.

gpus_per_node: int | str | None = None

GPUs per node. Accepts a count or type:count (e.g. "a100:4").
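To make the two accepted forms concrete, here is a minimal sketch that splits a gpus_per_node value into an optional GPU type and a count. The function split_gpu_spec is hypothetical and not part of noether or submitit; the value itself is forwarded to the scheduler as given.

```python
from __future__ import annotations


def split_gpu_spec(gpus_per_node: int | str) -> tuple[str | None, int]:
    """Split a count or "type:count" value into (gpu_type, count)."""
    if isinstance(gpus_per_node, int):
        # A bare integer carries no GPU type.
        return None, gpus_per_node
    # rpartition yields ("", "", value) when the string contains no colon.
    gpu_type, sep, count = gpus_per_node.rpartition(":")
    return (gpu_type if sep else None), int(count)


typed = split_gpu_spec("a100:4")
untyped = split_gpu_spec(4)
print(typed, untyped)
```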

mem_gb: float | None = None

Memory per node in gigabytes.

timeout_min: int = 0

Wall-clock limit in minutes. Use 0 for no time limit.

stderr_to_stdout: bool | None = None

If True, merge stderr into stdout.

slurm_partition: str | None = None

Partition to submit the job to.

slurm_array_parallelism: int | None = None

Maximum number of array tasks running concurrently (SLURM %N in --array).
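As a rough illustration of the %N suffix, the sketch below builds the value of a SLURM --array directive for a zero-based task range. The helper array_spec is hypothetical; submitit assembles the real directive itself when an array job is submitted.

```python
from __future__ import annotations


def array_spec(num_tasks: int, parallelism: int | None = None) -> str:
    """Build an --array value such as "0-9%4" (hypothetical helper)."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    spec = f"0-{num_tasks - 1}"
    if parallelism is not None:
        # The %N suffix caps how many array tasks run at the same time.
        spec += f"%{parallelism}"
    return spec


limited = array_spec(10, parallelism=4)
unlimited = array_spec(10)
print(limited, unlimited)
```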

slurm_setup: list[str] | None = None

Shell commands run inside the job before the main command, e.g. ["source .venv/bin/activate"].

slurm_additional_parameters: dict[str, Any] | None = None

Escape hatch for SLURM directives not exposed as first-class fields, e.g. {"nice": 0, "reservation": "my_res", "chdir": "/work"}. Keys are passed as --key=value to sbatch.
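The key-to-flag mapping described above can be sketched as follows. The function to_sbatch_flags is a hypothetical illustration of the --key=value convention only; it is not how submitit renders the directives internally.

```python
from __future__ import annotations

from typing import Any


def to_sbatch_flags(params: dict[str, Any]) -> list[str]:
    """Render each entry as a --key=value string, preserving dict order."""
    return [f"--{key}={value}" for key, value in params.items()]


flags = to_sbatch_flags({"nice": 0, "reservation": "my_res", "chdir": "/work"})
print(flags)
```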

to_executor_kwargs()

Return (folder, update_parameters_kwargs) for submitit.AutoExecutor.

Generic fields are passed under their bare name; everything else keeps its slurm_ prefix so submitit routes it to the slurm executor.

Returns:

A tuple (folder, kwargs) where folder is the executor’s log directory and kwargs is the dict to splat into executor.update_parameters(**kwargs).

Return type:

tuple[str, dict[str, Any]]
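The shape of the returned tuple can be sketched with a standard-library stand-in. SlurmConfigSketch is a hypothetical dataclass that carries only a subset of the documented fields, and dropping None-valued fields is an assumption based on the statement that None means the cluster default is used. The real method may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Any


@dataclass
class SlurmConfigSketch:
    """Hypothetical stand-in for SlurmConfig with a few documented fields."""

    folder: str = "submitit_logs"
    name: str | None = None
    nodes: int | None = None
    timeout_min: int = 0
    slurm_partition: str | None = None

    def to_executor_kwargs(self) -> tuple[str, dict[str, Any]]:
        # Assumption: fields left at None are omitted so that the cluster
        # default applies; folder is returned separately from the kwargs.
        kwargs = {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if f.name != "folder" and getattr(self, f.name) is not None
        }
        return self.folder, kwargs


folder, kwargs = SlurmConfigSketch(nodes=2, slurm_partition="gpu").to_executor_kwargs()
print(folder, kwargs)
```

With the real class, the result would typically be consumed as executor = submitit.AutoExecutor(folder=folder) followed by executor.update_parameters(**kwargs).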