noether.training.cli.submit_job

Classes

`SlurmConfig`	Configuration for SLURM job submission via submitit.

Functions

`validate_config`(config)	Validate the configuration using the schema declared by `config_schema_kind`.
`main`()	Entry point for `noether-train-submit-job`.

Module Contents

class noether.training.cli.submit_job.SlurmConfig(/, **data)

Bases: pydantic.BaseModel

Configuration for SLURM job submission via submitit.

Field names mirror the keyword arguments accepted by submitit.AutoExecutor.update_parameters(). Unless noted otherwise, fields are optional and default to None, meaning the cluster default is used; folder and timeout_min carry explicit defaults.

Note

Job stdout/stderr is owned by submitit and written to <folder>/<job_id>_log.out / <folder>/<job_id>_log.err. Use the folder field to control where these files land. SLURM --output/--error directives are intentionally not exposed; pass them via slurm_additional_parameters if you really need to override submitit’s defaults (this disables job.stdout() helpers).

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters: data (Any)

model_config: Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

folder: str = 'submitit_logs'

Directory where submitit writes the job script, pickled task, and stdout/stderr logs. Per-job files are named <job_id>_log.out etc. inside this directory. This is also used as the default output_path for training runs (see ConfigSchema.output_path).

Supports %u (current username) interpolation, e.g. /home/%u/logs/experiment. SLURM job-time patterns like %j are not supported because submitit needs the directory to exist before submission.
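The %u interpolation described above can be pictured with a small sketch. The helper name expand_folder is hypothetical and not part of this module, and rejecting %j with an error is an assumption made for illustration; the documentation only states that such patterns are not supported.

```python
import getpass
from pathlib import Path
from typing import Optional


def expand_folder(folder: str, user: Optional[str] = None) -> Path:
    """Hypothetical helper: substitute %u and refuse job-time patterns."""
    # %j is only known once SLURM has assigned a job ID, so the directory
    # could not be created before submission. Raising here is an assumption.
    if "%j" in folder:
        raise ValueError("%j cannot be used in folder; it is resolved at job time")
    # Fall back to the current username when none is given explicitly.
    user = user if user is not None else getpass.getuser()
    return Path(folder.replace("%u", user))


# The username is passed explicitly so the result does not depend on the host.
print(expand_folder("/home/%u/logs/experiment", user="alice"))
```

Paths without %u pass through unchanged, so the default 'submitit_logs' stays a relative directory.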

name: str | None = None: Job name (SLURM --job-name).

nodes: int | None = None: Number of nodes to allocate.

tasks_per_node: int | None = None: Number of tasks per allocated node.

cpus_per_task: int | None = None: Number of CPUs per task.

gpus_per_node: int | str | None = None: GPUs per node. Accepts a count or type:count (e.g. "a100:4").

mem_gb: float | None = None: Memory per node in gigabytes.

timeout_min: int = 0: Wall-clock limit in minutes. Use 0 for no time limit.

stderr_to_stdout: bool | None = None: If True, merge stderr into stdout.

slurm_partition: str | None = None: Partition to submit the job to.

slurm_array_parallelism: int | None = None: Maximum number of array tasks running concurrently (SLURM %N in --array).

slurm_setup: list[str] | None = None: Shell commands run inside the job before the main command, e.g. ["source .venv/bin/activate"].

slurm_additional_parameters: dict[str, Any] | None = None: Escape hatch for SLURM directives not exposed as first-class fields, e.g. {"nice": 0, "reservation": "my_res", "chdir": "/work"}. Keys are passed as --key=value to sbatch.
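To make the --key=value mapping concrete, here is a minimal sketch. The function to_sbatch_flags is hypothetical; the actual rendering is done by submitit when it writes the job script, and this only illustrates the mapping described above.

```python
from typing import Any


def to_sbatch_flags(params: dict[str, Any]) -> list[str]:
    """Illustrative only: render each key/value pair as --key=value."""
    return [f"--{key}={value}" for key, value in params.items()]


flags = to_sbatch_flags({"nice": 0, "reservation": "my_res", "chdir": "/work"})
print(flags)  # ['--nice=0', '--reservation=my_res', '--chdir=/work']
```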

slurm_srun_args: list[str] | None = None: Extra arguments for the inner srun launched by submitit. When left unset, it defaults to _DEFAULT_SRUN_ARGS (['--cpu-bind=none']) to avoid SLURM "Unable to satisfy cpu bind request" failures on clusters whose GPU-partition CPU masks are non-contiguous. On such clusters, srun’s default per-task CPU binding cannot be satisfied and the job step aborts before the program starts. Set to [] to restore srun’s default binding, or provide your own args.

to_executor_kwargs()

Return (folder, update_parameters_kwargs) for submitit.AutoExecutor.

Generic fields are passed under their bare name; everything else keeps its slurm_ prefix so submitit routes it to the slurm executor.

Returns: A tuple (folder, kwargs) where folder is the executor’s log directory and kwargs is the dict to splat into executor.update_parameters(**kwargs).
Return type: tuple[str, dict[str, Any]]

noether.training.cli.submit_job.validate_config(config)
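The shape of the return value can be illustrated with a simplified stand-in. SlurmConfigSketch is hypothetical: it is a plain dataclass carrying only a few of the documented fields, not the real pydantic model, and dropping None-valued fields is an assumption about how "cluster default" is achieved.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional, Union


@dataclass
class SlurmConfigSketch:
    """Simplified stand-in with a handful of the documented fields."""

    folder: str = "submitit_logs"
    name: Optional[str] = None
    nodes: Optional[int] = None
    gpus_per_node: Optional[Union[int, str]] = None
    slurm_partition: Optional[str] = None

    def to_executor_kwargs(self) -> tuple[str, dict[str, Any]]:
        values = asdict(self)
        # folder is returned separately because it configures the executor itself.
        folder = values.pop("folder")
        # Assumption: unset fields are omitted so the cluster default applies.
        kwargs = {key: value for key, value in values.items() if value is not None}
        return folder, kwargs


config = SlurmConfigSketch(name="train", nodes=2, slurm_partition="gpu")
folder, kwargs = config.to_executor_kwargs()
print(folder, kwargs)

# With the real classes, the result would feed submitit roughly like this:
#   executor = submitit.AutoExecutor(folder=folder)
#   executor.update_parameters(**kwargs)
```

Note that slurm_partition keeps its prefix in the resulting dict while name and nodes appear under their bare names, matching the routing rule described above.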

Validate the configuration using the schema declared by config_schema_kind.

Parameters:

config (omegaconf.DictConfig) – The composed Hydra configuration to validate.

Returns:

The validated configuration schema instance.

Raises:

ImportError – If the schema class cannot be imported.
ValidationError – If the configuration does not satisfy the schema.

Return type:

noether.core.schemas.schema.ConfigSchema
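One plausible reading of "the schema declared by config_schema_kind" is that the field holds a dotted import path to the schema class, which would explain the ImportError case. The sketch below works under that assumption; import_schema_class is a hypothetical helper, and a standard-library class stands in for a real schema.

```python
import importlib


def import_schema_class(kind: str) -> type:
    """Hypothetical resolver: 'package.module.ClassName' -> class object."""
    module_path, _, class_name = kind.rpartition(".")
    if not module_path:
        raise ImportError(f"Expected a dotted path, got {kind!r}")
    # import_module raises ModuleNotFoundError (an ImportError) if the module is missing.
    module = importlib.import_module(module_path)
    try:
        return getattr(module, class_name)
    except AttributeError as exc:
        raise ImportError(f"{class_name!r} not found in {module_path!r}") from exc


# Standard-library class used purely as a stand-in for a schema class.
schema_cls = import_schema_class("collections.OrderedDict")
print(schema_cls)
```

Validation itself would then amount to instantiating the resolved schema with the composed configuration, which is where a ValidationError would surface.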

noether.training.cli.submit_job.main()

Entry point for noether-train-submit-job.

Validates a Hydra config (and every multirun combination thereof) and submits a training job — or a SLURM array job — via submitit.

Return type: None
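As a rough picture of what "every multirun combination" means, the sketch below expands a sweep into its Cartesian product, which is how Hydra's default sweeper treats comma-separated override values. The sweep keys and values are hypothetical, and the mapping of one combination to one array task is an assumption.

```python
from itertools import product

# Hypothetical sweep: two learning rates crossed with two seeds.
sweep = {"optimizer.lr": [1e-3, 1e-4], "seed": [0, 1]}

# Each combination is one set of overrides that would be validated separately.
combinations = [dict(zip(sweep, values)) for values in product(*sweep.values())]

for index, overrides in enumerate(combinations):
    # Assumption: each combination corresponds to one task of the SLURM array job.
    print(index, overrides)
```

Validating all combinations before submission means a bad override fails fast on the login node rather than inside a queued array task.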