noether.training.cli.submit_job¶
Classes¶
SlurmConfig | Configuration for SLURM job submission via submitit.
Functions¶
validate_config(config) | Validate the configuration using the schema declared by config_schema_kind.
main() | Entry point for noether-train-submit-job.
Module Contents¶
- class noether.training.cli.submit_job.SlurmConfig(/, **data)¶
Bases: pydantic.BaseModel
Configuration for SLURM job submission via submitit.
Field names mirror the keyword arguments accepted by submitit.AutoExecutor.update_parameters(). All fields are optional and default to None, meaning the cluster default is used.
Note
Job stdout/stderr is owned by submitit and written to <folder>/<job_id>_log.out and <folder>/<job_id>_log.err. Use the folder field to control where these files land. SLURM --output/--error directives are intentionally not exposed; pass them via slurm_additional_parameters if you really need to override submitit’s defaults (this disables the job.stdout() helpers).
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
data (Any)
- model_config¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- folder: str = 'submitit_logs'¶
Directory where submitit writes the job script, pickled task, and stdout/stderr logs. Per-job files are named <job_id>_log.out etc. inside this directory. This is also used as the default output_path for training runs (see ConfigSchema.output_path).
Supports %u (current username) interpolation, e.g. /home/%u/logs/experiment. SLURM job-time patterns like %j are not supported because submitit needs the directory to exist before submission.
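The %u interpolation described above can be illustrated with a minimal sketch. The helper name expand_folder and its username argument are hypothetical and are not part of noether; the real substitution may be implemented differently.

```python
from __future__ import annotations

import getpass


def expand_folder(folder: str, username: str | None = None) -> str:
    """Replace every %u in ``folder`` with a username (hypothetical helper)."""
    if "%u" not in folder:
        # Nothing to interpolate; return the path unchanged.
        return folder
    # Fall back to the current user only when no name is supplied.
    user = username if username is not None else getpass.getuser()
    return folder.replace("%u", user)


print(expand_folder("/home/%u/logs/experiment", username="alice"))
```

Passing the username explicitly keeps the sketch deterministic; omitting it uses the name of the user running the process.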
- gpus_per_node: int | str | None = None¶
GPUs per node. Accepts a count or type:count (e.g. "a100:4").
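To make the two accepted forms concrete, the sketch below splits a gpus_per_node value into an optional GPU type and a count. parse_gpus_per_node is a hypothetical helper written for illustration; noether and submitit may simply forward the value to SLURM unchanged.

```python
from __future__ import annotations


def parse_gpus_per_node(value: int | str) -> tuple[str | None, int]:
    """Split a count or 'type:count' value into (gpu_type, count)."""
    if isinstance(value, int):
        return None, value
    # rpartition keeps everything before the last ':' as the type.
    gpu_type, sep, count = value.rpartition(":")
    if not sep:
        # A bare numeric string such as "2" carries no type.
        return None, int(value)
    return gpu_type, int(count)


print(parse_gpus_per_node("a100:4"))
print(parse_gpus_per_node(4))
```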
- slurm_array_parallelism: int | None = None¶
Maximum number of array tasks running concurrently (SLURM %N in --array).
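For readers unfamiliar with the %N suffix, the following sketch builds the value that SLURM expects for --array. The function array_spec is hypothetical; submitit constructs the real directive itself, so this only illustrates the syntax.

```python
from __future__ import annotations


def array_spec(num_tasks: int, parallelism: int | None = None) -> str:
    """Build a SLURM --array value such as '0-9%4' (illustrative only)."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    spec = f"0-{num_tasks - 1}"
    if parallelism is not None:
        # '%N' caps the number of array tasks running at the same time.
        spec += f"%{parallelism}"
    return spec


print(array_spec(10, 4))
```

With ten tasks and a parallelism of four, at most four array tasks would run at once while the rest wait in the queue.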
- slurm_setup: list[str] | None = None¶
Shell commands run inside the job before the main command, e.g. ["source .venv/bin/activate"].
- slurm_additional_parameters: dict[str, Any] | None = None¶
Escape hatch for SLURM directives not exposed as first-class fields, e.g. {"nice": 0, "reservation": "my_res", "chdir": "/work"}. Keys are passed as --key=value to sbatch.
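The key-to-flag mapping can be pictured with a short sketch. to_sbatch_flags is a hypothetical helper, not a function of noether or submitit; the actual rendering happens inside submitit when it writes the job script.

```python
from __future__ import annotations

from typing import Any


def to_sbatch_flags(params: dict[str, Any]) -> list[str]:
    """Render each entry as a '--key=value' flag, preserving dict order."""
    return [f"--{key}={value}" for key, value in params.items()]


flags = to_sbatch_flags({"nice": 0, "reservation": "my_res", "chdir": "/work"})
print(flags)
```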
- slurm_srun_args: list[str] | None = None¶
Extra arguments for the inner srun launched by submitit. When left unset it defaults to _DEFAULT_SRUN_ARGS (['--cpu-bind=none']) to avoid SLURM "Unable to satisfy cpu bind request" failures on clusters whose GPU-partition CPU masks are non-contiguous. On such clusters srun’s default per-task CPU binding cannot be satisfied and the job step aborts before the program starts. Set to [] to restore srun’s default binding, or provide your own args.
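The difference between leaving the field unset and setting it to [] is easy to miss, so here is a minimal sketch of the defaulting rule as described above. resolve_srun_args is a hypothetical name; only _DEFAULT_SRUN_ARGS and its value come from the documentation.

```python
from __future__ import annotations

_DEFAULT_SRUN_ARGS = ["--cpu-bind=none"]


def resolve_srun_args(slurm_srun_args: list[str] | None) -> list[str]:
    """Apply the default only when the field is unset (None)."""
    if slurm_srun_args is None:
        # Unset: use the default that disables CPU binding.
        return list(_DEFAULT_SRUN_ARGS)
    # An explicit list, including the empty list, is used as given.
    return list(slurm_srun_args)


print(resolve_srun_args(None))
print(resolve_srun_args([]))
```

The check is against None rather than truthiness, which is what lets an empty list mean "no extra arguments" instead of falling back to the default.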
- to_executor_kwargs()¶
Return (folder, update_parameters_kwargs) for submitit.AutoExecutor.
Generic fields are passed under their bare name; everything else keeps its slurm_ prefix so submitit routes it to the slurm executor.
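A rough sketch of the returned pair follows. It works on a plain dict rather than on SlurmConfig, the helper split_executor_kwargs is hypothetical, and dropping None values is an assumption based on the statement that None means the cluster default is used.

```python
from __future__ import annotations

from typing import Any


def split_executor_kwargs(fields: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Separate ``folder`` from the remaining kwargs (illustrative only)."""
    remaining = dict(fields)  # copy so the caller's dict is untouched
    folder = remaining.pop("folder")
    # Assumption: unset (None) fields are omitted so the cluster default applies.
    kwargs = {name: value for name, value in remaining.items() if value is not None}
    return folder, kwargs


folder, kwargs = split_executor_kwargs(
    {
        "folder": "submitit_logs",
        "gpus_per_node": 4,
        "slurm_array_parallelism": None,
        "slurm_setup": ["source .venv/bin/activate"],
    }
)
print(folder, kwargs)
```

A pair of this shape would typically be consumed by constructing submitit.AutoExecutor(folder=folder) and then calling update_parameters(**kwargs) on the executor.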
- noether.training.cli.submit_job.validate_config(config)¶
Validate the configuration using the schema declared by config_schema_kind.
- Parameters:
config (omegaconf.DictConfig) – The composed Hydra configuration to validate.
- Returns:
The validated configuration schema instance.
- Raises:
ImportError – If the schema class cannot be imported.
ValidationError – If the configuration does not satisfy the schema.
- Return type:
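The ImportError case above can be illustrated with a sketch of loading a class from a dotted path. This assumes config_schema_kind holds a string such as package.module.ClassName, which the documentation does not confirm, and import_schema_class is a hypothetical helper.

```python
import importlib


def import_schema_class(dotted_path: str) -> type:
    """Import a class from 'package.module.ClassName' (assumed format)."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ImportError(f"Expected 'package.module.ClassName', got {dotted_path!r}")
    # import_module raises ImportError itself when the module is missing.
    module = importlib.import_module(module_name)
    try:
        return getattr(module, class_name)
    except AttributeError as exc:
        # Normalise a missing attribute to ImportError for a single failure mode.
        raise ImportError(f"{module_name!r} has no attribute {class_name!r}") from exc


print(import_schema_class("collections.OrderedDict"))
```

Once the class is loaded, validating the composed config against it is what would surface a ValidationError.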
- noether.training.cli.submit_job.main()¶
Entry point for noether-train-submit-job.
Validates a Hydra config (and every multirun combination thereof) and submits a training job, or a SLURM array job, via submitit.
- Return type:
None
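To show what "every multirun combination" amounts to, the sketch below expands a sweep into its Cartesian product. Hydra performs this expansion itself; expand_sweep and the lr and seed parameters are hypothetical and serve only as illustration.

```python
from __future__ import annotations

import itertools
from typing import Any


def expand_sweep(sweep: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Return one override dict per combination of swept values."""
    keys = list(sweep)
    value_lists = [sweep[key] for key in keys]
    # product() yields one tuple per combination, in key order.
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


combos = expand_sweep({"lr": [0.001, 0.0001], "seed": [0, 1]})
print(len(combos), combos[0])
```

Each combination would be validated before submission; an empty sweep yields a single empty override set, which corresponds to a plain single-job run.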