noether.training.callbacks.profiler¶
Classes¶
Profiles the training loop with torch.profiler.profile.
Module Contents¶
- class noether.training.callbacks.profiler.PyTorchProfilerCallback(callback_config, **kwargs)¶
Bases: noether.core.callbacks.periodic.PeriodicCallback

Profiles the training loop with torch.profiler.profile. The profiler is entered in before_training(), stepped once per optimizer update in track_after_update_step(), and exited in after_training(). Traces are written to <run_output_path>/<trace_subdir> via tensorboard_trace_handler and can be loaded in TensorBoard (tensorboard --logdir <path>) or inspected in chrome://tracing.

Note
every_n_updates: 1 must be set in the config (in fact, any every_n_* value works — it only gates the unused periodic_callback() hook, not the tracking hooks, so track_after_update_step() is still called on every update).

Example
```yaml
callbacks:
  - kind: callbacks.PyTorchProfilerCallback
    every_n_updates: 1
    wait: 1
    warmup: 1
    active: 3
    repeat: 2
    record_shapes: true
    profile_memory: false
    with_stack: false
    with_flops: false
    with_modules: true
    activities:
      - cpu
      - cuda
```
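The callback's lifecycle maps directly onto torch.profiler's API. A minimal standalone sketch of the same wait/warmup/active/repeat schedule (the trace directory and dummy workload are illustrative, not part of noether):

```python
import tempfile

import torch
from torch.profiler import (
    ProfilerActivity, profile, schedule, tensorboard_trace_handler,
)

# Illustrative trace directory; the callback writes to
# <run_output_path>/<trace_subdir> instead.
trace_dir = tempfile.mkdtemp()

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

prof = profile(
    activities=activities,
    # wait/warmup/active/repeat mirror the config keys in the example above.
    schedule=schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=tensorboard_trace_handler(trace_dir),
    record_shapes=True,
    with_modules=True,
)

prof.start()                   # what before_training() does
x = torch.randn(32, 32)
for _ in range(10):            # covers two full wait+warmup+active cycles
    y = (x @ x).relu().sum()   # stand-in for forward/backward/update
    prof.step()                # what track_after_update_step() does
prof.stop()                    # what after_training() does
```

After the `active` window of each cycle completes, tensorboard_trace_handler writes a `*.pt.trace.json` file into the directory, which is what TensorBoard or chrome://tracing loads.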
- Parameters:
callback_config (noether.core.schemas.callbacks.PyTorchProfilerCallbackConfig) – Configuration for the callback. See CallBackBaseConfig for available options.
trainer – Trainer of the current run.
model – Model of the current run.
data_container – DataContainer instance that provides access to all datasets.
tracker – BaseTracker instance to log metrics to stdout/disk/online platform.
log_writer – LogWriter instance to log metrics.
checkpoint_writer – CheckpointWriter instance to save checkpoints.
metric_property_provider – MetricPropertyProvider instance to access properties of metrics.
name – Name of the callback.
- before_training(*, update_counter)¶
Hook called once before the training loop starts.
This method is intended to be overridden by derived classes to perform initialization tasks before training begins. Common use cases include:
- Initializing experiment tracking (e.g., logging hyperparameters)
- Printing model summaries or architecture details
- Initializing specific data structures or buffers needed during training
- Performing sanity checks on the data or configuration
Note
This method is executed within a torch.no_grad() context.

- Parameters:
update_counter (noether.core.utils.training.counter.UpdateCounter) – UpdateCounter instance to access current training progress.
- Return type:
None
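As a sketch of what an override of this hook might look like, here is a minimal callback that reports the model's parameter count once before training. The class, its constructor, and the integer return value are hypothetical illustrations, not noether's actual base-class signature (the real hook returns None):

```python
import torch
import torch.nn as nn


class ParamCountCallback:
    """Hypothetical callback that reports the parameter count once,
    before the training loop starts."""

    def __init__(self, model: nn.Module):
        self.model = model

    def before_training(self, *, update_counter=None) -> int:
        # The framework already wraps this hook in torch.no_grad();
        # it is shown explicitly here so the sketch stands alone.
        with torch.no_grad():
            n_params = sum(p.numel() for p in self.model.parameters())
        print(f"model has {n_params} parameters")
        # Returned only so the sketch is easy to check; the real hook
        # returns None.
        return n_params


ParamCountCallback(nn.Linear(4, 2)).before_training()
```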
- track_after_update_step(*, update_counter, times)¶
Hook called after each optimizer update step.
This method is invoked after a successful optimizer step and parameter update. It is typically used for tracking metrics that should be recorded once per update cycle, such as:
- Latest loss values
- Learning rates
- Model parameter statistics (norms, etc.)
- Training throughput and timing measurements
Unlike periodic_callback(), this hook is called on every update step, making it suitable for maintaining running averages or high-frequency telemetry.

Note

This method is executed within a torch.no_grad() context.

- Parameters:
update_counter (noether.core.utils.training.counter.UpdateCounter) – UpdateCounter instance to access current training progress.
times (dict[str, float]) – Dictionary containing time measurements for various parts of the training step (e.g., 'data_time', 'forward_time', 'backward_time', 'update_time').
- Return type:
None
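For illustration, here is the kind of aggregation a subclass might perform on the times dict before logging. The helper name and the timing values are hypothetical; only the dict keys follow the examples in the parameter description above:

```python
def summarize_step_times(times: dict[str, float]) -> dict[str, float]:
    """Reduce the per-phase timings handed to track_after_update_step
    into a total plus the fraction of the step each phase consumed."""
    total = sum(times.values())
    summary = {"total_time": total}
    for phase, seconds in times.items():
        summary[f"{phase}_frac"] = seconds / total
    return summary


# Keys follow the docstring's examples; the values are made up.
summarize_step_times({
    "data_time": 0.01, "forward_time": 0.03,
    "backward_time": 0.05, "update_time": 0.01,
})
```

Because this hook fires on every update, such per-step summaries can feed running averages without any additional scheduling logic.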
- after_training(*, update_counter)¶
Hook called once after the training loop finishes.
This method is intended to be overridden by derived classes to perform cleanup or final reporting tasks after training is complete. Common use cases include:
- Performing a final evaluation on the test set
- Saving final model weights or artifacts
- Sending notifications (e.g., via Slack or email) about the completed run
- Closing or finalizing experiment tracking sessions
Note
This method is executed within a torch.no_grad() context.

- Parameters:
update_counter (noether.core.utils.training.counter.UpdateCounter) – UpdateCounter instance to access current training progress.
- Return type:
None