Diffusion UQ on DrivAerML

Uncertainty quantification on surface pressure

Generative uncertainty quantification for steady CFD fields on top of the AB-UPT backbone via data-space diffusion — denoise surface + volume fields directly at anchor positions, with one backbone forward per Euler step. An ensemble of draws yields per-point std, calibration of integrated quantities (Cd, Cl).

The recipe is a uv-managed package (pyproject.toml) that pulls emmiai-noether and aero-cfd-recipe as editable path deps, so it picks up the surrounding repo’s src/ and the sibling aero_cfd recipe automatically.

Method

We train a flow matching model in the data space of field values. Concretely:

  • Objective. FM Velocity prediction along the straight-line rectified-flow interpolant between Gaussian noise and the clean target. One Euler step = one AB-UPT forward.

  • Conditioning. Geometry comes in via the standard AB-UPT supernode + anchor pipeline; the diffusion time t [0, 1] enters via DiT-style (Peebles & Xie, 2022) adaptive layer norm on every transformer block. Noisy field values are concatenated to the anchor inputs.

  • Optional minibatch-OT pairing. --minibatch-ot pairs noise / target samples within a batch by optimal transport (Tong et al., 2023) for straighter, lower-variance trajectories.

  • UQ. A trained model is sampled N times per geometry from independent noise seeds; the per-point sample std is the uncertainty estimate, and the ensemble mean is the point prediction.

Training

Data-space diffusion (flow matching)

uv run python -m scripts.train_dataspace_diffusion \
    --dataset-root $DATASET \
    --output-path ./outputs/abupt_diffusion \
    --paradigm flow_matching \
    --max-epochs 500 --batch-size 1 --lr 5e-5 \
    --ema-decays 0.9999

Defaults match the reported run: 9.1M params, hidden 192, 3 heads, 16K supernodes, 65K geometry points, 16K surface + 16K volume anchors. Pass --minibatch-ot to enable OT pairing.

Regression baseline

--paradigm regression swaps in AeroABUPT + WeightedLossTrainer for a deterministic baseline at matched backbone, geometry budget, and optimizer schedule. run_abupt_regression.sbatch.

SLURM

File

Stage

run_abupt_diffusion.sbatch

data-space diffusion (full hyperparams baked in)

run_abupt_regression.sbatch

deterministic AeroABUPT baseline

Both launchers prepend $SLURM_JOB_ID to config.run_id so wandb and output dirs trace back to the job.

Evaluation

Single-GPU eval scripts that load a trained checkpoint and dispatch in-process via noether.inference.evaluate:

Script

What it does

scripts/eval_sampling_steps_sweep.py

Side-by-side metrics for multiple Euler-step counts (--steps 1 2 4 8 16) in one pass; emits per-step steps_{n}/ prefixes to wandb.

scripts/eval_uq.py

Draws --n-uq-samples per geometry; reports per-point std-vs-|error| Pearson correlation and Cd / Cl mean / std / empirical-CI calibration.

scripts/report.py

Builds the tables and Plotly bar charts shown below from summary.yaml artifacts.

Example:

uv run python -m scripts.eval_uq \
    --run-dir outputs/abupt_diffusion/30035_2026-05-11_spk1e \
    --n-uq-samples 10 --sampling-steps 10 \
    --resume-checkpoint best_model.loss.test.total

At eval time eval_uq re-enables surface_normals, surface_area, and use_surface_position_as_input (training omits them) so Cd / Cl integration finds its inputs.

Report

Final test-set L2 errors for the data-space diffusion model (10 Euler sampling steps) versus the deterministic regression baseline. The two runs share the same AB-UPT backbone, geometry budget, and optimizer schedule, so wall-clock training cost and parameter count are roughly matched — the diffusion model just adds the per-anchor noise input and DiT-style time conditioning on top.

[!NOTE] Caveat. The learning rate was not swept for these experiments; it was reused as-is from the regression configuration (5e-5). Flow-matching objectives have a different loss landscape than direct regression and typically tolerate (and benefit from) larger LRs and longer schedules. The numbers below should therefore be read as a lower bound on what the FM model can achieve — a tuned LR and a longer training budget would very likely close, and on the smooth fields possibly reverse, the gap to the regression baseline.

Field

Regression baseline

Diffusion (10 steps)

Δ (diff − base)

Relative

surface_friction

0.0905

0.0973

+0.0069

+7.6%

surface_pressure

0.0473

0.0498

+0.0025

+5.4%

volume_pressure

0.0790

0.0872

+0.0082

+10.4%

volume_velocity

0.0753

0.0800

+0.0047

+6.3%

volume_vorticity

1.1150

0.4173

-0.6977

-62.6%

Diffusion vs regression baseline

The evaluation was on the whole-mesh of the test split of our subsampled DrivAerML testset using only anchors for inference. The difference between baseline and flow-matching models on all primary metrics is marginal. The exception is volume_vorticity, where the diffusion does significantly better. This can mostly attributed to non-ideal hyperparamters of the baseline and high variance of this property.

Sampling-steps sweep

Single-draw deterministic L2 errors (loss/test/<field>_l2err/last) read from the eval_steps<N>/ siblings of the diffusion run — i.e. the same checkpoint sampled with different Euler-step counts. Useful for picking the cheapest step count that still hits the accuracy floor.

Field

1 step

2 steps

3 steps

4 steps

5 steps

8 steps

16 steps

surface_friction

0.2183

0.0963

0.0932

0.0932

0.0938

0.0962

0.1001

surface_pressure

0.1387

0.0522

0.0492

0.0487

0.0488

0.0497

0.0514

volume_pressure

0.2263

0.0893

0.0854

0.0853

0.0858

0.0879

0.0912

volume_velocity

0.1957

0.0769

0.0752

0.0757

0.0766

0.0792

0.0832

volume_vorticity

7.7955

0.3061

0.2545

0.2466

0.2485

0.2659

0.3050

Per-field L2 error vs Euler-step count

One Euler step is far off (the flow can’t resolve volume_vorticity at all), but two steps already collapse the error to near-optimal. The minimum sits at 3–4 steps for every field; beyond that, accuracy drifts back up slightly — the rectified-flow trajectory is approximated more finely, but per-step error accumulates faster than the discretization gain. Rectified flow’s straight-line interpolant is the reason 2 steps already get close: under a perfectly straight trajectory, one Euler step would be exact.

Uncertainty quantification

For each field, we draw 6 samples per geometry (4 Euler steps) and compute the per-point standard deviation across draws. The metric below is the Pearson correlation between that per-point std and the per-point absolute error against ground truth on the test set (loss/chunked_test/<field>_uq_corr/last). A high value means the model’s own spread is a reliable proxy for where it’s wrong — i.e. the uncertainty is calibrated in shape, even if not in scale.

Field

Pearson r(std, |error|)

surface_friction

0.579

surface_pressure

0.369

volume_pressure

0.559

volume_velocity

0.445

volume_vorticity

0.501

Surface friction std vs |error|

All correlations are positive — the ensemble std flags the right regions in every field — but they sit in the 0.37–0.58 range, so the signal is informative rather than precise. surface_friction and volume_pressure calibrate best; surface_pressure is the weakest, consistent with it also being the field where the ensemble mean lags the regression baseline most (+65.5%). The Cd / Cl coverage numbers in the same summary show the complementary scale problem: the integrated-coefficient std is too tight (empirical 1-σ coverage is 0, mean |z| ≈ 8–10), so the per-point shape is useful for triage but the scale would need recalibration (e.g. temperature scaling on the std, or a held-out calibration set) before being used for hard confidence intervals.

Project structure

diffusion_uq/
├── pyproject.toml                       # uv package: noether + aero_cfd as editable deps
├── experiments.py                       # build_abupt_config — single factory for regression + diffusion
├── models/diffusion_abupt.py            # DiffusionABUPT (data-space, DiT-style time modulation)
├── trainer/diffusion_ab_upt_trainer.py  # DiffusionABUPTTrainer + schedule_config
├── callbacks/
│   ├── dataspace_diffusion_chunked_eval.py   # periodic + full-mesh chunked eval
│   └── dataspace_diffusion_uq.py             # ensemble UQ (std-vs-error, Cd/Cl calibration)
├── scripts/
│   ├── train_dataspace_diffusion.py
│   ├── eval_sampling_steps_sweep.py
│   ├── eval_uq.py
│   └── report.py                        # comparison + step-sweep + UQ tables & Plotly bars
├── run_abupt_diffusion.sbatch
└── run_abupt_regression.sbatch

References