Benchmarking and Profiling
OpenTau ships three diagnostic scripts under src/opentau/scripts/ that
together pin down where a training run is spending its time. All three read
the same TrainPipelineConfig as opentau-train, so you can point them
at any training config JSON and reproduce the exact model / dataset / batch
size the real training uses.
| Script | When to reach for it |
|---|---|
| profile_step.py | Measure per-step wall-clock, broken down into forward / backward / optimizer / sync phases. Use this first when asking “why is training slow?” |
| profile_dataloader.py | Measure the dataloader-only throughput ceiling (no model, no collective). Use this to rule out input-pipeline starvation before looking at the GPU. |
| find_unused_params.py | List parameters that receive no gradient during a real forward+backward. Use before setting find_unused_parameters=False. |
A worked example end-to-end — diagnosing a real “low GPU utilization” issue, ruling out dataloader starvation, and confirming the bottleneck is DeepSpeed’s per-parameter hook overhead — is documented in issue #177.
profile_step.py — per-step timing breakdown
Mirrors opentau-train’s setup (Accelerator, dataset mixture, policy,
optimizer, LR scheduler) and runs a short loop. Splits per-step
wall-clock into eight phases and reports mean / median / p95 for each.
Basic usage — same launch incantation as opentau-train:
accelerate launch \
--config_file configs/examples/accelerate_ddp_config.yaml \
src/opentau/scripts/profile_step.py \
--config_path=configs/libero/reproduce_pi05_libero.json \
--batch_size=12
Example output (8×A100, DDP, pi05 / LIBERO at bs=12):
=========== profile_step results (rank 0) ===========
warmup=20 measured=200 ranks=8
batch_size=12 num_workers=16 prefetch_factor=8
wall-clock over full loop: 265.89s
phase stats share
------------------------------------------------------------------------------------------
dataload_wait mean= 1.24ms median= 1.18ms p95= 1.60ms 0.1%
forward mean= 378.86ms median= 378.65ms p95= 381.75ms 32.8%
bwd mean= 706.32ms median= 704.74ms p95= 710.24ms 61.1%
unscale_clip mean= 19.55ms median= 19.43ms p95= 21.12ms 1.7%
optim_step mean= 39.22ms median= 39.17ms p95= 39.81ms 3.4%
zero_grad_sched mean= 1.79ms median= 1.74ms p95= 2.74ms 0.2%
backward_step mean= 766.89ms median= 765.58ms p95= 770.31ms 66.3%
sync_gather mean= 9.64ms median= 6.47ms p95= 9.78ms 0.8%
total mean=1156.63ms median=1151.73ms p95=1158.28ms <-- total
throughput: 0.86 steps/s, 83.0 global samples/s
=====================================================
Reading the output:
- High share in dataload_wait (> ~5%) means the dataloader is not keeping up with the GPUs. Run profile_dataloader.py to confirm and raise num_workers / set persistent_workers=True / enable_cpu_affinity=True as appropriate.
- High share in bwd with a large gap between 1-GPU and N-GPU typically indicates distributed-backend overhead. Try a single-GPU run (PROFILE_NO_OPTIM=1) for comparison: if single-GPU bwd is close to ~2× forward and N-GPU is much larger, the delta is host-side work, not compute.
- High share in optim_step is normal for large models; if it exceeds 10% and you’re on CUDA, make sure AdamConfig.fused is True (the default since PR #176).
- High share in sync_gather points at the accelerator.gather_for_metrics(...).item() calls in update_policy. They run every step (not just every log_freq) and can be gated behind log_freq if they become a bottleneck.
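The thresholds above can be checked mechanically. The following is a minimal sketch, not part of OpenTau: it hard-codes the phase means from the example output, recomputes each phase's share, and applies the two numeric thresholds mentioned in the list (5% for dataload_wait, 10% for optim_step). The function names `shares` and `triage` are illustrative. In this example the four top-level phases (dataload_wait, forward, backward_step, sync_gather) add up to the reported total, and bwd, unscale_clip, optim_step and zero_grad_sched add up to backward_step to within rounding.

```python
# Phase means in milliseconds, copied from the example output above.
TOP_LEVEL_MS = {
    "dataload_wait": 1.24,
    "forward": 378.86,
    "backward_step": 766.89,
    "sync_gather": 9.64,
}
# Sub-phases that, in this example, add up to backward_step.
BACKWARD_SUB_MS = {
    "bwd": 706.32,
    "unscale_clip": 19.55,
    "optim_step": 39.22,
    "zero_grad_sched": 1.79,
}


def shares(phase_ms, total_ms):
    """Return each phase's percentage of the total step time."""
    return {name: 100.0 * ms / total_ms for name, ms in phase_ms.items()}


def triage(top_ms, sub_ms):
    """Compute shares and collect hints for the thresholds named in the docs."""
    total = sum(top_ms.values())  # top-level phases cover the whole step
    pct = shares({**top_ms, **sub_ms}, total)
    hints = []
    if pct["dataload_wait"] > 5.0:
        hints.append("dataload_wait > 5%: run profile_dataloader.py")
    if pct["optim_step"] > 10.0:
        hints.append("optim_step > 10%: check AdamConfig.fused")
    return pct, hints


pct, hints = triage(TOP_LEVEL_MS, BACKWARD_SUB_MS)
for name, value in pct.items():
    print(f"{name:16s} {value:5.1f}%")
print(hints or "no threshold tripped")
```

For the example run neither threshold trips, which matches the reading that the time is dominated by forward and backward compute.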
Environment variables (all optional):
| Variable | Default | Effect |
|---|---|---|
|  | 200 | Number of measured steps (after 20 warmup). |
|  | 0 | When |
|  | true | Toggles DDP’s |
|  | (unset) | When |
|  | (unset) | When set to a file path, rank 0 writes a JSON summary of phase means and medians after the loop. Convenient for scripted A/B sweeps. |
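For scripted A/B sweeps, two such JSON summaries can be diffed with a few lines of Python. This is only a sketch: the layout of the summary file is not documented here, so the code assumes a hypothetical schema in which each phase name maps to an object with a `mean_ms` field. Adjust `load_means` to whatever the script actually writes. The helper names are illustrative as well.

```python
import json


def load_means(path):
    """Read a summary file; ASSUMES {"<phase>": {"mean_ms": <float>, ...}}."""
    with open(path) as f:
        summary = json.load(f)
    return {phase: stats["mean_ms"] for phase, stats in summary.items()}


def compare(baseline, candidate):
    """Return (phase, baseline_ms, candidate_ms, percent_change) rows."""
    rows = []
    for phase in sorted(baseline.keys() & candidate.keys()):
        base, cand = baseline[phase], candidate[phase]
        if base == 0:
            continue  # avoid dividing by zero for phases that took no time
        rows.append((phase, base, cand, 100.0 * (cand - base) / base))
    return rows


# Inline stand-ins for two loaded summaries (made-up numbers for illustration).
baseline = {"forward": 380.0, "optim_step": 40.0}
candidate = {"forward": 380.0, "optim_step": 30.0}
for phase, base, cand, change in compare(baseline, candidate):
    print(f"{phase:12s} {base:8.2f} -> {cand:8.2f} ms ({change:+.1f}%)")
```

With real files, the two dictionaries would come from `load_means("run_a.json")` and `load_means("run_b.json")`, where the file names are whatever paths were passed to the two runs.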
profile_dataloader.py — dataloader throughput ceiling
Builds the exact same WeightedDatasetMixture.get_dataloader() the
training loop uses (num_workers, prefetch_factor, pin_memory,
HierarchicalSampler) and iterates batches with no model, no
optimizer, no collective. Any slowdown here is pure input-pipeline
cost.
Run under the same launcher as training so the host CPU sees the real multi-rank × N-worker pressure:
accelerate launch \
--config_file configs/examples/accelerate_ddp_config.yaml \
src/opentau/scripts/profile_dataloader.py \
--config_path=configs/libero/reproduce_pi05_libero.json
Example output (8 ranks):
[rank 0/8] fetch=mean= 80.76ms median= 0.20ms p95= 605.39ms | h2d=mean= 0.40ms ...
[rank 1/8] fetch=mean= 83.40ms median= 0.16ms p95= 754.82ms | h2d=mean= 0.49ms ...
... one line per rank ...
=========== profile_dataloader summary (rank 0) ===========
world_size=8 batch_size=12 num_workers=16 prefetch_factor=8
wall-clock over full loop: 27.23s
per-rank batches/s (min / mean / max): 11.99 / 12.45 / 12.93
cluster-wide samples/s (ceiling, no model): 597.6
===========================================================
Reading the output:
- Compare cluster-wide samples/s against the samples/s from profile_step.py. If dataloader throughput is at or below the training step rate, the input pipeline is your bottleneck. If dataloader throughput is comfortably ahead, the bottleneck is GPU-side (forward / backward / optim).
- Bimodal fetch distribution (median ≈ 0 ms, p95 ≈ hundreds of ms) means you’re alternately hitting the prefetch buffer and blocking on worker decode. That’s normal; only the mean matters for long-run throughput.
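The comparison in the first point is a single division. The sketch below plugs in the two example figures from this page (597.6 samples/s from profile_dataloader.py, 83.0 global samples/s from profile_step.py), assuming both come from the same 8-rank, batch-size-12 setup. The `comfortable` cutoff of 2.0 is an arbitrary illustration of "comfortably ahead", not a value taken from OpenTau.

```python
def classify_bottleneck(dataloader_sps, training_sps, comfortable=2.0):
    """Compare the dataloader ceiling with the training rate (samples/s)."""
    ratio = dataloader_sps / training_sps
    if ratio <= 1.0:
        verdict = "input pipeline"  # the loader cannot feed the step rate
    elif ratio < comfortable:
        verdict = "marginal"        # ahead, but with little headroom
    else:
        verdict = "GPU-side"        # the loader is comfortably ahead
    return ratio, verdict


ratio, verdict = classify_bottleneck(597.6, 83.0)
print(f"{ratio:.1f}x headroom -> {verdict}")
```

With the example numbers the dataloader ceiling is roughly seven times the training rate, so the input pipeline is ruled out.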
Environment variables:
| Variable | Default | Effect |
|---|---|---|
|  | 300 | Number of measured batches after 20 warmup. Raise if p95 is high and you want a more stable mean. |
find_unused_params.py — list parameters DDP would reject
Runs one forward + backward on a real batch on a single GPU (no DDP, no
DeepSpeed) and prints every parameter where param.requires_grad
is True but param.grad is None after backward. Those are
exactly the parameters DDP would refuse to sync with
find_unused_parameters=False.
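The classification rule described above is small enough to sketch. The snippet below is an illustration of that rule, not the script's actual code: it buckets parameters by `requires_grad` and `grad`, and works on any iterable of `(name, param)` pairs. With PyTorch you would pass `model.named_parameters()` after calling `loss.backward()`. The `SimpleNamespace` objects here are stand-ins so the example runs without a GPU or a model.

```python
from types import SimpleNamespace


def classify_params(named_params):
    """Split parameters into unused / frozen / used lists of names."""
    unused, frozen, used = [], [], []
    for name, param in named_params:
        if not param.requires_grad:
            frozen.append(name)   # excluded from training on purpose
        elif param.grad is None:
            unused.append(name)   # trainable, but backward never reached it
        else:
            used.append(name)     # trainable and received a gradient
    return unused, frozen, used


# Stand-in parameters (hypothetical names) mimicking the state after backward.
demo = [
    ("encoder.weight", SimpleNamespace(requires_grad=True, grad=0.1)),
    ("orphan_head.weight", SimpleNamespace(requires_grad=True, grad=None)),
    ("norm.buffer", SimpleNamespace(requires_grad=False, grad=None)),
]
unused, frozen, used = classify_params(demo)
print("UNUSED:", unused)
print("FROZEN:", frozen)
print("USED:  ", used)
```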
Run as a plain Python invocation — no accelerate launch needed:
python src/opentau/scripts/find_unused_params.py \
--config_path=configs/libero/reproduce_pi05_libero.json
Example output (pi05, after the PR that dropped gemma_expert.lm_head):
#==============================================================================
# pi05 parameter audit — single forward + backward, single GPU
# include_zero_grad=False
#==============================================================================
========== UNUSED (requires_grad=True, grad is None) — DDP will refuse without
find_unused_parameters=True (0 tensors, 0 params) ==========
========== FROZEN (requires_grad=False) — context (8 tensors, 256 params) ==========
[normalize_discrete_actions.buffer_actions.max] (1 tensors, 32 params)
- normalize_discrete_actions.buffer_actions.max shape=(32,)
...
# USED (requires_grad=True, grad is non-trivial): 814 tensors
# Tip: if UNUSED list is empty, you can flip
# DistributedDataParallelKwargs(find_unused_parameters=False) safely.
Recommended workflow:
1. Run find_unused_params.py on your policy.
2. If UNUSED is empty, set FIND_UNUSED_PARAMS=false when launching opentau-train (or drop the DistributedDataParallelKwargs kwarg in train.py for your fork) to reclaim the per-step graph-walk cost.
3. If UNUSED is non-empty, each reported tensor is either an orphan in the model graph (fix by freezing it or deleting the module) or a parameter that’s only conditionally reached (fix by adding an unconditional graph edge, e.g. + 0 * unused_param.sum() in the loss).
Environment variables:
| Variable | Default | Effect |
|---|---|---|
|  | false | When |
Example: a typical benchmarking session
Given low samples/s in a training run, a sensible sequence is to rule
out candidates in order of likelihood:
# 1. Is the dataloader keeping up? (~2-10 minutes)
accelerate launch \
--config_file configs/examples/accelerate_ddp_config.yaml \
src/opentau/scripts/profile_dataloader.py \
--config_path=<your_config.json>
# 2. Where does per-step time go? (~4 minutes at 200 steps)
accelerate launch \
--config_file configs/examples/accelerate_ddp_config.yaml \
src/opentau/scripts/profile_step.py \
--config_path=<your_config.json> \
--batch_size=<your_bs>
# 3. Is DDP's find_unused_parameters=True costing you? Only if
# backward_step looks unusually high. (~1 minute, single GPU)
python src/opentau/scripts/find_unused_params.py \
--config_path=<your_config.json>
# 4. A/B an optimizer or distributed-backend change without
# touching any config file:
FUSED_ADAMW=false accelerate launch ... profile_step.py ...
FUSED_ADAMW=true accelerate launch ... profile_step.py ...
DeepSpeed ZeRO-2 vs ZeRO-3 for pi05 full fine-tuning
ZeRO-3 shards the model parameters across ranks on top of the gradient and optimizer-state sharding that ZeRO-2 already does. That extra sharding pays off only when a single replica of the model does not fit in one GPU’s memory — it adds a per-layer parameter all-gather in the forward (and a matching reduce-scatter in the backward) that ZeRO-2 does not need. pi05 is ≈3.3B parameters and fits comfortably replicated on an 80 GB GPU, so ZeRO-3 has nothing to gain and pays the all-gather cost.
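A back-of-the-envelope calculation shows how little parameter memory is at stake. The sketch below assumes the working copy of the weights is held in bf16 (2 bytes per parameter) and ignores buffers, gradients and optimizer state, which ZeRO-2 already shards. Both are simplifying assumptions made here for illustration, not measurements.

```python
GIB = 1024 ** 3


def replicated_param_bytes(num_params, bytes_per_param=2):
    """Memory for one full copy of the weights (bf16 assumed: 2 bytes each)."""
    return num_params * bytes_per_param


def zero3_param_savings_bytes(num_params, world_size, bytes_per_param=2):
    """Per-rank memory freed when parameters are sharded across world_size ranks."""
    full = replicated_param_bytes(num_params, bytes_per_param)
    return full * (world_size - 1) / world_size


PI05_PARAMS = 3.3e9  # approximate parameter count quoted above
full_gib = replicated_param_bytes(PI05_PARAMS) / GIB
saved_gib = zero3_param_savings_bytes(PI05_PARAMS, world_size=8) / GIB
print(f"replicated weights: {full_gib:.1f} GiB per rank")
print(f"freed by ZeRO-3 sharding on 8 ranks: {saved_gib:.1f} GiB per rank")
```

Under these assumptions the replicated weights occupy a single-digit number of GiB on an 80 GB card, so sharding them frees only a small fraction of the device.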
Measured on 8×A100-80GB, full fine-tuning (no frozen weights;
freeze_vision_encoder=false, train_expert_only=false), bf16, sdpa
attention, use_torch_compile=false, gradient_accumulation_steps=1, 8
ranks, with the configs/examples/accelerate_deepspeed* configs and the
pi05 reference policy (2 cameras at 224×224, chunk_size=10,
predict_response=true) on TensorAuto/libero:
| Backend | Per-rank batch | Global batch | sec/step | samples/s | Peak GPU mem |
|---|---|---|---|---|---|
| ZeRO-2 | 8 | 64 | 5.64 | 11.4 | 53.3 GiB |
| ZeRO-3 | 8 | 64 | 7.71 | 8.3 | 62.8 GiB |
| ZeRO-2 | 16 | 128 | 5.29 | 24.2 | 78.7 GiB |
| ZeRO-3 | 16 | 128 | 9.86 | 13.0 | 79.2 GiB |
Both backends OOM at the same per-rank batch size on this hardware (16 fits, 18 OOMs): ZeRO-3 frees ~5 GB of replicated parameters per rank, but its parameter all-gather/prefetch buffers plus the extra allocator fragmentation consume a comparable amount, so the maximum batch is unchanged. At a matched batch size ZeRO-2 is ~1.4× faster at batch 8 and ~1.9× faster at batch 16 (the per-step parameter all-gather is the difference; both keep fp32 master weights and step the optimizer identically).
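The throughput and speedup figures follow directly from the sec/step column. As a quick check, the sketch below recomputes samples/s as per-rank batch × ranks ÷ sec/step and the ZeRO-2 speedup as the ratio of step times. The results agree with the table to within rounding of the published sec/step values.

```python
NUM_RANKS = 8

# (backend, per-rank batch, sec/step) copied from the table above.
RUNS = [
    ("ZeRO-2", 8, 5.64),
    ("ZeRO-3", 8, 7.71),
    ("ZeRO-2", 16, 5.29),
    ("ZeRO-3", 16, 9.86),
]


def samples_per_s(per_rank_batch, num_ranks, sec_per_step):
    """Global samples processed per second."""
    return per_rank_batch * num_ranks / sec_per_step


def zero2_speedup(per_rank_batch):
    """How many times faster ZeRO-2 is than ZeRO-3 at a matched batch size."""
    step = {backend: sec for backend, batch, sec in RUNS if batch == per_rank_batch}
    return step["ZeRO-3"] / step["ZeRO-2"]


for backend, batch, sec in RUNS:
    rate = samples_per_s(batch, NUM_RANKS, sec)
    print(f"{backend} bs={batch:2d}: {rate:5.1f} samples/s")
for batch in (8, 16):
    print(f"bs={batch:2d}: ZeRO-2 is {zero2_speedup(batch):.2f}x faster")
```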
Recommendation: use plain DDP (fastest) or ZeRO-2 for pi05 and similarly
sized policies. Reach for ZeRO-3 only when a single replica no longer fits per
GPU (much larger backbones / many-billion-parameter experts). ZeRO-3 is fully
supported and validated for pi05 — training, checkpoint save, resume, offline
checkpoint consolidation (convert_checkpoint.sh), and in-training validation
all work — it is simply not the throughput-optimal choice at this model size.
To reproduce, run the same config under each accelerate file (both at 8 ranks /
gradient_accumulation_steps=1) and read the per-step time from the logs;
samples/s = per_rank_batch × num_ranks ÷ sec_per_step:
COMMON="--policy.freeze_vision_encoder=false --policy.train_expert_only=false \
--policy.use_torch_compile=false --policy.attention_implementation=sdpa \
--batch_size=16 --dataloader_batch_size=16 --gradient_accumulation_steps=1"
# ZeRO-2
accelerate launch --config_file configs/examples/accelerate_deepspeed_config.yaml --num_processes 8 \
src/opentau/scripts/train.py --config_path=configs/examples/pi05_training_config.json $COMMON
# ZeRO-3
accelerate launch --config_file configs/examples/accelerate_deepspeed_zero3_config.yaml \
src/opentau/scripts/train.py --config_path=configs/examples/pi05_training_config.json $COMMON
Note
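If a ZeRO-3 run OOMs from fragmentation (the error mentions
“reserved but unallocated” memory), set
PYTORCH_ALLOC_CONF=expandable_segments:True in the environment to recover
the fragmented blocks. PyTorch releases that predate this variable name read
the same setting from PYTORCH_CUDA_ALLOC_CONF.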