Benchmarking and Profiling

OpenTau ships three diagnostic scripts under src/opentau/scripts/ that together pin down where a training run is spending its time. All three read the same TrainPipelineConfig as opentau-train, so you can point them at any training config JSON and reproduce the exact model / dataset / batch size the real training uses.

Script	When to reach for it
`profile_step.py`	Measure per-step wall-clock, broken down into forward / backward / optimizer / sync phases. Use this first when asking “why is training slow?”
`profile_dataloader.py`	Measure the dataloader-only throughput ceiling (no model, no collective). Use this to rule out input-pipeline starvation before looking at the GPU.
`find_unused_params.py`	List parameters that receive no gradient during a real forward+backward. Use before setting `FIND_UNUSED_PARAMS=false` on a new policy.

A worked example end-to-end — diagnosing a real “low GPU utilization” issue, ruling out dataloader starvation, and confirming the bottleneck is DeepSpeed’s per-parameter hook overhead — is documented in issue #177.

profile_step.py — per-step timing breakdown

Mirrors opentau-train’s setup (Accelerator, dataset mixture, policy, optimizer, LR scheduler) and runs a short loop. Splits per-step wall-clock into eight phases and reports mean / median / p95 for each.

Basic usage — same launch incantation as opentau-train:

accelerate launch \
    --config_file configs/examples/accelerate_ddp_config.yaml \
    src/opentau/scripts/profile_step.py \
    --config_path=configs/libero/reproduce_pi05_libero.json \
    --batch_size=12

Example output (8×A100, DDP, pi05 / LIBERO at bs=12):

=========== profile_step results (rank 0) ===========
warmup=20 measured=200 ranks=8
batch_size=12 num_workers=16 prefetch_factor=8
wall-clock over full loop: 265.89s

phase            stats                                                           share
------------------------------------------------------------------------------------------
dataload_wait    mean=   1.24ms  median=   1.18ms  p95=   1.60ms                 0.1%
forward          mean= 378.86ms  median= 378.65ms  p95= 381.75ms                32.8%
bwd              mean= 706.32ms  median= 704.74ms  p95= 710.24ms                61.1%
unscale_clip     mean=  19.55ms  median=  19.43ms  p95=  21.12ms                 1.7%
optim_step       mean=  39.22ms  median=  39.17ms  p95=  39.81ms                 3.4%
zero_grad_sched  mean=   1.79ms  median=   1.74ms  p95=   2.74ms                 0.2%
backward_step    mean= 766.89ms  median= 765.58ms  p95= 770.31ms                66.3%
sync_gather      mean=   9.64ms  median=   6.47ms  p95=   9.78ms                 0.8%
total            mean=1156.63ms  median=1151.73ms  p95=1158.28ms                <-- total

throughput: 0.86 steps/s, 83.0 global samples/s
=====================================================

Reading the output:

High share in dataload_wait (> ~5%) means the dataloader is not keeping up with the GPUs. Run profile_dataloader.py to confirm and raise num_workers / set persistent_workers=True / enable_cpu_affinity=True as appropriate.
High share in bwd with a large gap between 1-GPU and N-GPU typically indicates distributed-backend overhead. Try a single-GPU run (PROFILE_NO_OPTIM=1) for comparison — if single-GPU bwd is close to ~2× forward and N-GPU is much larger, the delta is host-side work, not compute.
High share in optim_step is normal for large models; if > 10% and you’re on CUDA, make sure AdamConfig.fused is True (the default since PR #176).
High share in sync_gather points at the accelerator.gather_for_metrics(...).item() calls in update_policy. They run every step (not just every log_freq) and can be gated behind log_freq if they become a bottleneck.

Environment variables (all optional):

Variable	Default	Effect
`PROFILE_STEPS`	200	Number of measured steps (after 20 warmup). `cfg.steps` from the training config is intentionally ignored because production configs typically set it to 1M.
`PROFILE_NO_OPTIM`	0	When `1`, skip optimizer creation and `optimizer.step` / `zero_grad` entirely. Useful for isolating raw forward + backward compute on a single GPU (no Adam state allocated, so a large bf16 model fits without ZeRO partitioning).
`FIND_UNUSED_PARAMS`	true	Toggles DDP’s `find_unused_parameters` kwarg. Silently ignored under DeepSpeed. Set to `false` after auditing the policy with `find_unused_params.py`.
`FUSED_ADAMW`	(unset)	When `true` / `false`, force-toggle `torch.optim.AdamW(fused=...)` for an A/B comparison without editing the optimizer config JSON. Unset leaves the factory default alone.
`PROFILE_STEP_JSON`	(unset)	When set to a file path, rank 0 writes a JSON summary of phase means and medians after the loop. Convenient for scripted A/B sweeps.

profile_dataloader.py — dataloader throughput ceiling

Builds the exact same WeightedDatasetMixture.get_dataloader() the training loop uses (num_workers, prefetch_factor, pin_memory, HierarchicalSampler) and iterates batches with no model, no optimizer, no collective. Any slowdown here is pure input-pipeline cost.

Run under the same launcher as training so the host CPU sees the real multi-rank × N-worker pressure:

accelerate launch \
    --config_file configs/examples/accelerate_ddp_config.yaml \
    src/opentau/scripts/profile_dataloader.py \
    --config_path=configs/libero/reproduce_pi05_libero.json

Example output (8 ranks):

[rank 0/8] fetch=mean=  80.76ms median=   0.20ms p95= 605.39ms | h2d=mean= 0.40ms ...
[rank 1/8] fetch=mean=  83.40ms median=   0.16ms p95= 754.82ms | h2d=mean= 0.49ms ...
... one line per rank ...

=========== profile_dataloader summary (rank 0) ===========
world_size=8 batch_size=12 num_workers=16 prefetch_factor=8
wall-clock over full loop: 27.23s
per-rank batches/s (min / mean / max): 11.99 / 12.45 / 12.93
cluster-wide samples/s (ceiling, no model): 597.6
===========================================================

Reading the output:

Compare cluster-wide samples/s against the samples/s from profile_step.py. If dataloader throughput is at or below the training step rate, the input pipeline is your bottleneck. If dataloader throughput is comfortably ahead, the bottleneck is GPU-side (forward / backward / optim).
Bimodal fetch distribution (median ≈ 0 ms, p95 ≈ hundreds of ms) means you’re alternately hitting the prefetch buffer and blocking on worker decode. That’s normal — only the mean matters for long-run throughput.

Environment variables:

Variable	Default	Effect
`PROFILE_BATCHES`	300	Number of measured batches after 20 warmup. Raise if p95 is high and you want a more stable mean.

find_unused_params.py — list parameters DDP would reject

Runs one forward + backward on a real batch on a single GPU (no DDP, no DeepSpeed) and prints every parameter where param.requires_grad is True but param.grad is None after backward. Those are exactly the parameters DDP would refuse to sync with find_unused_parameters=False.

Run as a plain Python invocation — no accelerate launch needed:

python src/opentau/scripts/find_unused_params.py \
    --config_path=configs/libero/reproduce_pi05_libero.json

Example output (pi05, after the PR that dropped gemma_expert.lm_head):

#==============================================================================
# pi05 parameter audit — single forward + backward, single GPU
# include_zero_grad=False
#==============================================================================

========== UNUSED (requires_grad=True, grad is None) — DDP will refuse without
                  find_unused_parameters=True (0 tensors, 0 params) ==========

========== FROZEN (requires_grad=False) — context (8 tensors, 256 params) ==========
  [normalize_discrete_actions.buffer_actions.max]  (1 tensors, 32 params)
    - normalize_discrete_actions.buffer_actions.max  shape=(32,)
  ...

# USED (requires_grad=True, grad is non-trivial): 814 tensors
# Tip: if UNUSED list is empty, you can flip
#  DistributedDataParallelKwargs(find_unused_parameters=False) safely.

Recommended workflow:

Run find_unused_params.py on your policy.
If UNUSED is empty, set FIND_UNUSED_PARAMS=false when launching opentau-train (or drop the DistributedDataParallelKwargs kwarg in train.py for your fork) to reclaim the per-step graph-walk cost.
If UNUSED is non-empty, each reported tensor is either an orphan in the model graph (fix by freezing it or deleting the module) or a parameter that’s only conditionally reached (fix by adding an unconditional graph edge, e.g. + 0 * unused_param.sum() in the loss).

Environment variables:

Variable	Default	Effect
`INCLUDE_ZERO_GRAD`	false	When `true`, also list parameters whose grad tensor exists but is all-zero. These are typically paths touched on some batches but producing no learning signal on this one. Usually safe to ignore; useful for auditing conditionally-active heads.

Example: a typical benchmarking session

Given low samples/s in a training run, a sensible sequence to rule out candidates in order of likelihood:

# 1. Is the dataloader keeping up? (~2-10 minutes)
accelerate launch \
    --config_file configs/examples/accelerate_ddp_config.yaml \
    src/opentau/scripts/profile_dataloader.py \
    --config_path=<your_config.json>

# 2. Where does per-step time go? (~4 minutes at 200 steps)
accelerate launch \
    --config_file configs/examples/accelerate_ddp_config.yaml \
    src/opentau/scripts/profile_step.py \
    --config_path=<your_config.json> \
    --batch_size=<your_bs>

# 3. Is DDP's find_unused_parameters=True costing you? Only if
#    backward_step looks unusually high. (~1 minute, single GPU)
python src/opentau/scripts/find_unused_params.py \
    --config_path=<your_config.json>

# 4. A/B an optimizer or distributed-backend change without
#    touching any config file:
FUSED_ADAMW=false accelerate launch ... profile_step.py ...
FUSED_ADAMW=true  accelerate launch ... profile_step.py ...

DeepSpeed ZeRO-2 vs ZeRO-3 for pi05 full fine-tuning

ZeRO-3 shards the model parameters across ranks on top of the gradient and optimizer-state sharding that ZeRO-2 already does. That extra sharding pays off only when a single replica of the model does not fit in one GPU’s memory — it adds a per-layer parameter all-gather in the forward (and a matching reduce-scatter in the backward) that ZeRO-2 does not need. pi05 is ≈3.3B parameters and fits comfortably replicated on an 80 GB GPU, so ZeRO-3 has nothing to gain and pays the all-gather cost.

Measured on 8×A100-80GB, full fine-tuning (no frozen weights; ``freeze_vision_encoder=false``, ``train_expert_only=false``), bf16, sdpa attention, ``use_torch_compile=false``, ``gradient_accumulation_steps=1``, 8 ranks, with the configs/examples/accelerate_deepspeed* configs and the pi05 reference policy (2 cameras at 224×224, chunk_size=10, predict_response=true) on TensorAuto/libero:

Backend	Per-rank batch	Global batch	sec/step	samples/s	Peak GPU mem
ZeRO-2	8	64	5.64	11.4	53.3 GiB
ZeRO-3	8	64	7.71	8.3	62.8 GiB
ZeRO-2	16	128	5.29	24.2	78.7 GiB
ZeRO-3	16	128	9.86	13.0	79.2 GiB

Both backends OOM at the same per-rank batch size on this hardware (16 fits, 18 OOMs): ZeRO-3 frees ~5 GB of replicated parameters per rank, but its parameter all-gather/prefetch buffers plus the extra allocator fragmentation consume a comparable amount, so the maximum batch is unchanged. At a matched batch size ZeRO-2 is ~1.4× faster at batch 8 and ~1.9× faster at batch 16 (the per-step parameter all-gather is the difference; both keep fp32 master weights and step the optimizer identically).

Recommendation: use plain DDP (fastest) or ZeRO-2 for pi05 and similarly sized policies. Reach for ZeRO-3 only when a single replica no longer fits per GPU (much larger backbones / many-billion-parameter experts). ZeRO-3 is fully supported and validated for pi05 — training, checkpoint save, resume, offline checkpoint consolidation (convert_checkpoint.sh), and in-training validation all work — it is simply not the throughput-optimal choice at this model size.

To reproduce, run the same config under each accelerate file (both at 8 ranks / gradient_accumulation_steps=1) and read the per-step time from the logs; samples/s = per_rank_batch × num_ranks ÷ sec_per_step:

COMMON="--policy.freeze_vision_encoder=false --policy.train_expert_only=false \
    --policy.use_torch_compile=false --policy.attention_implementation=sdpa \
    --batch_size=16 --dataloader_batch_size=16 --gradient_accumulation_steps=1"

# ZeRO-2
accelerate launch --config_file configs/examples/accelerate_deepspeed_config.yaml --num_processes 8 \
    src/opentau/scripts/train.py --config_path=configs/examples/pi05_training_config.json $COMMON

# ZeRO-3
accelerate launch --config_file configs/examples/accelerate_deepspeed_zero3_config.yaml \
    src/opentau/scripts/train.py --config_path=configs/examples/pi05_training_config.json $COMMON

Note

If a ZeRO-3 run OOMs from fragmentation (the error mentions “reserved but unallocated” memory), set PYTORCH_ALLOC_CONF=expandable_segments:True in the environment to recover the fragmented blocks.