Benchmarking and Profiling ========================== OpenTau ships three diagnostic scripts under ``src/opentau/scripts/`` that together pin down where a training run is spending its time. All three read the same ``TrainPipelineConfig`` as ``opentau-train``, so you can point them at any training config JSON and reproduce the exact model / dataset / batch size the real training uses. +----------------------------+------------------------------------------------------------+ | Script | When to reach for it | +============================+============================================================+ | ``profile_step.py`` | Measure per-step wall-clock, broken down into forward / | | | backward / optimizer / sync phases. Use this first when | | | asking "why is training slow?" | +----------------------------+------------------------------------------------------------+ | ``profile_dataloader.py`` | Measure the dataloader-only throughput ceiling (no model, | | | no collective). Use this to rule out input-pipeline | | | starvation before looking at the GPU. | +----------------------------+------------------------------------------------------------+ | ``find_unused_params.py`` | List parameters that receive no gradient during a real | | | forward+backward. Use before setting | | | ``FIND_UNUSED_PARAMS=false`` on a new policy. | +----------------------------+------------------------------------------------------------+ A worked example end-to-end — diagnosing a real "low GPU utilization" issue, ruling out dataloader starvation, and confirming the bottleneck is DeepSpeed's per-parameter hook overhead — is documented in `issue #177 `_. profile_step.py — per-step timing breakdown ------------------------------------------- Mirrors ``opentau-train``'s setup (Accelerator, dataset mixture, policy, optimizer, LR scheduler) and runs a short loop. Splits per-step wall-clock into eight phases and reports mean / median / p95 for each. Basic usage — same launch incantation as ``opentau-train``: .. code-block:: bash accelerate launch \ --config_file configs/examples/accelerate_ddp_config.yaml \ src/opentau/scripts/profile_step.py \ --config_path=configs/libero/reproduce_pi05_libero.json \ --batch_size=12 Example output (8×A100, DDP, pi05 / LIBERO at ``bs=12``): .. code-block:: text =========== profile_step results (rank 0) =========== warmup=20 measured=200 ranks=8 batch_size=12 num_workers=16 prefetch_factor=8 wall-clock over full loop: 265.89s phase stats share ------------------------------------------------------------------------------------------ dataload_wait mean= 1.24ms median= 1.18ms p95= 1.60ms 0.1% forward mean= 378.86ms median= 378.65ms p95= 381.75ms 32.8% bwd mean= 706.32ms median= 704.74ms p95= 710.24ms 61.1% unscale_clip mean= 19.55ms median= 19.43ms p95= 21.12ms 1.7% optim_step mean= 39.22ms median= 39.17ms p95= 39.81ms 3.4% zero_grad_sched mean= 1.79ms median= 1.74ms p95= 2.74ms 0.2% backward_step mean= 766.89ms median= 765.58ms p95= 770.31ms 66.3% sync_gather mean= 9.64ms median= 6.47ms p95= 9.78ms 0.8% total mean=1156.63ms median=1151.73ms p95=1158.28ms <-- total throughput: 0.86 steps/s, 83.0 global samples/s ===================================================== Reading the output: - **High share in** ``dataload_wait`` (> ~5%) means the dataloader is not keeping up with the GPUs. Run ``profile_dataloader.py`` to confirm and raise ``num_workers`` / set ``persistent_workers=True`` / ``enable_cpu_affinity=True`` as appropriate. - **High share in** ``bwd`` **with a large gap between 1-GPU and N-GPU** typically indicates distributed-backend overhead. Try a single-GPU run (``PROFILE_NO_OPTIM=1``) for comparison — if single-GPU ``bwd`` is close to ~2× forward and N-GPU is much larger, the delta is host-side work, not compute. - **High share in** ``optim_step`` is normal for large models; if > 10% and you're on CUDA, make sure ``AdamConfig.fused`` is ``True`` (the default since `PR #176 `_). - **High share in** ``sync_gather`` points at the ``accelerator.gather_for_metrics(...).item()`` calls in ``update_policy``. They run every step (not just every ``log_freq``) and can be gated behind ``log_freq`` if they become a bottleneck. Environment variables (all optional): .. list-table:: :header-rows: 1 :widths: 25 15 60 * - Variable - Default - Effect * - ``PROFILE_STEPS`` - 200 - Number of measured steps (after 20 warmup). ``cfg.steps`` from the training config is intentionally ignored because production configs typically set it to 1M. * - ``PROFILE_NO_OPTIM`` - 0 - When ``1``, skip optimizer creation and ``optimizer.step`` / ``zero_grad`` entirely. Useful for isolating raw forward + backward compute on a single GPU (no Adam state allocated, so a large bf16 model fits without ZeRO partitioning). * - ``FIND_UNUSED_PARAMS`` - true - Toggles DDP's ``find_unused_parameters`` kwarg. Silently ignored under DeepSpeed. Set to ``false`` after auditing the policy with ``find_unused_params.py``. * - ``FUSED_ADAMW`` - (unset) - When ``true`` / ``false``, force-toggle ``torch.optim.AdamW(fused=...)`` for an A/B comparison without editing the optimizer config JSON. Unset leaves the factory default alone. * - ``PROFILE_STEP_JSON`` - (unset) - When set to a file path, rank 0 writes a JSON summary of phase means and medians after the loop. Convenient for scripted A/B sweeps. profile_dataloader.py — dataloader throughput ceiling ----------------------------------------------------- Builds the exact same ``WeightedDatasetMixture.get_dataloader()`` the training loop uses (``num_workers``, ``prefetch_factor``, ``pin_memory``, ``HierarchicalSampler``) and iterates batches with **no model, no optimizer, no collective**. Any slowdown here is pure input-pipeline cost. Run under the same launcher as training so the host CPU sees the real multi-rank × N-worker pressure: .. code-block:: bash accelerate launch \ --config_file configs/examples/accelerate_ddp_config.yaml \ src/opentau/scripts/profile_dataloader.py \ --config_path=configs/libero/reproduce_pi05_libero.json Example output (8 ranks): .. code-block:: text [rank 0/8] fetch=mean= 80.76ms median= 0.20ms p95= 605.39ms | h2d=mean= 0.40ms ... [rank 1/8] fetch=mean= 83.40ms median= 0.16ms p95= 754.82ms | h2d=mean= 0.49ms ... ... one line per rank ... =========== profile_dataloader summary (rank 0) =========== world_size=8 batch_size=12 num_workers=16 prefetch_factor=8 wall-clock over full loop: 27.23s per-rank batches/s (min / mean / max): 11.99 / 12.45 / 12.93 cluster-wide samples/s (ceiling, no model): 597.6 =========================================================== Reading the output: - **Compare** ``cluster-wide samples/s`` **against the** ``samples/s`` **from** ``profile_step.py``. If dataloader throughput is at or below the training step rate, the input pipeline is your bottleneck. If dataloader throughput is comfortably ahead, the bottleneck is GPU-side (forward / backward / optim). - **Bimodal fetch distribution** (median ≈ 0 ms, p95 ≈ hundreds of ms) means you're alternately hitting the prefetch buffer and blocking on worker decode. That's normal — only the mean matters for long-run throughput. Environment variables: .. list-table:: :header-rows: 1 :widths: 25 15 60 * - Variable - Default - Effect * - ``PROFILE_BATCHES`` - 300 - Number of measured batches after 20 warmup. Raise if p95 is high and you want a more stable mean. find_unused_params.py — list parameters DDP would reject -------------------------------------------------------- Runs one forward + backward on a real batch on a single GPU (no DDP, no DeepSpeed) and prints every parameter where ``param.requires_grad`` is ``True`` but ``param.grad`` is ``None`` after backward. Those are exactly the parameters DDP would refuse to sync with ``find_unused_parameters=False``. Run as a plain Python invocation — no ``accelerate launch`` needed: .. code-block:: bash python src/opentau/scripts/find_unused_params.py \ --config_path=configs/libero/reproduce_pi05_libero.json Example output (pi05, after the PR that dropped ``gemma_expert.lm_head``): .. code-block:: text #============================================================================== # pi05 parameter audit — single forward + backward, single GPU # include_zero_grad=False #============================================================================== ========== UNUSED (requires_grad=True, grad is None) — DDP will refuse without find_unused_parameters=True (0 tensors, 0 params) ========== ========== FROZEN (requires_grad=False) — context (8 tensors, 256 params) ========== [normalize_discrete_actions.buffer_actions.max] (1 tensors, 32 params) - normalize_discrete_actions.buffer_actions.max shape=(32,) ... # USED (requires_grad=True, grad is non-trivial): 814 tensors # Tip: if UNUSED list is empty, you can flip # DistributedDataParallelKwargs(find_unused_parameters=False) safely. Recommended workflow: 1. Run ``find_unused_params.py`` on your policy. 2. If ``UNUSED`` is empty, set ``FIND_UNUSED_PARAMS=false`` when launching ``opentau-train`` (or drop the ``DistributedDataParallelKwargs`` kwarg in ``train.py`` for your fork) to reclaim the per-step graph-walk cost. 3. If ``UNUSED`` is non-empty, each reported tensor is either an orphan in the model graph (fix by freezing it or deleting the module) or a parameter that's only conditionally reached (fix by adding an unconditional graph edge, e.g. ``+ 0 * unused_param.sum()`` in the loss). Environment variables: .. list-table:: :header-rows: 1 :widths: 25 15 60 * - Variable - Default - Effect * - ``INCLUDE_ZERO_GRAD`` - false - When ``true``, also list parameters whose grad tensor exists but is all-zero. These are typically paths touched on some batches but producing no learning signal on this one. Usually safe to ignore; useful for auditing conditionally-active heads. Example: a typical benchmarking session --------------------------------------- Given low ``samples/s`` in a training run, a sensible sequence to rule out candidates in order of likelihood: .. code-block:: bash # 1. Is the dataloader keeping up? (~2-10 minutes) accelerate launch \ --config_file configs/examples/accelerate_ddp_config.yaml \ src/opentau/scripts/profile_dataloader.py \ --config_path= # 2. Where does per-step time go? (~4 minutes at 200 steps) accelerate launch \ --config_file configs/examples/accelerate_ddp_config.yaml \ src/opentau/scripts/profile_step.py \ --config_path= \ --batch_size= # 3. Is DDP's find_unused_parameters=True costing you? Only if # backward_step looks unusually high. (~1 minute, single GPU) python src/opentau/scripts/find_unused_params.py \ --config_path= # 4. A/B an optimizer or distributed-backend change without # touching any config file: FUSED_ADAMW=false accelerate launch ... profile_step.py ... FUSED_ADAMW=true accelerate launch ... profile_step.py ... DeepSpeed ZeRO-2 vs ZeRO-3 for pi05 full fine-tuning ---------------------------------------------------- ZeRO-3 shards the model *parameters* across ranks on top of the gradient and optimizer-state sharding that ZeRO-2 already does. That extra sharding pays off only when a single replica of the model does not fit in one GPU's memory — it adds a per-layer parameter all-gather in the forward (and a matching reduce-scatter in the backward) that ZeRO-2 does not need. pi05 is ≈3.3B parameters and fits comfortably replicated on an 80 GB GPU, so ZeRO-3 has nothing to gain and pays the all-gather cost. Measured on **8×A100-80GB, full fine-tuning (no frozen weights; ``freeze_vision_encoder=false``, ``train_expert_only=false``), bf16, sdpa attention, ``use_torch_compile=false``, ``gradient_accumulation_steps=1``, 8 ranks**, with the ``configs/examples/accelerate_deepspeed*`` configs and the pi05 reference policy (2 cameras at 224×224, ``chunk_size=10``, ``predict_response=true``) on ``TensorAuto/libero``: .. list-table:: :header-rows: 1 :widths: 14 14 12 12 14 14 * - Backend - Per-rank batch - Global batch - sec/step - samples/s - Peak GPU mem * - ZeRO-2 - 8 - 64 - 5.64 - 11.4 - 53.3 GiB * - ZeRO-3 - 8 - 64 - 7.71 - 8.3 - 62.8 GiB * - ZeRO-2 - 16 - 128 - 5.29 - 24.2 - 78.7 GiB * - ZeRO-3 - 16 - 128 - 9.86 - 13.0 - 79.2 GiB Both backends OOM at the same per-rank batch size on this hardware (16 fits, 18 OOMs): ZeRO-3 frees ~5 GB of replicated parameters per rank, but its parameter all-gather/prefetch buffers plus the extra allocator fragmentation consume a comparable amount, so the maximum batch is unchanged. At a matched batch size ZeRO-2 is ~1.4× faster at batch 8 and ~1.9× faster at batch 16 (the per-step parameter all-gather is the difference; both keep fp32 master weights and step the optimizer identically). **Recommendation:** use plain DDP (fastest) or ZeRO-2 for pi05 and similarly sized policies. Reach for ZeRO-3 only when a single replica no longer fits per GPU (much larger backbones / many-billion-parameter experts). ZeRO-3 *is* fully supported and validated for pi05 — training, checkpoint save, resume, offline checkpoint consolidation (``convert_checkpoint.sh``), and in-training validation all work — it is simply not the throughput-optimal choice at this model size. To reproduce, run the same config under each accelerate file (both at 8 ranks / ``gradient_accumulation_steps=1``) and read the per-step time from the logs; ``samples/s = per_rank_batch × num_ranks ÷ sec_per_step``: .. code-block:: bash COMMON="--policy.freeze_vision_encoder=false --policy.train_expert_only=false \ --policy.use_torch_compile=false --policy.attention_implementation=sdpa \ --batch_size=16 --dataloader_batch_size=16 --gradient_accumulation_steps=1" # ZeRO-2 accelerate launch --config_file configs/examples/accelerate_deepspeed_config.yaml --num_processes 8 \ src/opentau/scripts/train.py --config_path=configs/examples/pi05_training_config.json $COMMON # ZeRO-3 accelerate launch --config_file configs/examples/accelerate_deepspeed_zero3_config.yaml \ src/opentau/scripts/train.py --config_path=configs/examples/pi05_training_config.json $COMMON .. note:: If a ZeRO-3 run OOMs from fragmentation (the error mentions "reserved but unallocated" memory), set ``PYTORCH_ALLOC_CONF=expandable_segments:True`` in the environment to recover the fragmented blocks.