Fix slow training, low GPU utilization, OOMs, bad multi-GPU scaling, and LLM inference bottlenecks.
English | 中文
This repository is for engineers who already know their model should be faster, but need a practical path to find the bottleneck and fix it.
Use it in two ways:
- Read the playbook when you need a prioritized GPU optimization checklist.
- Install the included Claude skill when you want Claude to reason from the same playbook automatically.
If this saves you debugging time, star the repo so more people can find it.
- Training loops that are slow despite high-end GPUs
- GPUs waiting on the CPU or DataLoader
- OOMs caused by activations, optimizer state, or shape variance
- Multi-GPU jobs with poor scaling efficiency
- LLM inference stacks with bad latency, throughput, or KV cache strategy
- Code reviews where performance bugs hide behind "correct" code
Philosophy: measure first, classify the bottleneck, apply the smallest fix that matters, then verify.
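"Measure first" usually starts with the PyTorch profiler. A minimal, CPU-runnable sketch (the `Linear` model and shapes are placeholders; on a GPU box, add `ProfilerActivity.CUDA` to the activities list):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(256, 256)
x = torch.randn(64, 256)

# CPU-only here for portability; add ProfilerActivity.CUDA on a GPU machine
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        model(x)

# Sort by total time to see what dominates before optimizing anything
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(report)
```

Only after the table tells you where time actually goes do you pick a fix from the guides below.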
| Asset | Why It Matters |
|---|---|
| Training Optimization | P0/P1 defaults for PyTorch training: AMP/bf16, torch.compile, data pipelines, memory, checkpointing |
| Inference Optimization | TTFT/ITL, KV cache, continuous batching, quantization, serving tradeoffs |
| Distributed Training | DDP, FSDP/ZeRO, comm overlap, topology awareness, NCCL tuning |
| Kernel & Low-Level | Triton, FlashAttention, fusion, memory access, roofline thinking |
| Profiling & Tools | Nsight Systems, Nsight Compute, PyTorch Profiler, MFU, diagnostics |
| Anti-Patterns & Checklist | 12 common mistakes, PR review prompts, pre-commit checks |
| Claude Skill | Uploadable skill package for Claude.ai |
- Download `skill/best-gpu-perf.skill`.
- Open Claude.ai -> Settings -> Skills.
- Click "Add Skill" and upload the file.
Then try prompts like:
- "My training is slow on A100. Give me the highest-ROI things to check first."
- "Review this PyTorch training loop for GPU performance issues."
- "I'm hitting OOM on a 7B model. Walk me through the memory budget and likely fixes."
- "My multi-GPU scaling falls off after 4 GPUs. What should I profile?"
- "Help me optimize LLM inference latency and throughput for a vLLM-style stack."
If you want to customize the skill, the source lives in skill/SKILL.md. Rebuild the packaged artifact with:
```shell
./scripts/build_skill.sh
```

| If your problem is... | Read this first |
|---|---|
| Slow training | docs/training.md |
| Slow inference / poor throughput | docs/inference.md |
| Bad DDP/FSDP scaling | docs/distributed.md |
| Kernel hot spots / fusion questions | docs/kernel.md |
| You need profiler commands and interpretation help | docs/profiling.md |
| You want a fast review checklist | docs/checklist.md |
- PyTorch engineers who need faster training without rewriting everything
- LLM infra and serving engineers debugging latency, throughput, or KV cache behavior
- Researchers who want a default optimization checklist before touching custom kernels
- Reviewers who want to catch performance regressions in PRs
- `docs/`: topic-based reference guides
- `skill/SKILL.md`: source for the Claude skill
- `skill/best-gpu-perf.skill`: packaged skill artifact for upload
- `scripts/build_skill.sh`: rebuilds the packaged skill from repo sources
Before doing anything else, make sure your code has all of these. They're low-risk, high-reward, and apply to almost every PyTorch project:
```python
import torch
from torch.utils.data import DataLoader

# 1. Enable TF32 (free speed on Ampere+)
torch.set_float32_matmul_precision("high")

# 2. DataLoader: don't let the GPU wait for data
loader = DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=4,            # NOT 0
    pin_memory=True,          # Faster CPU→GPU transfer
    persistent_workers=True,  # Don't re-fork workers every epoch
)

# 3. Compile the model
model = torch.compile(model.cuda())

# 4. Fused optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, fused=True)

# 5. Training loop with AMP + best practices
for batch in loader:
    batch = {k: v.cuda(non_blocking=True) for k, v in batch.items()}
    optimizer.zero_grad(set_to_none=True)  # Not zero_grad()
    with torch.autocast("cuda", dtype=torch.bfloat16):  # 6. Mixed precision
        loss = model(**batch)
    loss.backward()
    optimizer.step()
    # ❌ Don't: print(loss), call loss.item() every step, or loss.cpu()
```

Also check:
- Hidden dims / vocab size are multiples of 64
- Attention uses `F.scaled_dot_product_attention`, not manual `QK^T V`
- Eval uses `@torch.inference_mode()`
- No `.item()` / `.cpu()` / `print(tensor)` in the training hot path
👉 Full checklist with 12 anti-patterns and PR template → docs/checklist.md
```
GPU utilization < 50%?
├─ CPU at 100%? → DataLoader bottleneck → more workers, pin_memory, pre-cache
└─ CPU idle too? → Launch overhead → torch.compile, larger batch, CUDA Graphs

GPU util 50–80%?
├─ Many tiny kernels → Fusion needed: torch.compile, FlashAttention
└─ HBM bandwidth saturated → Memory-bound: kernel fusion, bf16

GPU util > 80% but MFU < 30%?
├─ No mixed precision → Enable bf16
├─ Hand-written attention → F.scaled_dot_product_attention
└─ Misaligned dimensions → Pad to multiples of 8/64
```
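To put a number on MFU, the usual back-of-envelope assumes ~6 FLOPs per parameter per token for dense transformer training (the numbers below are hypothetical):

```python
def train_mfu(n_params, tokens_per_sec, peak_tflops):
    """Model FLOPs Utilization: achieved TFLOPS / hardware peak TFLOPS."""
    achieved_tflops = 6 * n_params * tokens_per_sec / 1e12  # ~6 FLOPs/param/token
    return achieved_tflops / peak_tflops

# Hypothetical: 7B model training at 2,000 tokens/s on one A100 (312 bf16 TFLOPS)
print(f"MFU: {train_mfu(7e9, 2_000, 312):.1%}")  # ~27% → below the 30% line above
```

If the result lands well under 30–40%, work through the branch above before reaching for custom kernels.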
1. Enable AMP (fp32 → bf16 activations ≈ 2x memory saving)
2. Gradient checkpointing on Transformer blocks
3. Smaller batch + gradient accumulation
4. Reduce shape variance (bucketing)
5. FSDP / ZeRO to shard across GPUs
6. CPU offload (last resort)
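Steps 2–3 combined, as a minimal CPU-runnable sketch (the toy `nn.Sequential` stands in for your transformer blocks; segment count and loss are placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy stack standing in for transformer blocks
model = nn.Sequential(*[nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(8)])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4  # effective batch = micro-batch x accum_steps

for step in range(8):
    x = torch.randn(32, 64)
    # Gradient checkpointing: keep only segment-boundary activations,
    # recompute the rest during backward (trades compute for memory)
    out = checkpoint_sequential(model, 4, x, use_reentrant=False)
    loss = out.pow(2).mean() / accum_steps  # scale so accumulated grads match the big batch
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```

Checkpointing roughly halves activation memory for ~30% extra compute per step; accumulation then lets you shrink the micro-batch without changing the effective batch size.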
1. Is communication exposed? (check Nsight Systems timeline)
2. DDP: increase bucket_cap_mb, enable gradient_as_bucket_view
3. Gradient accumulation + no_sync()
4. FSDP: check prefetch settings, sharding strategy
5. Network: NVLink vs PCIe? (nvidia-smi topo -m)
6. Data: balanced distribution across ranks?
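Steps 2–3 in code. A single-process gloo group makes the sketch runnable anywhere; a real job would launch with `torchrun` and use the `nccl` backend:

```python
import contextlib
import os
import tempfile
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process CPU group so the sketch runs without GPUs (real jobs: torchrun + nccl)
init_file = os.path.join(tempfile.mkdtemp(), "pg_init")
dist.init_process_group("gloo", init_method=f"file://{init_file}", rank=0, world_size=1)

model = DDP(
    torch.nn.Linear(32, 32),
    bucket_cap_mb=100,             # fewer, larger all-reduce buckets
    gradient_as_bucket_view=True,  # grads alias bucket storage: saves a copy + memory
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4

for step in range(8):
    sync_now = (step + 1) % accum_steps == 0
    # no_sync() skips the gradient all-reduce on pure accumulation steps
    with model.no_sync() if not sync_now else contextlib.nullcontext():
        loss = model(torch.randn(4, 32)).pow(2).mean() / accum_steps
        loss.backward()
    if sync_now:
        opt.step()
        opt.zero_grad(set_to_none=True)

dist.destroy_process_group()
```

With `no_sync()`, communication happens once per effective batch instead of once per micro-batch, which is often the single biggest win when scaling falls off.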
| GPU | BF16 Tensor TFLOPS | HBM BW (TB/s) | Memory | AI Crossover (FLOP/Byte) |
|---|---|---|---|---|
| A100 40GB | 312 | 1.6 | 40 GB | ~195 |
| A100 80GB | 312 | 2.0 | 80 GB | ~156 |
| A800 80GB | 312 | 2.0 | 80 GB | ~156 |
| H100 SXM | 989 | 3.35 | 80 GB | ~295 |
| H100 PCIe | 756 | 2.0 | 80 GB | ~378 |
| H800 SXM | 989 | 3.35 | 80 GB | ~295 |
| H200 SXM | 989 | 4.8 | 141 GB | ~206 |
| B200 SXM | 4,500 | 8.0 | 192 GB | ~563 |
| B300 SXM | 4,500+ | 8.0 | 288 GB | ~563 |
| L40S | 362 | 0.86 | 48 GB | ~421 |
| GPU | BF16 Tensor TFLOPS | VRAM BW (TB/s) | Memory | AI Crossover (FLOP/Byte) |
|---|---|---|---|---|
| RTX 4090 | 330 | 1.0 | 24 GB | ~330 |
| RTX 4080 | 194 | 0.72 | 16 GB | ~269 |
| RTX 3090 | 142 | 0.94 | 24 GB | ~151 |
Operations below the crossover ratio are memory-bound: fusion matters more than faster compute. Most element-wise ops are ~1–4 FLOP/Byte → almost always memory-bound.

Key insight: H200 vs H100 = same compute but ~43% more bandwidth → better for LLM inference. B200/B300 = ~4.5x compute over H100 but only ~2.4x bandwidth → the crossover point is much higher, making more operations compute-bound on Blackwell.
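The crossover logic is easy to sanity-check with back-of-envelope arithmetic intensity (bf16 operands, ignoring caches):

```python
def matmul_ai(m, n, k, bytes_per_el=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul: read A and B, write C."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_el
    return flops / bytes_moved

# Training-style GEMM: far above A100's ~195 crossover → compute-bound
print(round(matmul_ai(4096, 4096, 4096)))  # ~1365

# Decode-time GEMV (batch 1): AI ≈ 1 → hopelessly memory-bound,
# which is why LLM decoding is dominated by bandwidth, not TFLOPS
print(round(matmul_ai(1, 4096, 4096), 2))
```

This is the quantitative version of the H200-vs-H100 point: batch-1 decoding never comes close to any crossover, so extra bandwidth buys throughput and extra compute mostly sits idle.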
Model with Φ parameters, bf16 + Adam:

```
Parameters:   2Φ  (bf16)
Gradients:    2Φ  (bf16)
Optimizer:   12Φ  (fp32 copy + momentum + variance)
Total:      ~16Φ  (+ activations)

7B model → ~112 GB before activations
→ Won't fit on one 80GB GPU without FSDP/ZeRO
```
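The same budget as a tiny calculator (Adam with bf16 mixed precision; activations excluded):

```python
def train_state_gb(n_params):
    """Persistent training state in GB for bf16 + Adam, excluding activations."""
    weights = 2 * n_params   # bf16 parameters
    grads = 2 * n_params     # bf16 gradients
    optim = 12 * n_params    # fp32 master copy (4) + momentum (4) + variance (4)
    return (weights + grads + optim) / 1e9

print(train_state_gb(7e9))  # 112.0 GB → needs sharding before activations even count
```

Divide by the number of GPUs to estimate the per-rank state under full FSDP/ZeRO-3 sharding.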
Official documentation this guide is based on:
- PyTorch Performance Tuning Guide
- PyTorch torch.compile
- PyTorch AMP
- PyTorch DDP
- PyTorch FSDP
- PyTorch Profiler
- CUDA C++ Best Practices Guide
- NVIDIA Nsight Systems
- NVIDIA Nsight Compute
- FlashAttention
- Triton Tutorials
- vLLM
- DeepSpeed ZeRO
Contributions welcome! If you have a trick that reliably speeds things up and you've measured it, open a PR. Please include:
- What bottleneck it addresses (compute / memory / communication / launch overhead)
- Measured before/after on a specific setup
- Any risks or caveats
If you're publishing or promoting this repository, use the launch kit for:
- GitHub repo description and topic suggestions
- Release checklist and demo ideas
- Ready-to-post launch copy for X, Hacker News, Reddit, LinkedIn, and Chinese dev communities
MIT — use it however you want.
