Fix slow training, low GPU utilization, OOMs, bad multi-GPU scaling, and LLM inference bottlenecks.
English | 中文
This repository is for engineers who already know their model should be faster, but need a practical path to find the bottleneck and fix it.
Use it in two ways:
- Read the playbook when you need a prioritized GPU optimization checklist.
- Install the included Claude skill when you want Claude to reason from the same playbook automatically.
If this saves you debugging time, star the repo so more people can find it.
- Training loops that are slow despite high-end GPUs
- GPUs waiting on the CPU or DataLoader
- OOMs caused by activations, optimizer state, or shape variance
- Multi-GPU jobs with poor scaling efficiency
- LLM inference stacks with bad latency, throughput, or KV cache strategy
- Code reviews where performance bugs hide behind "correct" code
Philosophy: measure first, classify the bottleneck, apply the smallest fix that matters, then verify.
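"Measure first" usually starts with the PyTorch profiler. A minimal, CPU-runnable sketch (the `Linear` model and shapes are placeholders; on a GPU box, add `ProfilerActivity.CUDA` to the activities list):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(256, 256)
x = torch.randn(64, 256)

# CPU-only here for portability; add ProfilerActivity.CUDA on a GPU machine
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        model(x)

# Sort by total time to see what dominates before optimizing anything
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(report)
```

Only after the table tells you where time actually goes do you pick a fix from the guides below.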
| Asset | Why It Matters |
|---|---|
| Training Optimization | P0/P1 defaults for PyTorch training: AMP/bf16, torch.compile, data pipelines, memory, checkpointing |
| Inference Optimization | TTFT/ITL, KV cache, continuous batching, quantization, serving tradeoffs |
| Distributed Training | DDP, FSDP/ZeRO, comm overlap, topology awareness, NCCL tuning |
| Kernel & Low-Level | Triton, FlashAttention, fusion, memory access, roofline thinking |
| Profiling & Tools | Nsight Systems, Nsight Compute, PyTorch Profiler, MFU, diagnostics |
| Anti-Patterns & Checklist | 12 common mistakes, PR review prompts, pre-commit checks |
| Claude Skill | Uploadable skill package for Claude.ai |
- Download `skill/best-gpu-perf.skill`.
- Open Claude.ai -> Settings -> Skills.
- Click "Add Skill" and upload the file.
Then try prompts like:
- "My training is slow on A100. Give me the highest-ROI things to check first."
- "Review this PyTorch training loop for GPU performance issues."
- "I'm hitting OOM on a 7B model. Walk me through the memory budget and likely fixes."
- "My multi-GPU scaling falls off after 4 GPUs. What should I profile?"
- "Help me optimize LLM inference latency and throughput for a vLLM-style stack."
If you want to customize the skill, the source lives in skill/SKILL.md. Rebuild the packaged artifact with:
```shell
./scripts/build_skill.sh
```

| If your problem is... | Read this first |
|---|---|
| Slow training | docs/training.md |
| Slow inference / poor throughput | docs/inference.md |
| Bad DDP/FSDP scaling | docs/distributed.md |
| Kernel hot spots / fusion questions | docs/kernel.md |
| You need profiler commands and interpretation help | docs/profiling.md |
| You want a fast review checklist | docs/checklist.md |
- PyTorch engineers who need faster training without rewriting everything
- LLM infra and serving engineers debugging latency, throughput, or KV cache behavior
- Researchers who want a default optimization checklist before touching custom kernels
- Reviewers who want to catch performance regressions in PRs
- `docs/`: topic-based reference guides
- `skill/SKILL.md`: source for the Claude skill
- `skill/best-gpu-perf.skill`: packaged skill artifact for upload
- `scripts/build_skill.sh`: rebuilds the packaged skill from repo sources
Before doing anything else, make sure your code has all of these. They're low-risk, high-reward, and apply to almost every PyTorch project:
```python
import torch
from torch.utils.data import DataLoader

# 1. Enable TF32 (free speed on Ampere+)
torch.set_float32_matmul_precision("high")

# 2. DataLoader: don't let the GPU wait for data
loader = DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=4,            # NOT 0
    pin_memory=True,          # Faster CPU→GPU transfer
    persistent_workers=True,  # Don't re-fork workers every epoch
)

# 3. Compile the model
model = torch.compile(model.cuda())

# 4. Fused optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, fused=True)

# 5. Training loop with AMP + best practices
for batch in loader:
    batch = {k: v.cuda(non_blocking=True) for k, v in batch.items()}
    optimizer.zero_grad(set_to_none=True)  # Not zero_grad()
    with torch.autocast("cuda", dtype=torch.bfloat16):  # 6. Mixed precision
        loss = model(**batch)
    loss.backward()
    optimizer.step()
    # ❌ Don't: print(loss), call loss.item() every step, or loss.cpu()
```

Also check:
- Hidden dims / vocab size are multiples of 64
- Attention uses `F.scaled_dot_product_attention`, not manual `QK^T V`
- Eval uses `@torch.inference_mode()`
- No `.item()` / `.cpu()` / `print(tensor)` in the training hot path
👉 Full checklist with 12 anti-patterns and PR template → docs/checklist.md
```
GPU utilization < 50%?
├─ CPU at 100%? → DataLoader bottleneck → more workers, pin_memory, pre-cache
└─ CPU idle too? → Launch overhead → torch.compile, larger batch, CUDA Graphs

GPU util 50–80%?
├─ Many tiny kernels → Fusion needed: torch.compile, FlashAttention
└─ HBM bandwidth saturated → Memory-bound: kernel fusion, bf16

GPU util > 80% but MFU < 30%?
├─ No mixed precision → Enable bf16
├─ Hand-written attention → F.scaled_dot_product_attention
└─ Misaligned dimensions → Pad to multiples of 8/64
```
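To put a number on MFU, the usual back-of-envelope assumes ~6 FLOPs per parameter per token for dense transformer training (the numbers below are hypothetical):

```python
def train_mfu(n_params, tokens_per_sec, peak_tflops):
    """Model FLOPs Utilization: achieved TFLOPS / hardware peak TFLOPS."""
    achieved_tflops = 6 * n_params * tokens_per_sec / 1e12  # ~6 FLOPs/param/token
    return achieved_tflops / peak_tflops

# Hypothetical: 7B model training at 2,000 tokens/s on one A100 (312 bf16 TFLOPS)
print(f"MFU: {train_mfu(7e9, 2_000, 312):.1%}")  # ~27% → below the 30% line above
```

If the result lands well under 30–40%, work through the branch above before reaching for custom kernels.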
1. Enable AMP (fp32 → bf16 activations ≈ 2x memory saving)
2. Gradient checkpointing on Transformer blocks
3. Smaller batch + gradient accumulation
4. Reduce shape variance (bucketing)
5. FSDP / ZeRO to shard across GPUs
6. CPU offload (last resort)
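Steps 2–3 combined, as a minimal CPU-runnable sketch (the toy `nn.Sequential` stands in for your transformer blocks; segment count and loss are placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy stack standing in for transformer blocks
model = nn.Sequential(*[nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(8)])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4  # effective batch = micro-batch x accum_steps

for step in range(8):
    x = torch.randn(32, 64)
    # Gradient checkpointing: keep only segment-boundary activations,
    # recompute the rest during backward (trades compute for memory)
    out = checkpoint_sequential(model, 4, x, use_reentrant=False)
    loss = out.pow(2).mean() / accum_steps  # scale so accumulated grads match the big batch
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```

Checkpointing roughly halves activation memory for ~30% extra compute per step; accumulation then lets you shrink the micro-batch without changing the effective batch size.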
1. Is communication exposed? (check Nsight Systems timeline)
2. DDP: increase bucket_cap_mb, enable gradient_as_bucket_view
3. Gradient accumulation + no_sync()
4. FSDP: check prefetch settings, sharding strategy
5. Network: NVLink vs PCIe? (nvidia-smi topo -m)
6. Data: balanced distribution across ranks?
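Steps 2–3 in code. A single-process gloo group makes the sketch runnable anywhere; a real job would launch with `torchrun` and use the `nccl` backend:

```python
import contextlib
import os
import tempfile
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process CPU group so the sketch runs without GPUs (real jobs: torchrun + nccl)
init_file = os.path.join(tempfile.mkdtemp(), "pg_init")
dist.init_process_group("gloo", init_method=f"file://{init_file}", rank=0, world_size=1)

model = DDP(
    torch.nn.Linear(32, 32),
    bucket_cap_mb=100,             # fewer, larger all-reduce buckets
    gradient_as_bucket_view=True,  # grads alias bucket storage: saves a copy + memory
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4

for step in range(8):
    sync_now = (step + 1) % accum_steps == 0
    # no_sync() skips the gradient all-reduce on pure accumulation steps
    with model.no_sync() if not sync_now else contextlib.nullcontext():
        loss = model(torch.randn(4, 32)).pow(2).mean() / accum_steps
        loss.backward()
    if sync_now:
        opt.step()
        opt.zero_grad(set_to_none=True)

dist.destroy_process_group()
```

With `no_sync()`, communication happens once per effective batch instead of once per micro-batch, which is often the single biggest win when scaling falls off.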
| GPU | BF16 Tensor TFLOPS | HBM BW (TB/s) | Memory | AI Crossover (FLOP/Byte) |
|---|---|---|---|---|
| A100 40GB | 312 | 1.6 | 40 GB | ~195 |
| A100 80GB | 312 | 2.0 | 80 GB | ~156 |
| A800 80GB | 312 | 2.0 | 80 GB | ~156 |
| H100 SXM | 989 | 3.35 | 80 GB | ~295 |
| H100 PCIe | 756 | 2.0 | 80 GB | ~378 |
| H800 SXM | 989 | 3.35 | 80 GB | ~295 |
| H200 SXM | 989 | 4.8 | 141 GB | ~206 |
| B200 SXM | 4,500 | 8.0 | 192 GB | ~563 |
| B300 SXM | 4,500+ | 8.0 | 288 GB | ~563 |
| L40S | 362 | 0.86 | 48 GB | ~421 |
| GPU | BF16 Tensor TFLOPS | VRAM BW (TB/s) | Memory | AI Crossover (FLOP/Byte) |
|---|---|---|---|---|
| RTX 4090 | 330 | 1.0 | 24 GB | ~330 |
| RTX 4080 | 194 | 0.72 | 16 GB | ~269 |
| RTX 3090 | 142 | 0.94 | 24 GB | ~151 |
Operations below the crossover ratio are memory-bound: fusion matters more than faster compute. Most element-wise ops are ~1–4 FLOP/Byte → almost always memory-bound.

Key insight: H200 vs H100 = same compute but ~43% more bandwidth → better for LLM inference. B200/B300 = ~4.5x compute over H100 but only ~2.4x bandwidth → the crossover point is much higher, making more operations compute-bound on Blackwell.
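The crossover logic is easy to sanity-check with back-of-envelope arithmetic intensity (bf16 operands, ignoring caches):

```python
def matmul_ai(m, n, k, bytes_per_el=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul: read A and B, write C."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_el
    return flops / bytes_moved

# Training-style GEMM: far above A100's ~195 crossover → compute-bound
print(round(matmul_ai(4096, 4096, 4096)))  # ~1365

# Decode-time GEMV (batch 1): AI ≈ 1 → hopelessly memory-bound,
# which is why LLM decoding is dominated by bandwidth, not TFLOPS
print(round(matmul_ai(1, 4096, 4096), 2))
```

This is the quantitative version of the H200-vs-H100 point: batch-1 decoding never comes close to any crossover, so extra bandwidth buys throughput and extra compute mostly sits idle.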
Model with Φ parameters, bf16 + Adam:

```
Parameters:   2Φ  (bf16)
Gradients:    2Φ  (bf16)
Optimizer:   12Φ  (fp32 copy + momentum + variance)
Total:      ~16Φ  (+ activations)

7B model → ~112 GB before activations
→ Won't fit on one 80GB GPU without FSDP/ZeRO
```
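The same budget as a tiny calculator (Adam with bf16 mixed precision; activations excluded):

```python
def train_state_gb(n_params):
    """Persistent training state in GB for bf16 + Adam, excluding activations."""
    weights = 2 * n_params   # bf16 parameters
    grads = 2 * n_params     # bf16 gradients
    optim = 12 * n_params    # fp32 master copy (4) + momentum (4) + variance (4)
    return (weights + grads + optim) / 1e9

print(train_state_gb(7e9))  # 112.0 GB → needs sharding before activations even count
```

Divide by the number of GPUs to estimate the per-rank state under full FSDP/ZeRO-3 sharding.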
Official documentation this guide is based on:
- PyTorch Performance Tuning Guide
- PyTorch torch.compile
- PyTorch AMP
- PyTorch DDP
- PyTorch FSDP
- PyTorch Profiler
- CUDA C++ Best Practices Guide
- NVIDIA Nsight Systems
- NVIDIA Nsight Compute
- FlashAttention
- Triton Tutorials
- vLLM
- DeepSpeed ZeRO
Contributions welcome! If you have a trick that reliably speeds things up and you've measured it, open a PR. Please include:
- What bottleneck it addresses (compute / memory / communication / launch overhead)
- Measured before/after on a specific setup
- Any risks or caveats
If you're publishing or promoting this repository, use the launch kit for:
- GitHub repo description and topic suggestions
- Release checklist and demo ideas
- Ready-to-post launch copy for X, Hacker News, Reddit, LinkedIn, and Chinese dev communities
MIT — use it however you want.
