phtphtpht/gpu-perf-playbook

GPU Perf Playbook for PyTorch and LLM Workloads

Fix slow training, low GPU utilization, OOMs, bad multi-GPU scaling, and LLM inference bottlenecks.

English | 中文


This repository is for engineers who already know their model should be faster, but need a practical path to find the bottleneck and fix it.

Use it in two ways:

  • Read the playbook when you need a prioritized GPU optimization checklist.
  • Install the included Claude skill when you want Claude to reason from the same playbook automatically.

If this saves you debugging time, star the repo so more people can find it.


What This Helps You Fix

  • Training loops that are slow despite high-end GPUs
  • GPUs waiting on the CPU or DataLoader
  • OOMs caused by activations, optimizer state, or shape variance
  • Multi-GPU jobs with poor scaling efficiency
  • LLM inference stacks with bad latency, throughput, or KV cache strategy
  • Code reviews where performance bugs hide behind "correct" code

Philosophy: measure first, classify the bottleneck, apply the smallest fix that matters, then verify.
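That loop starts with a single profiling pass. A minimal sketch using `torch.profiler` (the helper name `profile_step` is ours, not from the repo):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_step(model, batch, steps=3):
    """Run a few forward/backward passes under the profiler to
    classify the bottleneck before touching any code."""
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities) as prof:
        for _ in range(steps):
            loss = model(batch).sum()
            loss.backward()
    # Many tiny kernels -> launch overhead / missing fusion;
    # a few long kernels -> compute- or memory-bound;
    # large gaps between kernels -> CPU or DataLoader stalls.
    sort_key = "cuda_time_total" if torch.cuda.is_available() else "self_cpu_time_total"
    print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```

Reading the table top-down usually classifies the bottleneck before you change a single line of model code.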


What You Get

| Asset | Why It Matters |
| --- | --- |
| Training Optimization | P0/P1 defaults for PyTorch training: AMP/bf16, torch.compile, data pipelines, memory, checkpointing |
| Inference Optimization | TTFT/ITL, KV cache, continuous batching, quantization, serving tradeoffs |
| Distributed Training | DDP, FSDP/ZeRO, comm overlap, topology awareness, NCCL tuning |
| Kernel & Low-Level | Triton, FlashAttention, fusion, memory access, roofline thinking |
| Profiling & Tools | Nsight Systems, Nsight Compute, PyTorch Profiler, MFU, diagnostics |
| Anti-Patterns & Checklist | 12 common mistakes, PR review prompts, pre-commit checks |
| Claude Skill | Uploadable skill package for Claude.ai |

Install In Claude In 60 Seconds

  1. Download skill/best-gpu-perf.skill.
  2. Open Claude.ai -> Settings -> Skills.
  3. Click Add Skill and upload the file.

Then try prompts like:

  • My training is slow on A100. Give me the highest-ROI things to check first.
  • Review this PyTorch training loop for GPU performance issues.
  • I'm hitting OOM on a 7B model. Walk me through the memory budget and likely fixes.
  • My multi-GPU scaling falls off after 4 GPUs. What should I profile?
  • Help me optimize LLM inference latency and throughput for a vLLM-style stack.

If you want to customize the skill, the source lives in skill/SKILL.md. Rebuild the packaged artifact with:

./scripts/build_skill.sh

Start Here

| If your problem is... | Read this first |
| --- | --- |
| Slow training | docs/training.md |
| Slow inference / poor throughput | docs/inference.md |
| Bad DDP/FSDP scaling | docs/distributed.md |
| Kernel hot spots / fusion questions | docs/kernel.md |
| You need profiler commands and interpretation help | docs/profiling.md |
| You want a fast review checklist | docs/checklist.md |

Who This Is For

  • PyTorch engineers who need faster training without rewriting everything
  • LLM infra and serving engineers debugging latency, throughput, or KV cache behavior
  • Researchers who want a default optimization checklist before touching custom kernels
  • Reviewers who want to catch performance regressions in PRs

Repository Layout

  • docs/: topic-based reference guides
  • skill/SKILL.md: source for the Claude skill
  • skill/best-gpu-perf.skill: packaged skill artifact for upload
  • scripts/build_skill.sh: rebuilds the packaged skill from repo sources

⚡ Quick Start: The P0 Checklist

Before doing anything else, make sure your code has all of these. They're low-risk, high-reward, and apply to almost every PyTorch project:

import torch, os
from torch.utils.data import DataLoader

# 1. Enable TF32 (free speed on Ampere+)
torch.set_float32_matmul_precision("high")

# 2. DataLoader: don't let GPU wait for data
loader = DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=4,                # NOT 0
    pin_memory=True,              # Faster CPU→GPU transfer
    persistent_workers=True,      # Don't re-fork every epoch
)

# 3. Compile the model
model = torch.compile(model.cuda())

# 4. Fused optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, fused=True)

# 5. Training loop with AMP + best practices
for batch in loader:
    batch = {k: v.cuda(non_blocking=True) for k, v in batch.items()}
    optimizer.zero_grad(set_to_none=True)         # Not zero_grad()

    with torch.autocast("cuda", dtype=torch.bfloat16):  # 6. Mixed precision
        loss = model(**batch)

    loss.backward()
    optimizer.step()
    # ❌ Don't: print(loss), loss.item() every step, loss.cpu()

Also check:

  • Hidden dims / vocab size are multiples of 64
  • Attention uses F.scaled_dot_product_attention, not a manual softmax(QK^T)V
  • Eval uses @torch.inference_mode()
  • No .item() / .cpu() / print(tensor) in the training hot path
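The attention and eval items side by side, as a minimal sketch (function names are ours):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # Fused attention: dispatches to FlashAttention / memory-efficient
    # kernels when shapes and dtypes allow, and never materializes the
    # O(seq^2) score matrix that a manual softmax(QK^T)V does.
    # Expects (batch, heads, seq, head_dim) tensors.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

@torch.inference_mode()  # cheaper than no_grad: skips all autograd bookkeeping
def evaluate(model, batch):
    return model(batch)
```

`inference_mode` tensors cannot later be used in autograd, which is exactly what you want for eval but not for anything feeding a backward pass.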

👉 Full checklist with 12 anti-patterns and PR template → docs/checklist.md


🧭 Decision Trees

My training is slow

GPU utilization < 50%?
├─ CPU at 100%? → DataLoader bottleneck → more workers, pin_memory, pre-cache
├─ CPU idle too? → Launch overhead → torch.compile, larger batch, CUDA Graphs
│
GPU util 50–80%?
├─ Many tiny kernels → Fusion needed: torch.compile, FlashAttention
├─ HBM bandwidth saturated → Memory-bound: kernel fusion, bf16
│
GPU util > 80% but MFU < 30%?
├─ No mixed precision → Enable bf16
├─ Hand-written attention → F.scaled_dot_product_attention
└─ Misaligned dimensions → Pad to multiples of 8/64
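To put a number on the MFU branch, a back-of-envelope estimate using the standard ~6 FLOPs per parameter per trained token for decoder-style transformers (the helper is ours; `peak_tflops` defaults to A100 bf16 from the table below):

```python
def mfu(params: int, tokens_per_step: int, step_time_s: float,
        peak_tflops: float = 312.0) -> float:
    """Rough Model FLOPs Utilization for a decoder-style transformer.

    Forward + backward cost ~6 * params FLOPs per trained token;
    divide achieved FLOP/s by the hardware peak.
    """
    achieved_flops_per_s = 6 * params * tokens_per_step / step_time_s
    return achieved_flops_per_s / (peak_tflops * 1e12)
```

For example, a 7B model pushing 32k tokens per 11-second step on one A100 sits near 40% MFU; well-tuned training runs typically land in the 30-50% range.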

I'm getting OOM

1. Enable AMP (fp32 → bf16 = ~2x memory saving)
2. Gradient checkpointing on Transformer blocks
3. Smaller batch + gradient accumulation
4. Reduce shape variance (bucketing)
5. FSDP / ZeRO to shard across GPUs
6. CPU offload (last resort)
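Steps 1-3 combine naturally in one loop. A sketch assuming `blocks` is a list of transformer blocks and the per-micro-batch loss is a placeholder:

```python
import torch
from torch.utils.checkpoint import checkpoint

def train_step(blocks, micro_batches, optimizer):
    """AMP + gradient checkpointing + gradient accumulation (steps 1-3)."""
    optimizer.zero_grad(set_to_none=True)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    for x in micro_batches:
        with torch.autocast(device, dtype=torch.bfloat16):
            for block in blocks:
                # Recompute this block's activations during backward
                # instead of storing them: trades roughly one extra
                # forward pass for a large activation-memory win.
                x = checkpoint(block, x, use_reentrant=False)
            loss = x.float().mean()  # placeholder loss
        # Scale so the accumulated gradient matches one big batch.
        (loss / len(micro_batches)).backward()
    optimizer.step()
```

The effective batch size stays the same, so learning-rate schedules do not need to change; only peak memory drops.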

Multi-GPU scaling is poor

1. Is communication exposed? (check Nsight Systems timeline)
2. DDP: increase bucket_cap_mb, enable gradient_as_bucket_view
3. Gradient accumulation + no_sync()
4. FSDP: check prefetch settings, sharding strategy
5. Network: NVLink vs PCIe? (nvidia-smi topo -m)
6. Data: balanced distribution across ranks?
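Steps 2 and 3 as code (a sketch that assumes `torch.distributed.init_process_group()` has already run; the helper names are ours):

```python
import contextlib
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def is_sync_step(step: int, accum_steps: int) -> bool:
    """True on steps where gradients should be all-reduced."""
    return (step + 1) % accum_steps == 0

def wrap_ddp(model: torch.nn.Module) -> DDP:
    # Requires an initialized process group.
    return DDP(
        model,
        bucket_cap_mb=100,             # default 25: larger buckets, fewer all-reduces
        gradient_as_bucket_view=True,  # grads alias bucket storage, saving a copy
    )

def backward_maybe_sync(ddp_model, loss, step, accum_steps):
    # no_sync() skips the gradient all-reduce on accumulation micro-steps,
    # so communication happens once per effective batch instead of once
    # per micro-batch.
    ctx = (contextlib.nullcontext() if is_sync_step(step, accum_steps)
           else ddp_model.no_sync())
    with ctx:
        loss.backward()
```

If Nsight Systems still shows exposed communication after this, look at FSDP prefetch settings and the NVLink-vs-PCIe topology (`nvidia-smi topo -m`).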

🔢 Key Numbers

Data Center GPUs

| GPU | BF16 Tensor TFLOPS | HBM BW (TB/s) | Memory | AI Crossover (FLOP/Byte) |
| --- | --- | --- | --- | --- |
| A100 40GB | 312 | 1.6 | 40 GB | ~195 |
| A100 80GB | 312 | 2.0 | 80 GB | ~156 |
| A800 80GB | 312 | 2.0 | 80 GB | ~156 |
| H100 SXM | 989 | 3.35 | 80 GB | ~295 |
| H100 PCIe | 756 | 2.0 | 80 GB | ~378 |
| H800 SXM | 989 | 3.35 | 80 GB | ~295 |
| H200 SXM | 989 | 4.8 | 141 GB | ~206 |
| B200 SXM | 4,500 | 8.0 | 192 GB | ~563 |
| B300 SXM | 4,500+ | 8.0 | 288 GB | ~563 |
| L40S | 362 | 0.86 | 48 GB | ~421 |

Consumer / Workstation GPUs

| GPU | BF16 Tensor TFLOPS | VRAM BW (TB/s) | Memory | AI Crossover (FLOP/Byte) |
| --- | --- | --- | --- | --- |
| RTX 4090 | 330 | 1.0 | 24 GB | ~330 |
| RTX 4080 | 194 | 0.72 | 16 GB | ~269 |
| RTX 3090 | 142 | 0.94 | 24 GB | ~151 |

Operations below the crossover ratio are memory-bound — fusion matters more than faster compute.
Most element-wise ops are ~1–4 FLOP/Byte → almost always memory-bound.

Key insight: H200 vs H100 = same compute but 43% more bandwidth → better for LLM inference.
B200/B300 = ~4.5x compute over H100 but only ~2.4x bandwidth → the crossover point is much higher, making more operations compute-bound on Blackwell.
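The crossover column is just peak FLOP/s divided by peak bytes/s. On the operation side, a back-of-envelope arithmetic intensity for a bf16 matmul (ignoring cache reuse; the helper is ours):

```python
def arithmetic_intensity_matmul(m: int, n: int, k: int,
                                bytes_per_el: int = 2) -> float:
    """FLOP/Byte for an (m,k) @ (k,n) matmul in bf16 (2 bytes/element).

    Assumes each operand and the output cross HBM exactly once,
    which ignores cache reuse but is good enough for roofline triage.
    """
    flops = 2 * m * n * k                                  # multiply-add per output
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)   # A + B + C
    return flops / bytes_moved
```

A 4096-cubed matmul lands around 1,365 FLOP/Byte, comfortably compute-bound on every GPU in the tables. A batch-1 decode-style matvec (m=1, n=k=4096) lands near 1 FLOP/Byte, deep in memory-bound territory, which is exactly why LLM decoding tracks bandwidth and favors H200 over H100.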

Memory Budget (Training)

Model with Φ parameters, bf16 + Adam:
  Parameters:  2Φ     (bf16)
  Gradients:   2Φ     (bf16)
  Optimizer:   12Φ    (fp32 copy + momentum + variance)
  Total:       ~16Φ   (+ activations)

  7B model → ~112 GB before activations
  → Won't fit on one 80GB GPU without FSDP/ZeRO
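The same budget as a quick calculator (decimal GB, matching the figures above; the helper is ours):

```python
def training_memory_gb(params: float) -> dict:
    """Per-component memory for bf16 training with Adam.

    Excludes activations, which depend on batch size, sequence
    length, and whether gradient checkpointing is enabled.
    """
    gb = 1e9  # decimal GB
    return {
        "params_bf16":    2 * params / gb,
        "grads_bf16":     2 * params / gb,
        "optimizer_fp32": 12 * params / gb,  # master copy + momentum + variance
        "total":          16 * params / gb,
    }
```

`training_memory_gb(7e9)["total"]` gives the 112 GB figure above, before a single activation is stored.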



Contributing

Contributions welcome! If you have a trick that reliably speeds things up and you've measured it, open a PR. Please include:

  • What bottleneck it addresses (compute / memory / communication / launch overhead)
  • Measured before/after on a specific setup
  • Any risks or caveats

If you're publishing or promoting this repository, use the launch kit for:

  • GitHub repo description and topic suggestions
  • Release checklist and demo ideas
  • ready-to-post launch copy for X, Hacker News, Reddit, LinkedIn, and Chinese dev communities

License

MIT — use it however you want.
