
Add corrected latency model design plans #2

Draft
susiejojo wants to merge 6 commits into inference-sim:main from susiejojo:add/model-design-plans

Conversation

susiejojo (Collaborator) commented Mar 6, 2026

Summary

  • Features design (corrected-roofline-features-design.md): 7-feature physically-motivated step time prediction formula covering dense (Llama), MoE (Mixtral), and MoE+MLA (DeepSeek-V2/V3) architectures. 10 total learnable parameters (7 beta + 3 alpha).
  • Training workloads (corrected-roofline-training-workloads.md): 5 workload profiles (prefill sweep, decode context sweep, batch scaling, prefill/decode ratio mix, TP scaling) designed for beta identifiability. ~180K+ steps across 4 model families.

Test plan

  • Review feature formulas against vLLM v0.8+ source
  • Validate MLA absorbed attention math against DeepSeek-V2/V3 papers
  • Confirm workload profiles are expressible via inference-perf config

Review results

All formulas verified correct against vLLM source, DeepSeek-V2/V3 papers, and published roofline literature (Pope et al., DuetServe, Vidur). 11 issues found and fixed in f768782:

Bugs fixed

  1. YAML schema — workload configs used an incorrect inference_perf: wrapper; fixed to the actual inference-perf format (top-level load:/data:/api:/server: keys)
  2. Empty batch degenerate case — claimed StepTime = β₇, but F_weight_static is nonzero at B=0; corrected
  3. Missing ignore_eos: true — added to all workload configs (critical for controlled output lengths)
  4. W2 step count — was ~3K, actually ~180K (60 requests × 512 steps × 6 sweep points)
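A minimal sketch of the corrected config shape, expressed as a Python dict for illustration — the four top-level keys and server.ignore_eos come from the fixes above, but the field names inside each section are placeholders, not the verified inference-perf schema:

```python
# Illustrative shape only: top-level load/data/api/server keys (no
# inference_perf: wrapper), with ignore_eos under server. Inner field
# names are placeholders, not the exact inference-perf schema.
workload_config = {
    "load": {
        # request arrival pattern for one sweep point (values illustrative)
        "rate": 32,
        "duration_sec": 120,
    },
    "data": {
        # controlled prompt/output lengths for the sweep (values illustrative)
        "input_tokens": 2048,
        "output_tokens": 512,
    },
    "api": {
        "type": "completion",
    },
    "server": {
        # critical for controlled output lengths: generation must not stop
        # early at an EOS token
        "ignore_eos": True,
    },
}

# The old (wrong) shape nested everything under an inference_perf: wrapper.
assert "inference_perf" not in workload_config
```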

Documentation gaps fixed

  1. Added moe_layer_freq config field; generalized L_moe formula
  2. Clarified ~57× KV compression ratio uses MLA effective head dim (128), not d/H (56)
  3. Added W_DKV to DeepSeek-V3 F_pf_compute in architecture matrix
  4. Added dense_ffn (first_k_dense layers) to DeepSeek-V3 F_weight_static
  5. Fixed d_ff description: "Dense FFN" only (shared experts use d_ff_expert)
  6. Fixed W4 Profile A concurrency (rate 32→128, output_len 16→4)
  7. Renamed batch_size → num_requests_per_step to avoid ambiguity with token counts

🤖 Generated with Claude Code

Feature design and training workload specs for physically-motivated
step time prediction: 7 features covering dense, MoE, and MLA architectures.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
susiejojo (Collaborator, Author) commented Mar 6, 2026

Intuition: what the formula does and why

The core idea

Every vLLM scheduler step does a mix of GPU compute, memory reads, and communication. The formula predicts step time by summing physically-motivated time estimates for each resource, each corrected by a learned scalar:

StepTime = β₁ · max(compute, bandwidth)_prefill
         + β₂ · max(compute, bandwidth)_decode
         + β₃ · weight_load_static
         + β₄ · weight_load_moe
         + β₅ · tp_allreduce
         + β₆ · num_requests_per_step
         + β₇

Why this works

Each feature (F_pf_compute, F_dc_kv, etc.) is calculated as "how long would this take at 100% hardware utilization?" — FLOPs divided by peak TFlops, or bytes divided by peak bandwidth. These are always optimistic (real hardware never hits peak), so the β coefficients correct for reality (β > 1 means "hardware runs at 1/β efficiency").

The max(compute, bandwidth) per phase is the roofline model: the GPU is bottlenecked by whichever resource saturates first. Prefill is typically compute-bound (large matrix multiplies); decode is typically memory-bound (reading KV cache for each token). The formula handles both regimes with the same structure.
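As a sketch, the prediction is just a β-weighted sum with a per-phase max. The feature values and βs below are made up for illustration; only the formula structure comes from the design doc:

```python
# Hedged sketch of the 7-feature roofline prediction. Each feature is an
# optimistic "time at 100% utilization" estimate in seconds; the betas
# correct each term for real hardware efficiency.

def predict_step_time(feats: dict, beta: list) -> float:
    """Sum of beta-corrected optimistic time estimates (seconds)."""
    x = [
        max(feats["pf_compute"], feats["pf_bandwidth"]),  # prefill roofline
        max(feats["dc_compute"], feats["dc_bandwidth"]),  # decode roofline
        feats["weight_static"],   # static (dense/attention) weight loading
        feats["weight_moe"],      # routed-expert weight loading
        feats["tp_allreduce"],    # tensor-parallel communication
        feats["num_requests"],    # per-request scheduler/CPU overhead
        1.0,                      # constant step overhead (beta_7)
    ]
    return sum(b * xi for b, xi in zip(beta, x))

# Example: a decode-only step where bandwidth dominates (illustrative numbers).
feats = {"pf_compute": 0.0, "pf_bandwidth": 0.0,
         "dc_compute": 0.8e-3, "dc_bandwidth": 2.4e-3,
         "weight_static": 1.1e-3, "weight_moe": 0.0,
         "tp_allreduce": 0.2e-3, "num_requests": 64}
beta = [1.3, 1.4, 1.2, 1.5, 2.0, 2e-6, 1e-4]
t = predict_step_time(feats, beta)
```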

How it generalizes across architectures

The formula is a sum of components that zero out when they don't apply:

| Component | Dense (Llama) | MoE (Mixtral) | MoE+MLA (DeepSeek-V3) |
| --- | --- | --- | --- |
| Prefill/decode compute+bandwidth | ✓ | ✓ | ✓ (with absorbed attention) |
| Static weight loading | attn + FFN | attn only | attn + shared experts |
| MoE weight loading | 0 | N_eff experts | N_eff experts |
| KV cache bandwidth | 2·kv_heads·d_h | same | compressed: kv_lora_rank + qk_rope_head_dim (~57× smaller) |

A single set of 7 βs trained on mixed-architecture data captures all three families. The physics features encode what's different about each architecture; the βs encode how efficient the hardware is at each type of work.

The key insight vs. a blackbox model

A blackbox model (e.g., StepTime ~ f(num_prefill, num_decode, num_requests)) learns correlations specific to one model on one GPU. The roofline features inject causal structure: if you double the hidden dimension, the formula knows compute doubles and weight loading doubles — without retraining. The βs only need re-fitting when the hardware changes (and even then, 2-scalar few-shot calibration often suffices).
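One consequence of this structure: because StepTime is linear in the βs, re-fitting them on new hardware is a plain least-squares problem over the 7-column feature matrix. A sketch on synthetic data (all values illustrative):

```python
# Hedged sketch: fitting the betas by ordinary least squares on synthetic
# measurements. Real training would use measured per-step features/times.
import numpy as np

rng = np.random.default_rng(0)
n_steps = 500
true_beta = np.array([1.3, 1.4, 1.2, 1.5, 2.0, 2e-6, 1e-4])

# X: one row per scheduler step, columns ordered as in the formula
# (prefill roofline, decode roofline, static weights, MoE weights,
#  allreduce, num_requests_per_step, constant 1 for beta_7).
X = np.abs(rng.normal(1e-3, 5e-4, size=(n_steps, 7)))
X[:, 5] = rng.integers(1, 256, size=n_steps)   # num_requests_per_step
X[:, 6] = 1.0                                  # bias column for beta_7
y = X @ true_beta + rng.normal(0, 1e-5, size=n_steps)  # "measured" step times

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```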

@susiejojo susiejojo marked this pull request as draft March 6, 2026 20:09
Features design doc:
- Fix empty batch degenerate case (F_weight_static nonzero at B=0)
- Add moe_layer_freq config field and generalize L_moe formula
- Clarify ~57x KV compression ratio uses effective head dim, not d/H
- Add W_DKV to DeepSeek-V3 F_pf_compute in architecture matrix
- Add dense_ffn (first_k_dense layers) to DeepSeek-V3 F_weight_static
- Fix d_ff description: "Dense FFN" only (shared experts use d_ff_expert)
- Rename batch_size → num_requests_per_step to avoid token count ambiguity

Training workloads doc:
- Fix YAML schema to match actual inference-perf config format
  (top-level load/data/api/server keys, not inference_perf wrapper)
- Add server.ignore_eos: true to all configs
- Fix W2 step count estimate (~3K → ~180K)
- Fix W4 Profile A concurrency (rate 32→128, output_len 16→4)
- Add data processing note about filtering empty steps
- Clarify rate→batch_size mapping is indirect and model-dependent

Co-Authored-By: Claude Opus 4.6 <[email protected]>
susiejojo (Collaborator, Author)

Test plan review results

Ran 5 parallel verification agents against the design documents, cross-referencing:

  • vLLM v0.8+ source code (deepseek_v2.py, llama.py, fused_moe/layer.py)
  • DeepSeek-V2 paper (arXiv 2405.04434) and DeepSeek-V3 paper (arXiv 2412.19437)
  • DeepSeek-V3 config.json on HuggingFace
  • inference-perf repo config schema and examples
  • Published roofline literature (Pope et al. 2022, DuetServe, Vidur, JAX Scaling Book)

Verification scope

| Agent | Focus | Verdict |
| --- | --- | --- |
| 1 | Standard MHA/GQA + dense FFN formulas vs vLLM | ✅ All correct |
| 2 | MoE formulas (routed/shared experts, N_eff) vs vLLM | ✅ All correct |
| 3 | MLA absorbed attention math vs DeepSeek papers | ✅ All correct |
| 4 | Workload profiles vs inference-perf config | ❌ YAML schema wrong |
| 5 | Numerical worked examples + architecture matrix | ✅ Arithmetic correct, 4 doc gaps |

Key findings

All physics/math is correct:

  • Every FLOP formula, bandwidth formula, and weight loading formula verified
  • All 7 worked example values match exact arithmetic
  • All DeepSeek-V3 config parameters confirmed against HuggingFace
  • MLA absorption mechanics verified against DeepSeek-V2 paper and vLLM implementation
  • ~57× KV compression ratio confirmed (32,768 / 576 = 56.9×)
  • Only approximation: Q projection as d² (~5% overestimate, explicitly documented as S8)
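The ~57× figure is straightforward to re-derive from the config values cited above (kv_lora_rank 512, qk_rope_head_dim 64, and an MHA-equivalent KV of 128 heads × 128 head dim); treat the specific numbers here as assumptions matching the published DeepSeek-V3 config.json:

```python
# Quick check of the ~57x MLA KV compression ratio (numbers assumed to
# match DeepSeek-V3's published config.json).
num_heads = 128          # attention heads
head_dim = 128           # per-head dimension (MHA-equivalent KV)
kv_lora_rank = 512       # compressed latent KV dimension
qk_rope_head_dim = 64    # decoupled RoPE key dim, cached per token

# Elements per token per layer: K + V for full MHA vs the MLA cache.
uncompressed = 2 * num_heads * head_dim          # 32768
compressed = kv_lora_rank + qk_rope_head_dim     # 576

ratio = uncompressed / compressed                # ~56.9
```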

11 issues found and fixed in f768782:

| # | Severity | Fix |
| --- | --- | --- |
| 1 | High | YAML configs → actual inference-perf schema |
| 2 | Medium | Empty batch: β₃·F_weight_static + β₄·F_weight_moe + β₇ (not just β₇) |
| 3 | Medium | Added server.ignore_eos: true to all configs |
| 4 | Low | W2 step count: ~3K → ~180K |
| 5 | Low | Added moe_layer_freq config field |
| 6 | Low | Clarified ~57× uses effective head dim 128, not d/H=56 |
| 7 | Low | Added W_DKV to DeepSeek-V3 F_pf_compute in matrix |
| 8 | Low | Added dense_ffn to DeepSeek-V3 F_weight_static in matrix |
| 9 | Low | d_ff → "Dense FFN" only |
| 10 | Low | W4-A: rate 32→128, output_len 16→4 for sufficient concurrency |
| 11 | Low | batch_size → num_requests_per_step |

susiejojo and others added 2 commits March 9, 2026 11:01
220 runs across 10 (model, TP) combos × 22 sweeps per combo.
Links to workload specs in corrected-roofline-training-workloads.md.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…rkloads

- Data collection plan: 99 runs (90 training + 9 validation) across
  6 training combos + 3 validation models, ~3.3 hours sequential
- Active learning pipeline: iterative diagnose→target→refit loop
  with β health checks, residual analysis, and event-driven triggers
- Workloads: reduced to 15 runs/combo (4 W1 + 4 W2 + 5 W3 + 2 W4),
  added V1 validation sweep, renamed title to include validation
- Renamed batch_size → num_requests_per_step throughout

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@susiejojo susiejojo changed the title Add corrected roofline model design plans Add corrected latency model design plans Mar 9, 2026
susiejojo and others added 2 commits March 9, 2026 14:11
Saturated steps (KV cache >95%, preemption events) should be filtered
from β training data since they conflate GPU physics with scheduler
policy. α fitting data section documents that journey tracing events
from the same runs are used for queueing/preprocessing/output delay
coefficients.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
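A sketch of that saturated-step filter, with hypothetical per-step field names (the real trace schema may differ):

```python
# Hedged sketch: drop steps where KV cache utilization exceeds 95% or a
# preemption occurred, since those conflate GPU physics with scheduler
# policy. Field names ("kv_cache_util", "num_preempted") are illustrative.
steps = [
    {"step_time": 0.004, "kv_cache_util": 0.62, "num_preempted": 0},
    {"step_time": 0.019, "kv_cache_util": 0.97, "num_preempted": 2},  # saturated
    {"step_time": 0.005, "kv_cache_util": 0.88, "num_preempted": 0},
]

def usable_for_beta_fit(step: dict) -> bool:
    return step["kv_cache_util"] <= 0.95 and step["num_preempted"] == 0

beta_training = [s for s in steps if usable_for_beta_fit(s)]
```

The saturated steps are not discarded outright; as the next commit notes, they feed the α (queueing/delay) fit instead.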
High-rate steps aren't "collected then filtered" — they serve α fitting
(queueing delays need saturation to have signal). Low-rate steps serve
β fitting. The rate sweep is a dual-purpose design where each end of
the range targets different model parameters.

Co-Authored-By: Claude Opus 4.6 <[email protected]>