
Add corrected latency model design plans #2

Draft
susiejojo wants to merge 6 commits into inference-sim:main from susiejojo:add/model-design-plans

Conversation

susiejojo (Collaborator) commented Mar 6, 2026

Summary

  • Features design (corrected-roofline-features-design.md): 7-feature physically-motivated step time prediction formula covering dense (Llama), MoE (Mixtral), and MoE+MLA (DeepSeek-V2/V3) architectures. 10 total learnable parameters (7 beta + 3 alpha).
  • Training workloads (corrected-roofline-training-workloads.md): 5 workload profiles (prefill sweep, decode context sweep, batch scaling, prefill/decode ratio mix, TP scaling) designed for beta identifiability. ~180K+ steps across 4 model families.

Test plan

  • Review feature formulas against vLLM v0.8+ source
  • Validate MLA absorbed attention math against DeepSeek-V2/V3 papers
  • Confirm workload profiles are expressible via inference-perf config

Review results

All formulas verified correct against vLLM source, DeepSeek-V2/V3 papers, and published roofline literature (Pope et al., DuetServe, Vidur). 11 issues found and fixed in f768782:

Bugs fixed

  1. YAML schema — workload configs used an incorrect inference_perf: wrapper; fixed to the actual inference-perf format (top-level load:/data:/api:/server: keys)
  2. Empty batch degenerate case — claimed StepTime = β₇, but F_weight_static is nonzero at B=0; corrected
  3. Missing ignore_eos: true — added to all workload configs (critical for controlled output lengths)
  4. W2 step count — was ~3K, actually ~180K (60 requests × 512 steps × 6 sweep points)
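A minimal sketch of the corrected config shape, expressed as a Python dict for illustration — the four top-level keys and server.ignore_eos come from the fixes above, but the field names inside each section are placeholders, not the verified inference-perf schema:

```python
# Illustrative shape only: top-level load/data/api/server keys (no
# inference_perf: wrapper), with ignore_eos under server. Inner field
# names are placeholders, not the exact inference-perf schema.
workload_config = {
    "load": {
        # request arrival pattern for one sweep point (values illustrative)
        "rate": 32,
        "duration_sec": 120,
    },
    "data": {
        # controlled prompt/output lengths for the sweep (values illustrative)
        "input_tokens": 2048,
        "output_tokens": 512,
    },
    "api": {
        "type": "completion",
    },
    "server": {
        # critical for controlled output lengths: generation must not stop
        # early at an EOS token
        "ignore_eos": True,
    },
}

# The old (wrong) shape nested everything under an inference_perf: wrapper.
assert "inference_perf" not in workload_config
```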

Documentation gaps fixed

  1. Added moe_layer_freq config field; generalized L_moe formula
  2. Clarified ~57× KV compression ratio uses MLA effective head dim (128), not d/H (56)
  3. Added W_DKV to DeepSeek-V3 F_pf_compute in architecture matrix
  4. Added dense_ffn (first_k_dense layers) to DeepSeek-V3 F_weight_static
  5. Fixed d_ff description: "Dense FFN" only (shared experts use d_ff_expert)
  6. Fixed W4 Profile A concurrency (rate 32→128, output_len 16→4)
  7. Renamed batch_size → num_requests_per_step to avoid ambiguity with token counts

🤖 Generated with Claude Code

Feature design and training workload specs for physically-motivated
step time prediction: 7 features covering dense, MoE, and MLA architectures.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
susiejojo (Collaborator, Author) commented Mar 6, 2026

Intuition: what the formula does and why

The core idea

Every vLLM scheduler step does a mix of GPU compute, memory reads, and communication. The formula predicts step time by summing physically-motivated time estimates for each resource, each corrected by a learned scalar:

StepTime = β₁ · max(compute, bandwidth)_prefill
         + β₂ · max(compute, bandwidth)_decode
         + β₃ · weight_load_static
         + β₄ · weight_load_moe
         + β₅ · tp_allreduce
         + β₆ · num_requests_per_step
         + β₇

Why this works

Each feature (F_pf_compute, F_dc_kv, etc.) is calculated as "how long would this take at 100% hardware utilization?" — FLOPs divided by peak TFlops, or bytes divided by peak bandwidth. These are always optimistic (real hardware never hits peak), so the β coefficients correct for reality (β > 1 means "hardware runs at 1/β efficiency").

The max(compute, bandwidth) per phase is the roofline model: the GPU is bottlenecked by whichever resource saturates first. Prefill is typically compute-bound (large matrix multiplies); decode is typically memory-bound (reading KV cache for each token). The formula handles both regimes with the same structure.
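As a sketch, the prediction is just a β-weighted sum with a per-phase max. The feature values and βs below are made up for illustration; only the formula structure comes from the design doc:

```python
# Hedged sketch of the 7-feature roofline prediction. Each feature is an
# optimistic "time at 100% utilization" estimate in seconds; the betas
# correct each term for real hardware efficiency.

def predict_step_time(feats: dict, beta: list) -> float:
    """Sum of beta-corrected optimistic time estimates (seconds)."""
    x = [
        max(feats["pf_compute"], feats["pf_bandwidth"]),  # prefill roofline
        max(feats["dc_compute"], feats["dc_bandwidth"]),  # decode roofline
        feats["weight_static"],   # static (dense/attention) weight loading
        feats["weight_moe"],      # routed-expert weight loading
        feats["tp_allreduce"],    # tensor-parallel communication
        feats["num_requests"],    # per-request scheduler/CPU overhead
        1.0,                      # constant step overhead (beta_7)
    ]
    return sum(b * xi for b, xi in zip(beta, x))

# Example: a decode-only step where bandwidth dominates (illustrative numbers).
feats = {"pf_compute": 0.0, "pf_bandwidth": 0.0,
         "dc_compute": 0.8e-3, "dc_bandwidth": 2.4e-3,
         "weight_static": 1.1e-3, "weight_moe": 0.0,
         "tp_allreduce": 0.2e-3, "num_requests": 64}
beta = [1.3, 1.4, 1.2, 1.5, 2.0, 2e-6, 1e-4]
t = predict_step_time(feats, beta)
```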

How it generalizes across architectures

The formula is a sum of components that zero out when they don't apply:

| Component | Dense (Llama) | MoE (Mixtral) | MoE+MLA (DeepSeek-V3) |
| --- | --- | --- | --- |
| Prefill/decode compute+bandwidth | ✓ | ✓ | ✓ (with absorbed attention) |
| Static weight loading | attn + FFN | attn only | attn + shared experts |
| MoE weight loading | 0 | N_eff experts | N_eff experts |
| KV cache bandwidth | 2·kv_heads·d_h | same | compressed: kv_lora_rank + qk_rope_head_dim (~57× smaller) |

A single set of 7 βs trained on mixed-architecture data captures all three families. The physics features encode what's different about each architecture; the βs encode how efficient the hardware is at each type of work.

The key insight vs. a blackbox model

A blackbox model (e.g., StepTime ~ f(num_prefill, num_decode, num_requests)) learns correlations specific to one model on one GPU. The roofline features inject causal structure: if you double the hidden dimension, the formula knows compute doubles and weight loading doubles — without retraining. The βs only need re-fitting when the hardware changes (and even then, 2-scalar few-shot calibration often suffices).
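One consequence of this structure: because StepTime is linear in the βs, re-fitting them on new hardware is a plain least-squares problem over the 7-column feature matrix. A sketch on synthetic data (all values illustrative):

```python
# Hedged sketch: fitting the betas by ordinary least squares on synthetic
# measurements. Real training would use measured per-step features/times.
import numpy as np

rng = np.random.default_rng(0)
n_steps = 500
true_beta = np.array([1.3, 1.4, 1.2, 1.5, 2.0, 2e-6, 1e-4])

# X: one row per scheduler step, columns ordered as in the formula
# (prefill roofline, decode roofline, static weights, MoE weights,
#  allreduce, num_requests_per_step, constant 1 for beta_7).
X = np.abs(rng.normal(1e-3, 5e-4, size=(n_steps, 7)))
X[:, 5] = rng.integers(1, 256, size=n_steps)   # num_requests_per_step
X[:, 6] = 1.0                                  # bias column for beta_7
y = X @ true_beta + rng.normal(0, 1e-5, size=n_steps)  # "measured" step times

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```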

@susiejojo susiejojo marked this pull request as draft March 6, 2026 20:09
Features design doc:
- Fix empty batch degenerate case (F_weight_static nonzero at B=0)
- Add moe_layer_freq config field and generalize L_moe formula
- Clarify ~57x KV compression ratio uses effective head dim, not d/H
- Add W_DKV to DeepSeek-V3 F_pf_compute in architecture matrix
- Add dense_ffn (first_k_dense layers) to DeepSeek-V3 F_weight_static
- Fix d_ff description: "Dense FFN" only (shared experts use d_ff_expert)
- Rename batch_size → num_requests_per_step to avoid token count ambiguity

Training workloads doc:
- Fix YAML schema to match actual inference-perf config format
  (top-level load/data/api/server keys, not inference_perf wrapper)
- Add server.ignore_eos: true to all configs
- Fix W2 step count estimate (~3K → ~180K)
- Fix W4 Profile A concurrency (rate 32→128, output_len 16→4)
- Add data processing note about filtering empty steps
- Clarify rate→batch_size mapping is indirect and model-dependent

Co-Authored-By: Claude Opus 4.6 <[email protected]>
susiejojo (Collaborator, Author)

Test plan review results

Ran 5 parallel verification agents against the design documents, cross-referencing:

  • vLLM v0.8+ source code (deepseek_v2.py, llama.py, fused_moe/layer.py)
  • DeepSeek-V2 paper (arXiv 2405.04434) and DeepSeek-V3 paper (arXiv 2412.19437)
  • DeepSeek-V3 config.json on HuggingFace
  • inference-perf repo config schema and examples
  • Published roofline literature (Pope et al. 2022, DuetServe, Vidur, JAX Scaling Book)

Verification scope

| Agent | Focus | Verdict |
| --- | --- | --- |
| 1 | Standard MHA/GQA + dense FFN formulas vs vLLM | ✅ All correct |
| 2 | MoE formulas (routed/shared experts, N_eff) vs vLLM | ✅ All correct |
| 3 | MLA absorbed attention math vs DeepSeek papers | ✅ All correct |
| 4 | Workload profiles vs inference-perf config | ❌ YAML schema wrong |
| 5 | Numerical worked examples + architecture matrix | ✅ Arithmetic correct, 4 doc gaps |

Key findings

All physics/math is correct:

  • Every FLOP formula, bandwidth formula, and weight loading formula verified
  • All 7 worked example values match exact arithmetic
  • All DeepSeek-V3 config parameters confirmed against HuggingFace
  • MLA absorption mechanics verified against DeepSeek-V2 paper and vLLM implementation
  • ~57× KV compression ratio confirmed (32,768 / 576 = 56.9×)
  • Only approximation: Q projection as d² (~5% overestimate, explicitly documented as S8)
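The ~57× figure is straightforward to re-derive from the config values cited above (kv_lora_rank 512, qk_rope_head_dim 64, and an MHA-equivalent KV of 128 heads × 128 head dim); treat the specific numbers here as assumptions matching the published DeepSeek-V3 config.json:

```python
# Quick check of the ~57x MLA KV compression ratio (numbers assumed to
# match DeepSeek-V3's published config.json).
num_heads = 128          # attention heads
head_dim = 128           # per-head dimension (MHA-equivalent KV)
kv_lora_rank = 512       # compressed latent KV dimension
qk_rope_head_dim = 64    # decoupled RoPE key dim, cached per token

# Elements per token per layer: K + V for full MHA vs the MLA cache.
uncompressed = 2 * num_heads * head_dim          # 32768
compressed = kv_lora_rank + qk_rope_head_dim     # 576

ratio = uncompressed / compressed                # ~56.9
```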

11 issues found and fixed in f768782:

| # | Severity | Fix |
| --- | --- | --- |
| 1 | High | YAML configs → actual inference-perf schema |
| 2 | Medium | Empty batch: β₃·F_weight_static + β₄·F_weight_moe + β₇ (not just β₇) |
| 3 | Medium | Added server.ignore_eos: true to all configs |
| 4 | Low | W2 step count: ~3K → ~180K |
| 5 | Low | Added moe_layer_freq config field |
| 6 | Low | Clarified ~57× uses effective head dim 128, not d/H=56 |
| 7 | Low | Added W_DKV to DeepSeek-V3 F_pf_compute in matrix |
| 8 | Low | Added dense_ffn to DeepSeek-V3 F_weight_static in matrix |
| 9 | Low | d_ff → "Dense FFN" only |
| 10 | Low | W4-A: rate 32→128, output_len 16→4 for sufficient concurrency |
| 11 | Low | batch_size → num_requests_per_step |

susiejojo and others added 2 commits March 9, 2026 11:01
220 runs across 10 (model, TP) combos × 22 sweeps per combo.
Links to workload specs in corrected-roofline-training-workloads.md.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…rkloads

- Data collection plan: 99 runs (90 training + 9 validation) across
  6 training combos + 3 validation models, ~3.3 hours sequential
- Active learning pipeline: iterative diagnose→target→refit loop
  with β health checks, residual analysis, and event-driven triggers
- Workloads: reduced to 15 runs/combo (4 W1 + 4 W2 + 5 W3 + 2 W4),
  added V1 validation sweep, renamed title to include validation
- Renamed batch_size → num_requests_per_step throughout

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@susiejojo susiejojo changed the title Add corrected roofline model design plans Add corrected latency model design plans Mar 9, 2026
susiejojo and others added 2 commits March 9, 2026 14:11
Saturated steps (KV cache >95%, preemption events) should be filtered
from β training data since they conflate GPU physics with scheduler
policy. α fitting data section documents that journey tracing events
from the same runs are used for queueing/preprocessing/output delay
coefficients.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
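A sketch of that saturated-step filter, with hypothetical per-step field names (the real trace schema may differ):

```python
# Hedged sketch: drop steps where KV cache utilization exceeds 95% or a
# preemption occurred, since those conflate GPU physics with scheduler
# policy. Field names ("kv_cache_util", "num_preempted") are illustrative.
steps = [
    {"step_time": 0.004, "kv_cache_util": 0.62, "num_preempted": 0},
    {"step_time": 0.019, "kv_cache_util": 0.97, "num_preempted": 2},  # saturated
    {"step_time": 0.005, "kv_cache_util": 0.88, "num_preempted": 0},
]

def usable_for_beta_fit(step: dict) -> bool:
    return step["kv_cache_util"] <= 0.95 and step["num_preempted"] == 0

beta_training = [s for s in steps if usable_for_beta_fit(s)]
```

The saturated steps are not discarded outright; as the next commit notes, they feed the α (queueing/delay) fit instead.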
High-rate steps aren't "collected then filtered" — they serve α fitting
(queueing delays need saturation to have signal). Low-rate steps serve
β fitting. The rate sweep is a dual-purpose design where each end of
the range targets different model parameters.

Co-Authored-By: Claude Opus 4.6 <[email protected]>