# Add corrected latency model design plans #2

susiejojo wants to merge 6 commits into `inference-sim:main`
Conversation
Feature design and training workload specs for physically-motivated step time prediction: 7 features covering dense, MoE, and MLA architectures. Co-Authored-By: Claude Opus 4.6 <[email protected]>
### Intuition: what the formula does and why

**The core idea.** Every vLLM scheduler step does a mix of GPU compute, memory reads, and communication. The formula predicts step time by summing physically-motivated time estimates for each resource, each corrected by a learned scalar.

**Why this works.** Each feature (…) The (…)

**How it generalizes across architectures.** The formula is a sum of components that zero out when they don't apply:
A single set of 7 βs trained on mixed-architecture data captures all three families. The physics features encode what's different about each architecture; the βs encode how efficient the hardware is at each type of work.

**The key insight vs. a blackbox model.** A blackbox model (e.g., …)
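The summed, β-corrected structure described above can be sketched as follows. This is a minimal illustration under assumptions, not the repo's actual code: the feature names and values below are hypothetical stand-ins for the design doc's 7 features.

```python
# Minimal sketch of a corrected-roofline step-time prediction:
# predicted step time = sum over resource components of
# (learned beta scalar) * (physics-based time estimate).
# Feature names here are illustrative, not the design doc's exact 7 features.

def predict_step_time(features: dict[str, float], betas: dict[str, float]) -> float:
    """Sum each physics-based time estimate, scaled by its learned beta.

    Components that don't apply to an architecture (e.g. MoE routing on a
    dense model) contribute a zero feature value, so one set of betas
    covers all architecture families.
    """
    return sum(betas[name] * features[name] for name in betas)

# Hypothetical dense-model step: the MoE feature is zero, so its beta is inert.
features = {
    "prefill_compute_time": 2.0e-3,   # seconds of estimated prefill FLOP time
    "kv_read_time": 5.0e-4,           # estimated KV-cache memory read time
    "weight_read_time": 3.0e-4,       # estimated weight memory read time
    "moe_routing_time": 0.0,          # zeroed out on a dense architecture
}
betas = {name: 1.1 for name in features}  # learned scalars, near 1.0 if the physics is accurate
print(predict_step_time(features, betas))
```

Because each term is an estimated *time*, a fitted β near 1.0 is a direct health signal that the corresponding physics feature is well calibrated.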
Features design doc:
- Fix empty batch degenerate case (F_weight_static nonzero at B=0)
- Add moe_layer_freq config field and generalize L_moe formula
- Clarify ~57x KV compression ratio uses effective head dim, not d/H
- Add W_DKV to DeepSeek-V3 F_pf_compute in architecture matrix
- Add dense_ffn (first_k_dense layers) to DeepSeek-V3 F_weight_static
- Fix d_ff description: "Dense FFN" only (shared experts use d_ff_expert)
- Rename batch_size → num_requests_per_step to avoid token count ambiguity

Training workloads doc:
- Fix YAML schema to match actual inference-perf config format (top-level load/data/api/server keys, not inference_perf wrapper)
- Add server.ignore_eos: true to all configs
- Fix W2 step count estimate (~3K → ~180K)
- Fix W4 Profile A concurrency (rate 32→128, output_len 16→4)
- Add data processing note about filtering empty steps
- Clarify rate→batch_size mapping is indirect and model-dependent

Co-Authored-By: Claude Opus 4.6 <[email protected]>
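The schema fix above can be illustrated with a config sketch. Only the top-level `load`/`data`/`api`/`server` layout and `server.ignore_eos: true` are taken from the commit message; the individual fields and values below are hypothetical.

```yaml
# Illustrative inference-perf config sketch: top-level load/data/api/server
# keys (not an inference_perf: wrapper), per the schema fix above.
# Field names and values below are hypothetical.
load:
  type: constant
  rate: 32            # hypothetical request rate
data:
  input_len: 512      # hypothetical controlled prompt length
  output_len: 128
api:
  type: completion
server:
  base_url: http://localhost:8000   # hypothetical endpoint
  ignore_eos: true    # forces controlled output lengths (from the commit)
```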
### Test plan review results

Ran 5 parallel verification agents against the design documents, cross-referencing:
**Verification scope**

**Key findings**

All physics/math is correct:
11 issues found and fixed in f768782:
220 runs across 10 (model, TP) combos × 22 sweeps per combo. Links to workload specs in corrected-roofline-training-workloads.md. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…rkloads

- Data collection plan: 99 runs (90 training + 9 validation) across 6 training combos + 3 validation models, ~3.3 hours sequential
- Active learning pipeline: iterative diagnose→target→refit loop with β health checks, residual analysis, and event-driven triggers
- Workloads: reduced to 15 runs/combo (4 W1 + 4 W2 + 5 W3 + 2 W4), added V1 validation sweep, renamed title to include validation
- Renamed batch_size → num_requests_per_step throughout

Co-Authored-By: Claude Opus 4.6 <[email protected]>
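The diagnose→target→refit loop described above can be sketched as a runnable toy. Every function here is a placeholder stub under assumptions (a 1-D fit with a true β of 1.2, a max-relative-residual health check), not the repo's pipeline.

```python
# Hypothetical, runnable toy of the diagnose -> target -> refit
# active-learning loop. All functions are illustrative stubs.

def fit_beta(data):
    """Stub 'refit': 1-D least squares through the origin on (feature, time) pairs."""
    num = sum(f * t for f, t in data)
    den = sum(f * f for f, _ in data)
    return num / den

def diagnose(beta, data, tol=0.08):
    """Stub 'beta health check': is the worst relative residual under tolerance?"""
    worst = max(abs(t - beta * f) / t for f, t in data)
    return worst < tol

def collect_targeted_runs(n):
    """Stub 'targeted collection': clean (feature, time) pairs with true beta 1.2."""
    return [(float(f), 1.2 * f) for f in range(1, n + 1)]

def active_learning_loop(data, max_iters=5):
    beta = fit_beta(data)                     # initial fit on seed data
    for _ in range(max_iters):
        if diagnose(beta, data):              # diagnose: residual analysis
            break
        data += collect_targeted_runs(4)      # target: sweep the weak region
        beta = fit_beta(data)                 # refit on the enlarged dataset
    return beta

seed = [(1.0, 1.3), (2.0, 2.3)]  # noisy seed data
beta = active_learning_loop(seed)
```

With this seed, the first fit fails the health check, one targeted collection round is triggered, and the refit passes; β ends near the true 1.2.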
Saturated steps (KV cache >95% full, or steps with preemption events) should be filtered from β training data, since they conflate GPU physics with scheduler policy. The α fitting data section documents that journey-tracing events from the same runs are used for the queueing/preprocessing/output delay coefficients.
High-rate steps aren't "collected then filtered" — they serve α fitting (queueing delays need saturation to have signal). Low-rate steps serve β fitting. The rate sweep is a dual-purpose design where each end of the range targets different model parameters. Co-Authored-By: Claude Opus 4.6 <[email protected]>
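The dual-purpose routing described in the last two comments could be sketched like this. It is a hypothetical illustration: the step field names are assumptions, with only the KV-cache >95% / preemption filter rule taken from the comments above.

```python
# Hypothetical sketch: route each logged scheduler step to the alpha or beta
# fitting dataset. Field names are illustrative, not the repo's actual schema.

def route_steps(steps: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split steps into (beta_data, alpha_data).

    - Beta fitting (GPU step-time physics) needs unsaturated steps, so steps
      with KV cache >95% full or any preemption are excluded.
    - Alpha fitting (queueing/preprocessing/output delays) needs saturation
      to have signal, so saturated high-rate steps are exactly what it wants.
    """
    beta_data, alpha_data = [], []
    for step in steps:
        saturated = step["kv_cache_util"] > 0.95 or step["num_preemptions"] > 0
        (alpha_data if saturated else beta_data).append(step)
    return beta_data, alpha_data

steps = [
    {"kv_cache_util": 0.40, "num_preemptions": 0},  # low-rate, clean -> beta
    {"kv_cache_util": 0.97, "num_preemptions": 0},  # saturated -> alpha
    {"kv_cache_util": 0.60, "num_preemptions": 2},  # preempted -> alpha
]
beta_data, alpha_data = route_steps(steps)
```

Nothing is discarded: each end of the rate sweep feeds the parameters it has signal for.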
### Summary
- Features design (`corrected-roofline-features-design.md`): 7-feature physically-motivated step time prediction formula covering dense (Llama), MoE (Mixtral), and MoE+MLA (DeepSeek-V2/V3) architectures. 10 total learnable parameters (7 beta + 3 alpha).
- Training workloads (`corrected-roofline-training-workloads.md`): 5 workload profiles (prefill sweep, decode context sweep, batch scaling, prefill/decode ratio mix, TP scaling) designed for beta identifiability. ~180K+ steps across 4 model families.

### Test plan
### Review results
All formulas verified correct against vLLM source, DeepSeek-V2/V3 papers, and published roofline literature (Pope et al., DuetServe, Vidur). 11 issues found and fixed in f768782:

**Bugs fixed**
- YAML schema used a nonexistent `inference_perf:` wrapper; fixed to the actual inference-perf format (`load:`/`data:`/`api:`/`server:` top-level keys)
- `ignore_eos: true` added to all workload configs (critical for controlled output lengths)

**Documentation gaps fixed**
- Added `moe_layer_freq` config field; generalized the L_moe formula
- Fixed `d_ff` description: "Dense FFN" only (shared experts use `d_ff_expert`)
- Renamed `batch_size` → `num_requests_per_step` to avoid ambiguity with token counts

🤖 Generated with Claude Code