feat(latency): add trained-roofline backend with roofline basis functions × learned corrections by sriumcp · Pull Request #616 · inference-sim/inference-sim

sriumcp · 2026-03-11T03:19:34Z

Summary

Adds a 4th latency model backend (--latency-model trained-roofline) that applies learned correction factors (β₁-β₇) to analytical roofline basis functions, fitted from 137K real vLLM requests across 4 architectures via NNLS
Adds PostDecodeFixedOverhead() method to the LatencyModel interface for correct α₁ (post-decode fixed overhead) modeling — existing backends return 0 (backward compatible)
7% MAPE on GPU combined step time (test split) across Llama-2-7b, Llama-2-70b, Mixtral-8x7B, CodeLlama-34b
Zero heap allocations in StepTime (19ns/op, 0 allocs/op verified by testing.AllocsPerRun)
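A minimal sketch of how the interface extension could stay backward compatible. Only the `PostDecodeFixedOverhead() int64` signature and the α₀/α₁/α₂ mapping come from this PR; the type names `legacyBackend` and `trainedBackend`, the `StepTime` parameter list, and the other return types are assumptions made for illustration.

```go
package main

import "fmt"

// LatencyModel is a hypothetical four-method interface. Only
// PostDecodeFixedOverhead() int64 is taken from the PR text; the other
// three signatures are simplified assumptions.
type LatencyModel interface {
	StepTime(prefillTokens, decodeTokens int) int64
	QueueingTime() int64
	OutputTokenProcessingTime() int64
	PostDecodeFixedOverhead() int64
}

// legacyBackend stands in for an existing backend (blackbox, roofline,
// crossmodel). It returns 0 for the new method, so its results do not change.
type legacyBackend struct{}

func (legacyBackend) StepTime(prefillTokens, decodeTokens int) int64 { return 1 }
func (legacyBackend) QueueingTime() int64                            { return 0 }
func (legacyBackend) OutputTokenProcessingTime() int64               { return 0 }
func (legacyBackend) PostDecodeFixedOverhead() int64                 { return 0 }

// trainedBackend maps the three alpha coefficients onto the interface:
// alpha[0] = α₀ (queueing), alpha[1] = α₁ (post-decode), alpha[2] = α₂ (per token).
type trainedBackend struct{ alpha [3]float64 }

func (trainedBackend) StepTime(prefillTokens, decodeTokens int) int64 { return 1 }
func (m trainedBackend) QueueingTime() int64                          { return int64(m.alpha[0]) }
func (m trainedBackend) PostDecodeFixedOverhead() int64               { return int64(m.alpha[1]) }
func (m trainedBackend) OutputTokenProcessingTime() int64             { return int64(m.alpha[2]) }

func main() {
	models := []LatencyModel{legacyBackend{}, trainedBackend{alpha: [3]float64{100, 200, 3}}}
	for _, m := range models {
		fmt.Printf("%T post-decode overhead: %d\n", m, m.PostDecodeFixedOverhead())
	}
}
```

Returning 0 from the existing backends means callers can add the overhead unconditionally without changing results for those backends.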

Behavioral Contracts

BC-1: IsValidLatencyBackend("trained-roofline") returns true
BC-2: NewLatencyModel with Backend="trained-roofline" returns valid model
BC-3: StepTime = β₁·max(T_pf_compute, T_pf_kv) + β₂·max(T_dc_compute, T_dc_kv) + β₃·T_weight + β₄·T_tp + β₅·L + β₆·batchSize + β₇
BC-4/5: Prefill/decode monotonicity (more tokens → non-decreasing step time)
BC-6: StepTime ≥ 1 for all inputs (INV-3)
BC-7: QueueingTime = α₀ (constant API processing overhead)
BC-8: OutputTokenProcessingTime = α₂ (per-token detokenization)
BC-9: MoE weight loading uses min(N, max(k, B·k)) effective experts
BC-10: Coefficients loaded from trained_roofline_defaults in defaults.yaml
BC-11: No MFU scaling (β₁/β₂ ARE the corrections)
BC-12: Existing backends byte-identical (backward compatible)
BC-13/14: Coefficient length and config validation with descriptive errors
BC-15: PostDecodeFixedOverhead = α₁ (fixed per-request post-decode overhead)
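The BC-3 sum, the BC-6 floor, and the BC-9 expert count can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the `basis` struct and its field names are hypothetical, the meaning of `L` is assumed to be the layer count, and truncation toward zero when converting to `int64` is an assumption. The sketch also omits the overflow guard tracked in #614.

```go
package main

import (
	"fmt"
	"math"
)

// basis holds the basis-function values for one step. Field names are
// hypothetical; they mirror the symbols used in BC-3.
type basis struct {
	TPfCompute, TPfKV float64 // prefill compute-bound and KV-bound times
	TDcCompute, TDcKV float64 // decode compute-bound and KV-bound times
	TWeight           float64 // weight-loading time
	TTP               float64 // tensor-parallel communication time
	L                 float64 // the L term of BC-3 (assumed: layer count)
	BatchSize         float64
}

// stepTime evaluates the BC-3 sum, with beta[0..6] standing for β₁..β₇.
func stepTime(beta [7]float64, b basis) int64 {
	t := beta[0]*math.Max(b.TPfCompute, b.TPfKV) +
		beta[1]*math.Max(b.TDcCompute, b.TDcKV) +
		beta[2]*b.TWeight +
		beta[3]*b.TTP +
		beta[4]*b.L +
		beta[5]*b.BatchSize +
		beta[6]
	// BC-6: never return less than 1. The negated comparison also catches NaN.
	if !(t >= 1) {
		return 1
	}
	return int64(t) // truncation is an assumption of this sketch
}

// effectiveExperts evaluates the BC-9 expression min(N, max(k, B·k)) for
// N total experts, k active experts per token, and batch size B.
func effectiveExperts(n, k, b int) int {
	touched := b * k
	if touched < k {
		touched = k
	}
	if touched > n {
		touched = n
	}
	return touched
}

func main() {
	beta := [7]float64{1, 1, 1, 1, 1, 1, 1}
	b := basis{TPfCompute: 10, TPfKV: 4, TDcCompute: 3, TDcKV: 7, TWeight: 5, L: 2, BatchSize: 3}
	fmt.Println(stepTime(beta, b))
	fmt.Println(effectiveExperts(8, 2, 3))
}
```

With all coefficients set to 1, the example step evaluates to 10 + 7 + 5 + 0 + 2 + 3 + 1 = 28, and a batch of 3 tokens with top-2 routing over 8 experts touches at most 6 experts.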

Test plan

go test ./... — all 9 packages pass
golangci-lint run ./... — 0 issues
Feature fidelity: all 6 basis functions verified term-by-term against training/basis_functions.py
Coefficient values match training/output/fit/coefficients.json to full float64 precision
Zero-allocation enforcement: TestTrainedRoofline_StepTime_ZeroAllocs via testing.AllocsPerRun
Benchmark: 19ns/op, 0 allocs/op
Plan convergence: 4 rounds
Code convergence: 2 rounds (10 perspectives each)
Pre-commit self-audit: all 10 dimensions
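The zero-allocation check named in the test plan relies on `testing.AllocsPerRun` from the Go standard library. The sketch below shows the general pattern on a stand-in function; the `model` type and its `stepTime` method are hypothetical and much simpler than the real backend. In a repository this would live in a `_test.go` file; it is called from `main` here only to keep the example self-contained.

```go
package main

import (
	"fmt"
	"testing"
)

// model is a hypothetical stand-in holding fixed-size coefficient storage,
// so evaluating a step needs no slices, maps, or interface boxing.
type model struct{ beta [7]float64 }

// stepTime does pure arithmetic on values already stored in the struct.
func (m *model) stepTime(prefillTokens, decodeTokens int) int64 {
	t := m.beta[0]*float64(prefillTokens) + m.beta[1]*float64(decodeTokens) + m.beta[6]
	if !(t >= 1) {
		return 1
	}
	return int64(t)
}

// sink keeps the compiler from discarding the call under test.
var sink int64

// stepTimeAllocs reports the average number of heap allocations per call.
func stepTimeAllocs() float64 {
	m := &model{beta: [7]float64{0.5, 0.25, 0, 0, 0, 0, 10}}
	return testing.AllocsPerRun(1000, func() {
		sink = m.stepTime(512, 8)
	})
}

func main() {
	fmt.Println("allocs per call:", stepTimeAllocs())
}
```

A test would typically fail when the returned value is nonzero, which turns an accidental allocation on the hot path into a test failure rather than a benchmark regression.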

Discovered Issues

trained-roofline: add TP>1 and GQA factory-path tests #610 (enhancement)
trained-roofline: implement T_tp basis function when β₄ becomes nonzero #611 (enhancement)
trained-roofline: MoEExpertFFNDim mismatch for Qwen2-MoE-style models #612 (design)
trained-roofline: add full-formula regression anchor test (BC-3) #613 (enhancement)
latency: add int64 overflow guard in StepTime for extreme hardware configs #614 (hardening, pre-existing in all backends)
cli: add NaN/Inf validation for --alpha-coeffs and --beta-coeffs CLI flags #615 (hardening, pre-existing in all backends)

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…rainedRooflineLatencyModel (BC-3,6,7,8,9,11,15)

- Add PostDecodeFixedOverhead() int64 to LatencyModel interface
- Existing backends (blackbox, roofline, crossmodel) return 0
- Simulator recordRequestCompletion adds PostDecodeFixedOverhead to E2E
- TrainedRooflineLatencyModel: 6 roofline basis functions with learned corrections
- Zero heap allocations in StepTime (19ns/op, 0 allocs/op)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…,13,14)

- Full validation: TP, NumLayers, NumHeads, HiddenDim, IntermediateDim, TFlopsPeak, BwPeakTBs, NumHeads%TP, NumKVHeads%TP, 7 beta coefficients
- Derives architecture features at construction: headDim, dKV, dFF, kEff
- Table-driven error tests for all validation paths

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
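The construction-time validation listed above could look roughly like this. It is a sketch under assumptions: the `config` struct, the `validate` function, and every error message are hypothetical; only the list of checked fields and the requirement of exactly 7 beta coefficients come from the commit message.

```go
package main

import "fmt"

// config is a hypothetical container for the validated fields.
type config struct {
	TP, NumLayers, NumHeads, NumKVHeads int
	HiddenDim, IntermediateDim          int
	TFlopsPeak, BwPeakTBs               float64
	Beta                                []float64
}

// validate returns a descriptive error for the first failed check.
// Cases are evaluated top to bottom, so TP > 0 is established before
// the divisibility checks use it as a divisor.
func validate(c config) error {
	switch {
	case c.TP <= 0:
		return fmt.Errorf("trained-roofline: TP must be > 0, got %d", c.TP)
	case c.NumLayers <= 0:
		return fmt.Errorf("trained-roofline: NumLayers must be > 0, got %d", c.NumLayers)
	case c.NumHeads <= 0:
		return fmt.Errorf("trained-roofline: NumHeads must be > 0, got %d", c.NumHeads)
	case c.HiddenDim <= 0:
		return fmt.Errorf("trained-roofline: HiddenDim must be > 0, got %d", c.HiddenDim)
	case c.IntermediateDim <= 0:
		return fmt.Errorf("trained-roofline: IntermediateDim must be > 0, got %d", c.IntermediateDim)
	case !(c.TFlopsPeak > 0): // negated form also rejects NaN
		return fmt.Errorf("trained-roofline: TFlopsPeak must be > 0, got %v", c.TFlopsPeak)
	case !(c.BwPeakTBs > 0):
		return fmt.Errorf("trained-roofline: BwPeakTBs must be > 0, got %v", c.BwPeakTBs)
	case c.NumHeads%c.TP != 0:
		return fmt.Errorf("trained-roofline: NumHeads (%d) not divisible by TP (%d)", c.NumHeads, c.TP)
	case c.NumKVHeads%c.TP != 0:
		return fmt.Errorf("trained-roofline: NumKVHeads (%d) not divisible by TP (%d)", c.NumKVHeads, c.TP)
	case len(c.Beta) != 7:
		return fmt.Errorf("trained-roofline: need 7 beta coefficients, got %d", len(c.Beta))
	}
	return nil
}

func main() {
	c := config{TP: 4, NumLayers: 32, NumHeads: 30, NumKVHeads: 8,
		HiddenDim: 4096, IntermediateDim: 11008,
		TFlopsPeak: 989, BwPeakTBs: 3.35, Beta: make([]float64, 7)}
	fmt.Println(validate(c))
}
```

Returning on the first failure keeps each message specific, which suits table-driven tests that pair one bad field with one expected error.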

… (BC-4,5) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

- Add trained_roofline_defaults section to defaults.yaml with 7 betas + 3 alphas
- Add TrainedRooflineDefaults struct to cmd/default_config.go
- CLI handling: 4 sites in cmd/root.go (loading block, zero-coefficients guard, HFConfig parsing, help text)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

- CLAUDE.md: "Four modes", file tree, Key Data Flow
- sim/config.go: Backend field comment
- sim/latency/latency.go: package doc
- docs/concepts/core-engine.md: "four latency model backends"
- docs/concepts/glossary.md: "Four modes" + trained-roofline description
- Plan committed alongside implementation

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…x ITL contamination

- PostDecodeFixedOverhead only applied when len(OutputTokens) > 0
- RequestITLs computed from itlSum directly (not lat-FirstTokenTime) to avoid contaminating per-token average ITL with fixed overhead
- Add zero-alpha warning for trained-roofline CLI path

Caught by code review Step 4.5.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
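The two fixes in this commit can be illustrated with a small sketch. The function `recordCompletion`, its parameters, and the choice to average ITL over the number of output tokens are assumptions for illustration; the commit only states that the fixed overhead is skipped for zero-output requests and that ITL is derived from `itlSum` rather than from the end-to-end latency.

```go
package main

import "fmt"

// completion is a hypothetical result record.
type completion struct {
	E2E     int64
	MeanITL float64
}

// recordCompletion adds the fixed post-decode overhead to E2E only when the
// request produced output, and derives the mean inter-token latency from
// itlSum alone so the fixed overhead never leaks into the per-token average.
func recordCompletion(baseLatency, itlSum int64, numOutputTokens int, postDecodeOverhead int64) completion {
	c := completion{E2E: baseLatency}
	if numOutputTokens > 0 {
		c.E2E += postDecodeOverhead
		// Divisor is an assumption of this sketch.
		c.MeanITL = float64(itlSum) / float64(numOutputTokens)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", recordCompletion(1000, 300, 3, 50))
	fmt.Printf("%+v\n", recordCompletion(400, 0, 0, 50))
}
```

Had the ITL been computed from E2E minus first-token time, the 50-unit overhead in the first call would have been spread across the three tokens and inflated the average.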

… guide

- Add trained-roofline section with formula, alpha model, accuracy caveats
- Update comparison table to 4 backends
- Update recommendation: trained-roofline is now the default for new models
- Update pluggable architecture to show 4 interface methods
- Fix cross-model description accuracy

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…lloc test, config ref

- Extension recipe: 3→4 methods, added bundle.go + CLI wiring touch points, added trained-roofline as 4th example
- Factory: defensive copy of beta/alpha slices to enforce "frozen" contract
- Test: add TestTrainedRoofline_StepTime_ZeroAllocs using testing.AllocsPerRun
- Configuration reference: add trained-roofline to --latency-model flag description

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
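The defensive copy mentioned for the factory is a common Go idiom: because slices share their backing array, a constructor that stores the caller's slice directly can have its coefficients changed later from outside. A minimal sketch, with the hypothetical names `frozenModel` and `newFrozenModel`:

```go
package main

import "fmt"

// frozenModel is a hypothetical model whose coefficients must not change
// after construction.
type frozenModel struct {
	beta  []float64
	alpha []float64
}

// newFrozenModel copies both slices so later writes by the caller cannot
// reach the model's private state.
func newFrozenModel(beta, alpha []float64) (*frozenModel, error) {
	if len(beta) != 7 {
		return nil, fmt.Errorf("need 7 beta coefficients, got %d", len(beta))
	}
	if len(alpha) != 3 {
		return nil, fmt.Errorf("need 3 alpha coefficients, got %d", len(alpha))
	}
	return &frozenModel{
		beta:  append([]float64(nil), beta...),
		alpha: append([]float64(nil), alpha...),
	}, nil
}

func main() {
	beta := []float64{1, 2, 3, 4, 5, 6, 7}
	alpha := []float64{10, 20, 30}
	m, err := newFrozenModel(beta, alpha)
	if err != nil {
		panic(err)
	}
	beta[0] = 999 // the caller mutates its own slice after construction
	fmt.Println(m.beta[0], beta[0])
}
```

The copy costs one allocation per slice at construction time and leaves the per-step path free of allocations.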

…head pattern

- Quickstart: add trained-roofline example (recommended for new models)
- recordRequestCompletion: document that E2E includes non-blocking PostDecodeFixedOverhead and OutputTokenProcessingTime, explaining why RequestCompletionTimes exceeds RequestLeftEvent timestamp

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…ined-roofline All documentation working copies now mention trained-roofline consistently. Source-of-truth map: 12/12 working copies updated. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Merges 11 commits from main into pd, including:

- trained-roofline latency backend (inference-sim#616)
- MaxOutputLen population + engine auto-fill (inference-sim#621)
- MaxModelLen int64 + rope_scaling extraction (inference-sim#606)
- MoE-aware roofline latency (inference-sim#561)
- MaxModelLen enforcement + oracle knowledge boundary (inference-sim#579, inference-sim#587)
- Dead HardwareCalib fields removal (inference-sim#596)
- Default example model switch to Qwen3-14B (inference-sim#608)
- CI/CD updates (inference-sim#600, inference-sim#601, inference-sim#607)

Conflict resolutions:

- CLAUDE.md: kept both INV-9 (main) and INV-PD-* (pd) invariants
- cmd/root.go: merged LengthCappedRequests counter + DroppedKVAllocations
- sim/bundle.go: added trained-roofline backend + kept disaggregation deciders
- sim/cluster/metrics.go: added LengthCappedRequests field
- docs: merged invariants and results documentation from both branches
- sim/cluster/disaggregation_test.go: added MaxModelLen param to NewModelHardwareConfig calls
- sim/cluster/cluster_event.go: rewrote comment to avoid INV-9 test false positive

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* feat(sim): MaxModelLen enforcement and MaxOutputLen budget (#567) (#579) Add vLLM-equivalent max_model_len enforcement at three layers: 1. Startup validation: ceil(MaxModelLen/BlockSize) <= TotalKVBlocks 2. Enqueue guard: input >= MaxModelLen rejected (matching vLLM serving.py:1542); input + MaxOutputLen > MaxModelLen rejected when client declares budget 3. Runtime stop: force-complete at ProgressIndex >= MaxModelLen (defense-in-depth) Key design decisions: - Oracle Knowledge Boundary (INV-9): control plane never reads OutputTokens. Uses MaxOutputLen (client budget) or input-only check. Runtime stop handles output growth. Verified by behavioral + structural grep tests. - Auto-derive from HF max_position_embeddings for roofline/crossmodel backends, with rope_scaling blacklist (excludes su/longrope/llama3 per vLLM), yarn special-case using original_max_position_embeddings, and KV-feasible capping. - Overflow-safe ceiling division in startup validation (R11). - R3 validation at CLI (logrus.Fatalf) and constructor (panic). New tests (12): BC-1 through BC-5, BC-7 conservation with drops, boundary tests (input==MaxModelLen, exact fit), R3 constructor panic, INV-9 structural enforcement. Partially addresses #529 (reasoning workload livelock) for roofline/crossmodel. Blackbox gap tracked in #578. Closes: #567 Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * feat(latency): MoE-aware roofline latency model (#559) (#561) * feat(sim): add MoEExpertFFNDim and SharedExpertFFNDim to ModelConfig Two new fields for MoE-aware roofline: per-routed-expert FFN dimension and total shared-expert FFN dimension. Both default to 0 (dense model). Zero-value safe for all existing construction sites (R4 audit: all dense model configs use zero-valued MoE fields). 
Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): parse MoE per-expert and shared-expert dims from HF config Extends GetModelConfigFromHF to parse moe_intermediate_size, shared_expert_intermediate_size, and n_shared_experts. Expert count resolution chain extended to include num_routed_experts (DeepSeek-V3). Implements BC-15 through BC-18 from the MoE roofline design. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): add MoE consistency validation to ValidateRooflineConfig Validates: experts>0 requires active>0, active<=total, non-negative MoE dimensions. Catches inconsistent MoE configs at construction time. Implements BC-12, BC-13, BC-14 from MoE roofline design. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(sim): address convergence review findings (I-1, I-2) I-1: Align SharedExpertFFNDim JSON tag to shared_expert_intermediate_size (matches HF config field name convention, consistent with other tags). I-2: Add negative NumLocalExperts validation in ValidateRooflineConfig (R3 compliance — all numeric parameters validated). Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): MoE-aware FLOPs, active weight bandwidth, and smoke tests MoE FLOPs (Task 4): calculateTransformerFlops now computes routed (top_k), shared, and gate MLP FLOPs for MoE models. Dense models use unchanged code path (NumLocalExperts=0 guard). Active weights (Task 5): calculateMemoryAccessBytes uses top_k (active experts) for per-step weight bandwidth, matching vLLM's fused_moe kernel behavior. Includes shared expert and gate weights. Smoke tests (Task 7): Mixtral-8x7B and DeepSeek-V3 step time smoke tests plus dense regression anchor (TP=1=12151µs, TP=2=6820µs). Implements BC-1 through BC-6, BC-10 from MoE roofline design. 
Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): use per-expert FFN dim for MoE KV capacity weight estimation Fixes the critical bug where DeepSeek-V3's general intermediate_size (18432) was used as per-expert dim (should be 2048), overestimating MLP weights by ~9× and returning zero usable KV blocks. Changes: - KVCapacityParams gains MoEExpertFFNDim and SharedExpertFFNDim fields - NewKVCapacityParams gains 2 new positional args (R4 enforced) - computeModelWeightBytes uses per-expert dim when nonzero, falls back to IntermediateDim (Mixtral convention) - ExtractKVCapacityParams propagates new fields, extends expert count chain to include num_routed_experts (parity with GetModelConfigFromHF) Implements BC-7 (per-expert dim fix), BC-9 (param cross-validation), BC-11 (dense unchanged). Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): convergence review round 2 — R23 parity, documentation, R15 I-1: Align expert count resolution threshold between GetModelConfigFromHF and ExtractKVCapacityParams. Both now use >1 threshold (single-expert models are dense-equivalent). Fixes R23 code path parity violation. I-2: Add precondition comments to calculateTransformerFlops and calculateMemoryAccessBytes documenting ValidateRooflineConfig requirement. I-3: Document SharedExpertFFNDim "total dim" semantics — correct due to SwiGLU linearity (N × (3 × d × e) == 3 × d × (N × e)). I-4: Add R15 staleness notes to hardening-validation-cleanup-plan.md and pr2-kv-capacity-auto-calculate-plan.md (NewKVCapacityParams now 6-arg). I-5: Document active vs total weight distinction in calculateMemoryAccessBytes to prevent future R23 regression. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): align MoE threshold to > 1 across all consumption paths (R23) Parsing layer already used > 1 (single-expert models are dense-equivalent). 
Consumption paths (calculateTransformerFlops, calculateMemoryAccessBytes, crossmodel isMoE, ValidateRooflineConfig, computeModelWeightBytes) now use > 1 as well, matching the documented design intent and resolving the R23 code path parity violation. Also fixes stale doc comment in ExtractKVCapacityParams ("> 0" → "> 1"). Round 3 convergence review fixes. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(cmd): update stale MoE warning, gofmt alignment, R15 crossmodel plan - cmd/root.go: Replace misleading "assumes dense transformers" warning with accurate MoE info message (roofline now models per-expert FLOPs) - sim/model_hardware_config.go: Run gofmt to fix struct field alignment - docs/plans/pr472b-crossmodel-backend-plan.md: Add R15 staleness note for threshold change (> 0 → > 1) Round 4 convergence review fixes. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * refactor(latency): port llm-optimizer single-crossover roofline physics Replace dual-ceiling model (GEMM + vector ceilings) with single-crossover: step_time = max(total_flops / (peak * MFU), total_bytes / peak_bandwidth) Remove bandwidth haircut (BwEffConstant no longer used in step time). Remove all overhead terms (TOverheadMicros, PerLayerOverhead, AllReduceLatency). Keeps BLIS's superior model-awareness: actual IntermediateDim, SwiGLU 3-matrix MLP, MoE support, FlashAttention-aware memory model. Motivation: BLIS roofline has 215% ITL MAPE vs llm-optimizer's 36.5%. The dual ceiling + bandwidth haircut + overhead stacking caused ~3x systematic over-prediction for memory-bound decode steps. Design: docs/plans/2026-03-09-roofline-llm-optimizer-port-design.md Co-Authored-By: Claude Opus 4.6 <[email protected]> * config: update MFU values to llm-optimizer defaults (0.45/0.30) MfuPrefill: 0.65 → 0.45, MfuDecode: 0.12 → 0.30 for all GPU entries. These values match llm-optimizer's defaults which achieve 36.5% ITL MAPE on the sim-to-real evaluation (discussion #522). 
Other HardwareCalib fields (BwEffConstant, overheads) remain unchanged for backward compatibility — they are no longer used by rooflineStepTime() but may be consumed by other callers. Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: add roofline llm-optimizer port design and implementation plan Design doc: decision record for porting llm-optimizer's single-crossover roofline physics into BLIS. Implementation plan: 3 tasks (physics rewrite, MFU update, verification). Motivation: discussion #522 sim-to-real accuracy validation. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): load weights once per step in roofline (unified forward pass) vLLM chunked prefill processes all tokens (prefill + decode) in a single forward pass — weights are loaded from HBM once per step, not once per phase. The previous implementation loaded weights independently for prefill and decode phases, doubling the memory-bound term for mixed batches (~2x over-prediction). Sources: vLLM V1 blog ("all selected requests are flattened and concatenated into one long super-sequence for that single forward pass"), Sarathi-Serve OSDI'24 ("cost of loading model weights from HBM is amortized across all prompts in a batch"). Adds TestRooflineStepTime_MixedBatch_WeightsLoadedOnce which verifies the overhead of adding prefill to a decode step is much less than a full weight load (7µs vs 4166µs). Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): use 2-matrix MLP in roofline FLOPs and weight calculation Change MLP factor from 3 (SwiGLU gate+up+down) to 2 (up+down) in both calculateTransformerFlops and calculateMemoryAccessBytes, matching llm-optimizer's formulation. For models like Llama-2-70B where IntermediateDim=28672, the 3-matrix formula produced 31% more MLP weight bytes than llm-optimizer's 2-matrix formula, directly inflating memory-bound decode predictions. Applies to both dense and MoE paths (routed + shared expert FLOPs/weights). 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * config: bump MFU values to 0.55/0.35 to reduce roofline over-prediction MfuPrefill: 0.45 → 0.55 (reduces compute-bound prefill/TTFT predictions ~18%) MfuDecode: 0.30 → 0.35 (reduces near-crossover decode predictions ~14%) Motivation: after porting llm-optimizer single-crossover physics, BLIS roofline still over-predicts by ~50% MAPE. Higher MFU reflects observed H100 tensor core utilization for large prefill GEMMs and batched decode. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): restore SwiGLU 3-matrix MLP, revert MFU bump Revert MFU values to llm-optimizer defaults (0.45/0.30) — the bump to 0.55/0.35 went the wrong direction (both models under-predict). Restore 3-matrix MLP (gate + up + down) for SwiGLU, replacing the 2-matrix formula copied from llm-optimizer. SwiGLU actually has 3 weight matrices that all need HBM loading: this is the physically correct formula and increases weight bytes by ~37%, which reduces the under-prediction from ~50% toward the target. Dense and MoE paths both updated consistently (R23). Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): conditional SwiGLU detection via HiddenAct field Add mlpMatrixCount() helper that returns 3 for SwiGLU (silu/swiglu/geglu) or 2 for standard (gelu/relu) MLP. Parsed from HF config's hidden_act field. Empty defaults to SwiGLU since most modern LLMs use it. Both calculateTransformerFlops and calculateMemoryAccessBytes now use nMat instead of hardcoded 3, correctly handling non-SwiGLU models. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): revert to 2-matrix MLP convention matching llm-optimizer 3-matrix with raw intermediate_size over-predicts for models like Llama2-70B whose intermediate_size (28672) exceeds the standard SwiGLU (2/3 × 4d) convention. Using nMat=2 matches llm-optimizer's approach where 2 × d × intermediate ≈ physical weight count for most models. 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): remove MoE-specific branches from roofline step time Roofline now treats MoE models identically to dense (matching llm-optimizer which has no MoE-specific handling). MoE fields (NumLocalExperts, MoEExpertFFNDim, SharedExpertFFNDim) are still used by KV capacity (kv_capacity.go) for GPU memory budgeting. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): MoE roofline scales weights by E, FLOPs by top_k Mixtral was under-predicted by ~10x because the dense treatment loaded 1 expert's MLP weights instead of all 8. Fix: - Weight bandwidth: E × MLP weights (all experts loaded from HBM per step) - FLOPs: top_k × MLP FLOPs (only active experts compute per token) Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): use MoEExpertFFNDim in roofline when set For DeepSeek-V3 style models where intermediate_size (18432) differs from per-expert dim (2048), use MoEExpertFFNDim for MoE weight and FLOP calculations. Falls back to IntermediateDim when unset (Mixtral). 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): address PR #561 review — revert crossmodel scope, fix docs - Revert crossmodel MoE threshold from > 1 back to > 0 (scope violation: crossmodel behavioral change doesn't belong in a roofline PR) - Fix design doc table and CLI comment claiming roofline models shared experts and gate FLOPs (it doesn't — only KV capacity does) - Fix HiddenAct comments that incorrectly claim it selects 3-matrix vs 2-matrix MLP (mlpMatrixCount always returns 2) - Document intentional 2-matrix (roofline) vs 3-matrix (KV capacity) design choice with cross-references in both files Co-Authored-By: Claude Opus 4.6 <[email protected]> --------- Co-authored-by: Claude Opus 4.6 <[email protected]> Co-authored-by: Srinivasan Parthasarathy <[email protected]> * fix(sim): PR #567 follow-up — validation gaps, LengthCappedRequests counter, INV-9 extension (#580) (#587) - Add negative MaxModelLen validation in NewSimulator (BC-1: defense-in-depth for struct literal bypass) - Add LengthCappedRequests metric counter across 5-file pattern (BC-2, BC-3, BC-4) - Add end-to-end sim.Run() test for BC-5 runtime length cap path - Extend INV-9 structural test to scan sim/cluster/ control-plane files (BC-6) - Add negative MaxOutputLen validation in EnqueueRequest (BC-7: R3 gap) - Add gemma3 model_type exclusion for rope_scaling (BC-9: matches vLLM) - Add rope_scaling parse-failure warnings for malformed HF configs (BC-8) - Fix kvFeasibleMax comment accuracy (blockSizeTokens is configurable, not 16) Fixes #580 Co-authored-by: Claude <[email protected]> * refactor(sim): remove dead HardwareCalib fields — BwEffConstant, TOverheadMicros, PerLayerOverhead, AllReduceLatency (#596) These fields became dead code after the roofline physics port (llm-optimizer single-crossover model). No runtime code path reads them; ValidateRooflineConfig enforced BwEffConstant > 0 on a value nothing consumed. 
Removing them eliminates config-file clutter and prevents future contributors from assuming they're active. Fixes #590 Co-authored-by: Claude Opus 4.6 <[email protected]> * Configure claude on GH Actions (#600) Signed-off-by: Jing Chen <[email protected]> * Enable claude on PRs (#601) Signed-off-by: Jing Chen <[email protected]> * ignore training and actions runner (#607) Signed-off-by: Srinivasan Parthasarathy <[email protected]> * fix(sim): PR #580 deferred items — rope_scaling extraction, MaxModelLen int64, tests, docs (#606) Complete 7 deferred hardening items from issue #580 (PR #587 handoff): 1. Extract applyRopeScaling as a pure function with 26 table-driven test cases covering blacklist (su/longrope/llama3), mrope fall-through, gemma3 substring match (handles text_config pivot), yarn original base, overflow guards, NaN/Inf defense, degenerate inputs. 2. Change MaxModelLen from int to int64 for consistency with ProgressIndex, TotalKVBlocks, BlockSizeTokens. Updates 6 type sites, removes redundant int64() casts, adds int64() widening at EnqueueRequest comparison sites. 3. Add cluster-mode MaxModelLen drop test (BC-6): Guard 1a (input >= limit) and Guard 1b (input + budget > limit), INV-1 conservation, inFlightRequests drain, Metrics.Requests map cleanup. 4. Add chunked prefill + MaxModelLen interaction test (BC-7): verifies no spurious force-completion during multi-chunk prefill (TotalOutputTokens=49, LengthCappedRequests=0, TTFT recorded). 5. Add glossary entries for MaxModelLen and Oracle Knowledge Boundary (INV-9). 6. Refine rope_scaling documentation with explicit blacklist details. 7. Fix pre-existing gemma3 bug: ParseHFConfig's text_config pivot overwrites model_type from "gemma3" to "gemma3_text", making the exact-match check dead code. Changed to strings.Contains to match vLLM's substring semantics. Related to #580. Discovered issues: #602, #603, #604, #605. 
Co-authored-by: Claude <[email protected]> * feat(latency): add trained-roofline backend with roofline basis functions × learned corrections (#616) * feat(latency): register trained-roofline backend name (BC-1) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(latency): add PostDecodeFixedOverhead to interface + implement TrainedRooflineLatencyModel (BC-3,6,7,8,9,11,15) - Add PostDecodeFixedOverhead() int64 to LatencyModel interface - Existing backends (blackbox, roofline, crossmodel) return 0 - Simulator recordRequestCompletion adds PostDecodeFixedOverhead to E2E - TrainedRooflineLatencyModel: 6 roofline basis functions with learned corrections - Zero heap allocations in StepTime (19ns/op, 0 allocs/op) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(latency): wire trained-roofline factory in NewLatencyModel (BC-2,13,14) - Full validation: TP, NumLayers, NumHeads, HiddenDim, IntermediateDim, TFlopsPeak, BwPeakTBs, NumHeads%TP, NumKVHeads%TP, 7 beta coefficients - Derives architecture features at construction: headDim, dKV, dFF, kEff - Table-driven error tests for all validation paths Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * test(latency): add monotonicity behavioral tests for trained-roofline (BC-4,5) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(latency): add trained-roofline defaults + CLI loading (BC-10,12) - Add trained_roofline_defaults section to defaults.yaml with 7 betas + 3 alphas - Add TrainedRooflineDefaults struct to cmd/default_config.go - CLI handling: 4 sites in cmd/root.go (loading block, zero-coefficients guard, HFConfig parsing, help text) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: add trained-roofline to latency model documentation - CLAUDE.md: "Four modes", file tree, Key Data Flow - sim/config.go: Backend field comment - sim/latency/latency.go: package doc - docs/concepts/core-engine.md: "four latency model backends" - 
docs/concepts/glossary.md: "Four modes" + trained-roofline description - Plan committed alongside implementation Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(sim): guard PostDecodeFixedOverhead for zero-output requests + fix ITL contamination - PostDecodeFixedOverhead only applied when len(OutputTokens) > 0 - RequestITLs computed from itlSum directly (not lat-FirstTokenTime) to avoid contaminating per-token average ITL with fixed overhead - Add zero-alpha warning for trained-roofline CLI path Caught by code review Step 4.5. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs(guide): comprehensive trained-roofline section in latency models guide - Add trained-roofline section with formula, alpha model, accuracy caveats - Update comparison table to 4 backends - Update recommendation: trained-roofline is now the default for new models - Update pluggable architecture to show 4 interface methods - Fix cross-model description accuracy Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: convergence Round 1 fixes — extension recipe, slice copy, zero-alloc test, config ref - Extension recipe: 3→4 methods, added bundle.go + CLI wiring touch points, added trained-roofline as 4th example - Factory: defensive copy of beta/alpha slices to enforce "frozen" contract - Test: add TestTrainedRoofline_StepTime_ZeroAllocs using testing.AllocsPerRun - Configuration reference: add trained-roofline to --latency-model flag description Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: add trained-roofline to quickstart + document non-blocking overhead pattern - Quickstart: add trained-roofline example (recommended for new models) - recordRequestCompletion: document that E2E includes non-blocking PostDecodeFixedOverhead and OutputTokenProcessingTime, explaining why RequestCompletionTimes exceeds RequestLeftEvent timestamp Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: self-audit 
— update models.md, roofline.md, tutorial.md for trained-roofline All documentation working copies now mention trained-roofline consistently. Source-of-truth map: 12/12 working copies updated. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> --------- Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * feat(sim): populate MaxOutputLen on all workload paths + engine auto-fill (#621) * feat(sim): add MaxOutputLen auto-fill in EnqueueRequest (BC-1..BC-4) - Auto-fill MaxOutputLen = maxModelLen - len(InputTokens) when client omits budget (MaxOutputLen==0) and maxModelLen > 0 - Mirrors vLLM input_processor.py:554 safety cap - No auto-fill when client sets budget (BC-2), unlimited mode (BC-3), or input exceeds context (BC-4) Refs: #572 Co-Authored-By: Claude <[email protected]> * feat(workload): set MaxOutputLen on all request construction sites (BC-5..BC-7) - generator.go: MaxOutputLen = len(outputTokens) (synthetic/multimodal) - replay.go: MaxOutputLen = len(outputTokens) (trace v2 replay) - reasoning.go: MaxOutputLen = len(outputTokens) (multi-turn reasoning) - Matches inference-perf pattern: max_tokens = sampled output length Fixes #572 Co-Authored-By: Claude <[email protected]> * docs(sim): update EnqueueRequest doc comment for auto-fill preprocessing Co-Authored-By: Claude <[email protected]> * docs(test): update stale MaxOutputLen=0 comments for auto-fill semantics - Three tests referenced 'input-only check' for MaxOutputLen=0 - After auto-fill, MaxOutputLen is set to maxModelLen - input - Tests still pass numerically; comments now reflect actual behavior Co-Authored-By: Claude <[email protected]> --------- Co-authored-by: Claude <[email protected]> * docs: switch default example model to public Qwen/Qwen3-14B (#608) * docs: switch default example model to public qwen/qwen2.5-7b-instruct Replace gated meta-llama/llama-3.1-8b-instruct with publicly available qwen/qwen2.5-7b-instruct in all user-facing docs (README, quickstart, 
tutorial, guides, reference, CLAUDE.md, CONTRIBUTING.md). Roofline/crossmodel examples now work without HF authentication. Set qwen default TP=1 in defaults.yaml so examples use the default without explicit --tp flags. Update KV block count, coefficient examples, and prose references to match TP=1 values. Fixes #545 Co-Authored-By: Claude Opus 4.6 <[email protected]> * chore(defaults): update vllm version to v0.11.0 for 4 models (H100 TP=1) Update default and trained-coefficient vllm_version for qwen2.5-7b-instruct, qwen3-14b, llama-3.1-8b-instruct, and qwen2.5-3b-instruct to vllm/vllm-openai:v0.11.0. Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: switch default example model from qwen2.5-7b to qwen3-14b Qwen3-14B (Qwen/Qwen3-14B) is a newer, publicly available model with pre-trained coefficients already in defaults.yaml. Update all documentation examples and references accordingly. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix: address review comments — stale refs, tutorial throughput - Fix "LLaMA 3.1 8B" comment in experimentation.md (issue #3) - Update stale llama-3.1-8b/132,139 refs in configuration.md (issue #4) - Recalibrate tutorial for qwen3-14b throughput: ~2.5 req/s per instance, target 20 req/s (was 57 req/s / 500 req/s for llama) - Scale experimentation.md example to match (20 req/s, not 400) Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: add HF_TOKEN tip to quickstart and README for gated models Roofline/trained-roofline/crossmodel modes auto-fetch from HuggingFace, which fails for gated models without authentication. Add a lightweight tip after the first roofline example in both files recommending HF_TOKEN for gated model access and rate limit avoidance. 
Co-Authored-By: Claude Opus 4.6 <[email protected]> --------- Co-authored-by: Claude Opus 4.6 <[email protected]> --------- Signed-off-by: Jing Chen <[email protected]> Signed-off-by: Srinivasan Parthasarathy <[email protected]> Co-authored-by: Srinivasan Parthasarathy <[email protected]> Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> Co-authored-by: Dipanwita Guhathakurta <[email protected]> Co-authored-by: Jing Chen <[email protected]>
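The enqueue-time rules described in the merged MaxModelLen and MaxOutputLen commits can be summarized in a short sketch. The `request` type, the `admit` function, the error values, and the convention that a `maxModelLen` of 0 means unlimited mode are assumptions for illustration; the accept and reject conditions follow the commit messages.

```go
package main

import (
	"errors"
	"fmt"
)

// request is a hypothetical view of the fields the guard needs.
type request struct {
	InputTokens  []int
	MaxOutputLen int64 // 0 means the client declared no budget
}

var errTooLong = errors.New("request exceeds MaxModelLen")

// admit applies the enqueue guards and the auto-fill. It reads only the
// input length and the client budget, never the actual output tokens.
func admit(r *request, maxModelLen int64) error {
	if r.MaxOutputLen < 0 {
		return errors.New("MaxOutputLen must be non-negative")
	}
	if maxModelLen == 0 {
		return nil // unlimited mode (assumed encoding): no guard, no auto-fill
	}
	input := int64(len(r.InputTokens))
	if input >= maxModelLen {
		return errTooLong // input alone leaves no room for output
	}
	if r.MaxOutputLen == 0 {
		r.MaxOutputLen = maxModelLen - input // auto-fill the remaining budget
		return nil
	}
	// Written as a subtraction so input + MaxOutputLen cannot overflow.
	if r.MaxOutputLen > maxModelLen-input {
		return errTooLong
	}
	return nil
}

func main() {
	r := &request{InputTokens: make([]int, 10)}
	fmt.Println(admit(r, 100), r.MaxOutputLen)
}
```

The runtime force-completion at ProgressIndex >= MaxModelLen described in the same commits remains a separate safety net and is not modeled here.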

sriumcp and others added 11 commits March 10, 2026 23:20

feat(latency): register trained-roofline backend name (BC-1)

0d6e7c3

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

test(latency): add monotonicity behavioral tests for trained-roofline…

e01696a

… (BC-4,5) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

docs: self-audit — update models.md, roofline.md, tutorial.md for tra…

88c361a

…ined-roofline All documentation working copies now mention trained-roofline consistently. Source-of-truth map: 12/12 working copies updated. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

sriumcp force-pushed the trained-roofline-backend branch from b10ad79 to 88c361a Compare March 11, 2026 03:24

sriumcp merged commit 36c8325 into inference-sim:main Mar 11, 2026
4 checks passed

namasl mentioned this pull request Mar 12, 2026

merge: bring pd branch up to date with main #626

Merged

5 tasks

sriumcp deleted the trained-roofline-backend branch March 19, 2026 14:24

sriumcp merged 11 commits into inference-sim:main from sriumcp:trained-roofline-backend
