feat(latency): add trained-roofline backend with roofline basis functions × learned corrections #616
Merged
sriumcp merged 11 commits into inference-sim:main on Mar 11, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…rainedRooflineLatencyModel (BC-3,6,7,8,9,11,15)
- Add PostDecodeFixedOverhead() int64 to LatencyModel interface
- Existing backends (blackbox, roofline, crossmodel) return 0
- Simulator recordRequestCompletion adds PostDecodeFixedOverhead to E2E
- TrainedRooflineLatencyModel: 6 roofline basis functions with learned corrections
- Zero heap allocations in StepTime (19ns/op, 0 allocs/op)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
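The interface change described in this commit can be pictured with a small Go sketch. Only the method name PostDecodeFixedOverhead() int64 comes from the commit message; the reduced interface, the StepTime signature, the struct names, and the numeric values below are illustrative assumptions, not the repository's code.

```go
package main

import "fmt"

// LatencyModel is a reduced, hypothetical stand-in for the simulator's
// interface. Only PostDecodeFixedOverhead is named in the commit message.
type LatencyModel interface {
	// StepTime predicts one batch step (signature and units are assumed).
	StepTime(batchTokens int) int64
	// PostDecodeFixedOverhead is added once per request to E2E latency.
	PostDecodeFixedOverhead() int64
}

// legacyModel mimics a pre-existing backend: no fixed overhead.
type legacyModel struct{ perTokenMicros int64 }

func (m legacyModel) StepTime(batchTokens int) int64 { return m.perTokenMicros * int64(batchTokens) }
func (m legacyModel) PostDecodeFixedOverhead() int64 { return 0 }

// trainedModel mimics the new backend with a non-zero overhead term.
type trainedModel struct {
	perTokenMicros int64
	overheadMicros int64 // illustrative value, not a fitted coefficient
}

func (m trainedModel) StepTime(batchTokens int) int64 { return m.perTokenMicros * int64(batchTokens) }
func (m trainedModel) PostDecodeFixedOverhead() int64 { return m.overheadMicros }

func main() {
	models := []LatencyModel{
		legacyModel{perTokenMicros: 10},
		trainedModel{perTokenMicros: 10, overheadMicros: 250},
	}
	for _, m := range models {
		fmt.Println(m.StepTime(4), m.PostDecodeFixedOverhead())
	}
}
```

Returning 0 from the older backends keeps their end-to-end numbers unchanged, which is how the commit describes backward compatibility.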
…,13,14)
- Full validation: TP, NumLayers, NumHeads, HiddenDim, IntermediateDim, TFlopsPeak, BwPeakTBs, NumHeads%TP, NumKVHeads%TP, 7 beta coefficients
- Derives architecture features at construction: headDim, dKV, dFF, kEff
- Table-driven error tests for all validation paths
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
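A minimal sketch of construction-time validation in this style, assuming a hypothetical archConfig struct that holds a subset of the listed fields. The derivation headDim = HiddenDim / NumHeads is the conventional definition and is an assumption here; dKV, dFF, and kEff are not reproduced because the commit message does not define them.

```go
package main

import (
	"errors"
	"fmt"
)

// archConfig is a hypothetical subset of the fields the factory validates.
type archConfig struct {
	TP, NumLayers, NumHeads, NumKVHeads, HiddenDim int
}

// validate rejects non-positive sizes and head counts not divisible by TP.
func validate(c archConfig) error {
	switch {
	case c.TP <= 0:
		return errors.New("TP must be positive")
	case c.NumLayers <= 0:
		return errors.New("NumLayers must be positive")
	case c.NumHeads <= 0:
		return errors.New("NumHeads must be positive")
	case c.HiddenDim <= 0:
		return errors.New("HiddenDim must be positive")
	case c.NumHeads%c.TP != 0:
		return fmt.Errorf("NumHeads (%d) not divisible by TP (%d)", c.NumHeads, c.TP)
	case c.NumKVHeads%c.TP != 0:
		return fmt.Errorf("NumKVHeads (%d) not divisible by TP (%d)", c.NumKVHeads, c.TP)
	}
	return nil
}

// headDim derives the per-head dimension (assumed HiddenDim / NumHeads).
func headDim(c archConfig) int { return c.HiddenDim / c.NumHeads }

func main() {
	// Illustrative values only.
	cases := []archConfig{
		{TP: 2, NumLayers: 40, NumHeads: 40, NumKVHeads: 8, HiddenDim: 5120},
		{TP: 3, NumLayers: 40, NumHeads: 40, NumKVHeads: 8, HiddenDim: 5120},
	}
	for _, c := range cases {
		if err := validate(c); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("headDim:", headDim(c))
	}
}
```

Checking TP first keeps the later modulo operations safe from division by zero.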
… (BC-4,5)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add trained_roofline_defaults section to defaults.yaml with 7 betas + 3 alphas
- Add TrainedRooflineDefaults struct to cmd/default_config.go
- CLI handling: 4 sites in cmd/root.go (loading block, zero-coefficients guard, HFConfig parsing, help text)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
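One way a zero-coefficients guard over such defaults might look is sketched below. The struct name TrainedRooflineDefaults and the 7-beta / 3-alpha counts come from the commit; the field names, the helper checkDefaults, and the choice to treat all-zero betas as an error and all-zero alphas as a warning are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// TrainedRooflineDefaults mirrors the struct named in the commit.
// Field names are hypothetical.
type TrainedRooflineDefaults struct {
	BetaCoeffs  []float64 // expected length 7
	AlphaCoeffs []float64 // expected length 3
}

// allZero reports whether every element is exactly zero.
func allZero(xs []float64) bool {
	for _, x := range xs {
		if x != 0 {
			return false
		}
	}
	return true
}

// checkDefaults returns an error for unusable coefficients and a warning
// string for suspicious but usable ones (assumed policy).
func checkDefaults(d TrainedRooflineDefaults) (warning string, err error) {
	if len(d.BetaCoeffs) != 7 {
		return "", fmt.Errorf("want 7 beta coefficients, got %d", len(d.BetaCoeffs))
	}
	if len(d.AlphaCoeffs) != 3 {
		return "", fmt.Errorf("want 3 alpha coefficients, got %d", len(d.AlphaCoeffs))
	}
	if allZero(d.BetaCoeffs) {
		return "", errors.New("all beta coefficients are zero")
	}
	if allZero(d.AlphaCoeffs) {
		return "all alpha coefficients are zero", nil
	}
	return "", nil
}

func main() {
	d := TrainedRooflineDefaults{
		BetaCoeffs:  []float64{1, 1, 1, 1, 1, 1, 1}, // placeholders, not fitted values
		AlphaCoeffs: []float64{0, 0, 0},
	}
	warning, err := checkDefaults(d)
	fmt.Printf("warning=%q err=%v\n", warning, err)
}
```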
- CLAUDE.md: "Four modes", file tree, Key Data Flow
- sim/config.go: Backend field comment
- sim/latency/latency.go: package doc
- docs/concepts/core-engine.md: "four latency model backends"
- docs/concepts/glossary.md: "Four modes" + trained-roofline description
- Plan committed alongside implementation
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…x ITL contamination
- PostDecodeFixedOverhead only applied when len(OutputTokens) > 0
- RequestITLs computed from itlSum directly (not lat-FirstTokenTime) to avoid contaminating per-token average ITL with fixed overhead
- Add zero-alpha warning for trained-roofline CLI path
Caught by code review Step 4.5.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
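The two fixes in this commit can be illustrated with a small sketch. The names OutputTokens, FirstTokenTime, and itlSum appear in the commit message; the request struct, the helper functions, the composition of E2E latency, and the choice of denominator for the average are assumptions for illustration.

```go
package main

import "fmt"

// request is a hypothetical reduction of the simulator's per-request state.
type request struct {
	OutputTokens   []int
	FirstTokenTime int64 // time to first token
	itlSum         int64 // sum of inter-token latencies
}

// e2eLatency adds the fixed overhead only when output was produced.
func e2eLatency(r request, overhead int64) int64 {
	lat := r.FirstTokenTime + r.itlSum
	if len(r.OutputTokens) > 0 {
		lat += overhead
	}
	return lat
}

// meanITL averages itlSum directly, so the fixed overhead never leaks in.
// Dividing by the output token count is an assumption of this sketch.
func meanITL(r request) float64 {
	n := len(r.OutputTokens)
	if n == 0 {
		return 0
	}
	return float64(r.itlSum) / float64(n)
}

func main() {
	r := request{OutputTokens: []int{1, 2, 3, 4}, FirstTokenTime: 100, itlSum: 300}
	lat := e2eLatency(r, 50)
	// Deriving ITL from lat-FirstTokenTime would fold the overhead into it.
	contaminated := float64(lat-r.FirstTokenTime) / float64(len(r.OutputTokens))
	fmt.Println(lat, meanITL(r), contaminated)
}
```

With these illustrative numbers, the average derived from lat-FirstTokenTime is larger than the one derived from itlSum by exactly the overhead divided by the token count, which is the contamination the commit removes.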
… guide
- Add trained-roofline section with formula, alpha model, accuracy caveats
- Update comparison table to 4 backends
- Update recommendation: trained-roofline is now the default for new models
- Update pluggable architecture to show 4 interface methods
- Fix cross-model description accuracy
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…lloc test, config ref
- Extension recipe: 3→4 methods, added bundle.go + CLI wiring touch points, added trained-roofline as 4th example
- Factory: defensive copy of beta/alpha slices to enforce "frozen" contract
- Test: add TestTrainedRoofline_StepTime_ZeroAllocs using testing.AllocsPerRun
- Configuration reference: add trained-roofline to --latency-model flag description
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
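The defensive copy and the allocation check can be sketched together. testing.AllocsPerRun is the real standard-library function named in the commit; the model type, newModel, stepTime, and the three-element basis are hypothetical simplifications rather than the repository's implementation.

```go
package main

import (
	"fmt"
	"testing"
)

// model is a hypothetical stand-in holding frozen coefficients.
type model struct{ beta []float64 }

// newModel copies the caller's slice so later mutation cannot reach the model.
func newModel(beta []float64) *model {
	return &model{beta: append([]float64(nil), beta...)}
}

// stepTime is a toy hot path: a dot product over a fixed-size array,
// which needs no heap allocation.
func (m *model) stepTime(basis [3]float64) float64 {
	var sum float64
	for i, b := range basis {
		sum += m.beta[i] * b
	}
	return sum
}

func main() {
	src := []float64{1, 2, 3}
	m := newModel(src)
	src[0] = 100 // caller mutation; the model keeps its own copy

	basis := [3]float64{1, 1, 1}
	var sink float64
	allocs := testing.AllocsPerRun(1000, func() { sink = m.stepTime(basis) })
	fmt.Println(sink, allocs)
}
```

Inside a real test, the same AllocsPerRun call would be followed by a t.Errorf when the result is non-zero.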
…head pattern
- Quickstart: add trained-roofline example (recommended for new models)
- recordRequestCompletion: document that E2E includes non-blocking PostDecodeFixedOverhead and OutputTokenProcessingTime, explaining why RequestCompletionTimes exceeds RequestLeftEvent timestamp
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ined-roofline
All documentation working copies now mention trained-roofline consistently. Source-of-truth map: 12/12 working copies updated.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
b10ad79 to 88c361a
namasl added a commit to namasl/inference-sim that referenced this pull request Mar 12, 2026
Merges 11 commits from main into pd, including:
- trained-roofline latency backend (inference-sim#616)
- MaxOutputLen population + engine auto-fill (inference-sim#621)
- MaxModelLen int64 + rope_scaling extraction (inference-sim#606)
- MoE-aware roofline latency (inference-sim#561)
- MaxModelLen enforcement + oracle knowledge boundary (inference-sim#579, inference-sim#587)
- Dead HardwareCalib fields removal (inference-sim#596)
- Default example model switch to Qwen3-14B (inference-sim#608)
- CI/CD updates (inference-sim#600, inference-sim#601, inference-sim#607)
Conflict resolutions:
- CLAUDE.md: kept both INV-9 (main) and INV-PD-* (pd) invariants
- cmd/root.go: merged LengthCappedRequests counter + DroppedKVAllocations
- sim/bundle.go: added trained-roofline backend + kept disaggregation deciders
- sim/cluster/metrics.go: added LengthCappedRequests field
- docs: merged invariants and results documentation from both branches
- sim/cluster/disaggregation_test.go: added MaxModelLen param to NewModelHardwareConfig calls
- sim/cluster/cluster_event.go: rewrote comment to avoid INV-9 test false positive
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
5 tasks
namasl added a commit that referenced this pull request Mar 12, 2026
* feat(sim): MaxModelLen enforcement and MaxOutputLen budget (#567) (#579) Add vLLM-equivalent max_model_len enforcement at three layers: 1. Startup validation: ceil(MaxModelLen/BlockSize) <= TotalKVBlocks 2. Enqueue guard: input >= MaxModelLen rejected (matching vLLM serving.py:1542); input + MaxOutputLen > MaxModelLen rejected when client declares budget 3. Runtime stop: force-complete at ProgressIndex >= MaxModelLen (defense-in-depth) Key design decisions: - Oracle Knowledge Boundary (INV-9): control plane never reads OutputTokens. Uses MaxOutputLen (client budget) or input-only check. Runtime stop handles output growth. Verified by behavioral + structural grep tests. - Auto-derive from HF max_position_embeddings for roofline/crossmodel backends, with rope_scaling blacklist (excludes su/longrope/llama3 per vLLM), yarn special-case using original_max_position_embeddings, and KV-feasible capping. - Overflow-safe ceiling division in startup validation (R11). - R3 validation at CLI (logrus.Fatalf) and constructor (panic). New tests (12): BC-1 through BC-5, BC-7 conservation with drops, boundary tests (input==MaxModelLen, exact fit), R3 constructor panic, INV-9 structural enforcement. Partially addresses #529 (reasoning workload livelock) for roofline/crossmodel. Blackbox gap tracked in #578. Closes: #567 Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * feat(latency): MoE-aware roofline latency model (#559) (#561) * feat(sim): add MoEExpertFFNDim and SharedExpertFFNDim to ModelConfig Two new fields for MoE-aware roofline: per-routed-expert FFN dimension and total shared-expert FFN dimension. Both default to 0 (dense model). Zero-value safe for all existing construction sites (R4 audit: all dense model configs use zero-valued MoE fields). 
Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): parse MoE per-expert and shared-expert dims from HF config Extends GetModelConfigFromHF to parse moe_intermediate_size, shared_expert_intermediate_size, and n_shared_experts. Expert count resolution chain extended to include num_routed_experts (DeepSeek-V3). Implements BC-15 through BC-18 from the MoE roofline design. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): add MoE consistency validation to ValidateRooflineConfig Validates: experts>0 requires active>0, active<=total, non-negative MoE dimensions. Catches inconsistent MoE configs at construction time. Implements BC-12, BC-13, BC-14 from MoE roofline design. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(sim): address convergence review findings (I-1, I-2) I-1: Align SharedExpertFFNDim JSON tag to shared_expert_intermediate_size (matches HF config field name convention, consistent with other tags). I-2: Add negative NumLocalExperts validation in ValidateRooflineConfig (R3 compliance — all numeric parameters validated). Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): MoE-aware FLOPs, active weight bandwidth, and smoke tests MoE FLOPs (Task 4): calculateTransformerFlops now computes routed (top_k), shared, and gate MLP FLOPs for MoE models. Dense models use unchanged code path (NumLocalExperts=0 guard). Active weights (Task 5): calculateMemoryAccessBytes uses top_k (active experts) for per-step weight bandwidth, matching vLLM's fused_moe kernel behavior. Includes shared expert and gate weights. Smoke tests (Task 7): Mixtral-8x7B and DeepSeek-V3 step time smoke tests plus dense regression anchor (TP=1=12151µs, TP=2=6820µs). Implements BC-1 through BC-6, BC-10 from MoE roofline design. 
Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): use per-expert FFN dim for MoE KV capacity weight estimation Fixes the critical bug where DeepSeek-V3's general intermediate_size (18432) was used as per-expert dim (should be 2048), overestimating MLP weights by ~9× and returning zero usable KV blocks. Changes: - KVCapacityParams gains MoEExpertFFNDim and SharedExpertFFNDim fields - NewKVCapacityParams gains 2 new positional args (R4 enforced) - computeModelWeightBytes uses per-expert dim when nonzero, falls back to IntermediateDim (Mixtral convention) - ExtractKVCapacityParams propagates new fields, extends expert count chain to include num_routed_experts (parity with GetModelConfigFromHF) Implements BC-7 (per-expert dim fix), BC-9 (param cross-validation), BC-11 (dense unchanged). Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): convergence review round 2 — R23 parity, documentation, R15 I-1: Align expert count resolution threshold between GetModelConfigFromHF and ExtractKVCapacityParams. Both now use >1 threshold (single-expert models are dense-equivalent). Fixes R23 code path parity violation. I-2: Add precondition comments to calculateTransformerFlops and calculateMemoryAccessBytes documenting ValidateRooflineConfig requirement. I-3: Document SharedExpertFFNDim "total dim" semantics — correct due to SwiGLU linearity (N × (3 × d × e) == 3 × d × (N × e)). I-4: Add R15 staleness notes to hardening-validation-cleanup-plan.md and pr2-kv-capacity-auto-calculate-plan.md (NewKVCapacityParams now 6-arg). I-5: Document active vs total weight distinction in calculateMemoryAccessBytes to prevent future R23 regression. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): align MoE threshold to > 1 across all consumption paths (R23) Parsing layer already used > 1 (single-expert models are dense-equivalent). 
Consumption paths (calculateTransformerFlops, calculateMemoryAccessBytes, crossmodel isMoE, ValidateRooflineConfig, computeModelWeightBytes) now use > 1 as well, matching the documented design intent and resolving the R23 code path parity violation. Also fixes stale doc comment in ExtractKVCapacityParams ("> 0" → "> 1"). Round 3 convergence review fixes. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(cmd): update stale MoE warning, gofmt alignment, R15 crossmodel plan - cmd/root.go: Replace misleading "assumes dense transformers" warning with accurate MoE info message (roofline now models per-expert FLOPs) - sim/model_hardware_config.go: Run gofmt to fix struct field alignment - docs/plans/pr472b-crossmodel-backend-plan.md: Add R15 staleness note for threshold change (> 0 → > 1) Round 4 convergence review fixes. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * refactor(latency): port llm-optimizer single-crossover roofline physics Replace dual-ceiling model (GEMM + vector ceilings) with single-crossover: step_time = max(total_flops / (peak * MFU), total_bytes / peak_bandwidth) Remove bandwidth haircut (BwEffConstant no longer used in step time). Remove all overhead terms (TOverheadMicros, PerLayerOverhead, AllReduceLatency). Keeps BLIS's superior model-awareness: actual IntermediateDim, SwiGLU 3-matrix MLP, MoE support, FlashAttention-aware memory model. Motivation: BLIS roofline has 215% ITL MAPE vs llm-optimizer's 36.5%. The dual ceiling + bandwidth haircut + overhead stacking caused ~3x systematic over-prediction for memory-bound decode steps. Design: docs/plans/2026-03-09-roofline-llm-optimizer-port-design.md Co-Authored-By: Claude Opus 4.6 <[email protected]> * config: update MFU values to llm-optimizer defaults (0.45/0.30) MfuPrefill: 0.65 → 0.45, MfuDecode: 0.12 → 0.30 for all GPU entries. These values match llm-optimizer's defaults which achieve 36.5% ITL MAPE on the sim-to-real evaluation (discussion #522). 
Other HardwareCalib fields (BwEffConstant, overheads) remain unchanged for backward compatibility — they are no longer used by rooflineStepTime() but may be consumed by other callers. Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: add roofline llm-optimizer port design and implementation plan Design doc: decision record for porting llm-optimizer's single-crossover roofline physics into BLIS. Implementation plan: 3 tasks (physics rewrite, MFU update, verification). Motivation: discussion #522 sim-to-real accuracy validation. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): load weights once per step in roofline (unified forward pass) vLLM chunked prefill processes all tokens (prefill + decode) in a single forward pass — weights are loaded from HBM once per step, not once per phase. The previous implementation loaded weights independently for prefill and decode phases, doubling the memory-bound term for mixed batches (~2x over-prediction). Sources: vLLM V1 blog ("all selected requests are flattened and concatenated into one long super-sequence for that single forward pass"), Sarathi-Serve OSDI'24 ("cost of loading model weights from HBM is amortized across all prompts in a batch"). Adds TestRooflineStepTime_MixedBatch_WeightsLoadedOnce which verifies the overhead of adding prefill to a decode step is much less than a full weight load (7µs vs 4166µs). Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): use 2-matrix MLP in roofline FLOPs and weight calculation Change MLP factor from 3 (SwiGLU gate+up+down) to 2 (up+down) in both calculateTransformerFlops and calculateMemoryAccessBytes, matching llm-optimizer's formulation. For models like Llama-2-70B where IntermediateDim=28672, the 3-matrix formula produced 31% more MLP weight bytes than llm-optimizer's 2-matrix formula, directly inflating memory-bound decode predictions. Applies to both dense and MoE paths (routed + shared expert FLOPs/weights). 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * config: bump MFU values to 0.55/0.35 to reduce roofline over-prediction MfuPrefill: 0.45 → 0.55 (reduces compute-bound prefill/TTFT predictions ~18%) MfuDecode: 0.30 → 0.35 (reduces near-crossover decode predictions ~14%) Motivation: after porting llm-optimizer single-crossover physics, BLIS roofline still over-predicts by ~50% MAPE. Higher MFU reflects observed H100 tensor core utilization for large prefill GEMMs and batched decode. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): restore SwiGLU 3-matrix MLP, revert MFU bump Revert MFU values to llm-optimizer defaults (0.45/0.30) — the bump to 0.55/0.35 went the wrong direction (both models under-predict). Restore 3-matrix MLP (gate + up + down) for SwiGLU, replacing the 2-matrix formula copied from llm-optimizer. SwiGLU actually has 3 weight matrices that all need HBM loading: this is the physically correct formula and increases weight bytes by ~37%, which reduces the under-prediction from ~50% toward the target. Dense and MoE paths both updated consistently (R23). Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): conditional SwiGLU detection via HiddenAct field Add mlpMatrixCount() helper that returns 3 for SwiGLU (silu/swiglu/geglu) or 2 for standard (gelu/relu) MLP. Parsed from HF config's hidden_act field. Empty defaults to SwiGLU since most modern LLMs use it. Both calculateTransformerFlops and calculateMemoryAccessBytes now use nMat instead of hardcoded 3, correctly handling non-SwiGLU models. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): revert to 2-matrix MLP convention matching llm-optimizer 3-matrix with raw intermediate_size over-predicts for models like Llama2-70B whose intermediate_size (28672) exceeds the standard SwiGLU (2/3 × 4d) convention. Using nMat=2 matches llm-optimizer's approach where 2 × d × intermediate ≈ physical weight count for most models. 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): remove MoE-specific branches from roofline step time Roofline now treats MoE models identically to dense (matching llm-optimizer which has no MoE-specific handling). MoE fields (NumLocalExperts, MoEExpertFFNDim, SharedExpertFFNDim) are still used by KV capacity (kv_capacity.go) for GPU memory budgeting. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): MoE roofline scales weights by E, FLOPs by top_k Mixtral was under-predicted by ~10x because the dense treatment loaded 1 expert's MLP weights instead of all 8. Fix: - Weight bandwidth: E × MLP weights (all experts loaded from HBM per step) - FLOPs: top_k × MLP FLOPs (only active experts compute per token) Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): use MoEExpertFFNDim in roofline when set For DeepSeek-V3 style models where intermediate_size (18432) differs from per-expert dim (2048), use MoEExpertFFNDim for MoE weight and FLOP calculations. Falls back to IntermediateDim when unset (Mixtral). 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): address PR #561 review — revert crossmodel scope, fix docs - Revert crossmodel MoE threshold from > 1 back to > 0 (scope violation: crossmodel behavioral change doesn't belong in a roofline PR) - Fix design doc table and CLI comment claiming roofline models shared experts and gate FLOPs (it doesn't — only KV capacity does) - Fix HiddenAct comments that incorrectly claim it selects 3-matrix vs 2-matrix MLP (mlpMatrixCount always returns 2) - Document intentional 2-matrix (roofline) vs 3-matrix (KV capacity) design choice with cross-references in both files Co-Authored-By: Claude Opus 4.6 <[email protected]> --------- Co-authored-by: Claude Opus 4.6 <[email protected]> Co-authored-by: Srinivasan Parthasarathy <[email protected]> * fix(sim): PR #567 follow-up — validation gaps, LengthCappedRequests counter, INV-9 extension (#580) (#587) - Add negative MaxModelLen validation in NewSimulator (BC-1: defense-in-depth for struct literal bypass) - Add LengthCappedRequests metric counter across 5-file pattern (BC-2, BC-3, BC-4) - Add end-to-end sim.Run() test for BC-5 runtime length cap path - Extend INV-9 structural test to scan sim/cluster/ control-plane files (BC-6) - Add negative MaxOutputLen validation in EnqueueRequest (BC-7: R3 gap) - Add gemma3 model_type exclusion for rope_scaling (BC-9: matches vLLM) - Add rope_scaling parse-failure warnings for malformed HF configs (BC-8) - Fix kvFeasibleMax comment accuracy (blockSizeTokens is configurable, not 16) Fixes #580 Co-authored-by: Claude <[email protected]> * refactor(sim): remove dead HardwareCalib fields — BwEffConstant, TOverheadMicros, PerLayerOverhead, AllReduceLatency (#596) These fields became dead code after the roofline physics port (llm-optimizer single-crossover model). No runtime code path reads them; ValidateRooflineConfig enforced BwEffConstant > 0 on a value nothing consumed. 
Removing them eliminates config-file clutter and prevents future contributors from assuming they're active. Fixes #590 Co-authored-by: Claude Opus 4.6 <[email protected]> * Configure claude on GH Actions (#600) Signed-off-by: Jing Chen <[email protected]> * Enable claude on PRs (#601) Signed-off-by: Jing Chen <[email protected]> * ignore training and actions runner (#607) Signed-off-by: Srinivasan Parthasarathy <[email protected]> * fix(sim): PR #580 deferred items — rope_scaling extraction, MaxModelLen int64, tests, docs (#606) Complete 7 deferred hardening items from issue #580 (PR #587 handoff): 1. Extract applyRopeScaling as a pure function with 26 table-driven test cases covering blacklist (su/longrope/llama3), mrope fall-through, gemma3 substring match (handles text_config pivot), yarn original base, overflow guards, NaN/Inf defense, degenerate inputs. 2. Change MaxModelLen from int to int64 for consistency with ProgressIndex, TotalKVBlocks, BlockSizeTokens. Updates 6 type sites, removes redundant int64() casts, adds int64() widening at EnqueueRequest comparison sites. 3. Add cluster-mode MaxModelLen drop test (BC-6): Guard 1a (input >= limit) and Guard 1b (input + budget > limit), INV-1 conservation, inFlightRequests drain, Metrics.Requests map cleanup. 4. Add chunked prefill + MaxModelLen interaction test (BC-7): verifies no spurious force-completion during multi-chunk prefill (TotalOutputTokens=49, LengthCappedRequests=0, TTFT recorded). 5. Add glossary entries for MaxModelLen and Oracle Knowledge Boundary (INV-9). 6. Refine rope_scaling documentation with explicit blacklist details. 7. Fix pre-existing gemma3 bug: ParseHFConfig's text_config pivot overwrites model_type from "gemma3" to "gemma3_text", making the exact-match check dead code. Changed to strings.Contains to match vLLM's substring semantics. Related to #580. Discovered issues: #602, #603, #604, #605. 
Co-authored-by: Claude <[email protected]> * feat(latency): add trained-roofline backend with roofline basis functions × learned corrections (#616) * feat(latency): register trained-roofline backend name (BC-1) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(latency): add PostDecodeFixedOverhead to interface + implement TrainedRooflineLatencyModel (BC-3,6,7,8,9,11,15) - Add PostDecodeFixedOverhead() int64 to LatencyModel interface - Existing backends (blackbox, roofline, crossmodel) return 0 - Simulator recordRequestCompletion adds PostDecodeFixedOverhead to E2E - TrainedRooflineLatencyModel: 6 roofline basis functions with learned corrections - Zero heap allocations in StepTime (19ns/op, 0 allocs/op) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(latency): wire trained-roofline factory in NewLatencyModel (BC-2,13,14) - Full validation: TP, NumLayers, NumHeads, HiddenDim, IntermediateDim, TFlopsPeak, BwPeakTBs, NumHeads%TP, NumKVHeads%TP, 7 beta coefficients - Derives architecture features at construction: headDim, dKV, dFF, kEff - Table-driven error tests for all validation paths Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * test(latency): add monotonicity behavioral tests for trained-roofline (BC-4,5) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(latency): add trained-roofline defaults + CLI loading (BC-10,12) - Add trained_roofline_defaults section to defaults.yaml with 7 betas + 3 alphas - Add TrainedRooflineDefaults struct to cmd/default_config.go - CLI handling: 4 sites in cmd/root.go (loading block, zero-coefficients guard, HFConfig parsing, help text) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: add trained-roofline to latency model documentation - CLAUDE.md: "Four modes", file tree, Key Data Flow - sim/config.go: Backend field comment - sim/latency/latency.go: package doc - docs/concepts/core-engine.md: "four latency model backends" - 
docs/concepts/glossary.md: "Four modes" + trained-roofline description - Plan committed alongside implementation Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(sim): guard PostDecodeFixedOverhead for zero-output requests + fix ITL contamination - PostDecodeFixedOverhead only applied when len(OutputTokens) > 0 - RequestITLs computed from itlSum directly (not lat-FirstTokenTime) to avoid contaminating per-token average ITL with fixed overhead - Add zero-alpha warning for trained-roofline CLI path Caught by code review Step 4.5. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs(guide): comprehensive trained-roofline section in latency models guide - Add trained-roofline section with formula, alpha model, accuracy caveats - Update comparison table to 4 backends - Update recommendation: trained-roofline is now the default for new models - Update pluggable architecture to show 4 interface methods - Fix cross-model description accuracy Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: convergence Round 1 fixes — extension recipe, slice copy, zero-alloc test, config ref - Extension recipe: 3→4 methods, added bundle.go + CLI wiring touch points, added trained-roofline as 4th example - Factory: defensive copy of beta/alpha slices to enforce "frozen" contract - Test: add TestTrainedRoofline_StepTime_ZeroAllocs using testing.AllocsPerRun - Configuration reference: add trained-roofline to --latency-model flag description Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: add trained-roofline to quickstart + document non-blocking overhead pattern - Quickstart: add trained-roofline example (recommended for new models) - recordRequestCompletion: document that E2E includes non-blocking PostDecodeFixedOverhead and OutputTokenProcessingTime, explaining why RequestCompletionTimes exceeds RequestLeftEvent timestamp Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: self-audit 
— update models.md, roofline.md, tutorial.md for trained-roofline All documentation working copies now mention trained-roofline consistently. Source-of-truth map: 12/12 working copies updated. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> --------- Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * feat(sim): populate MaxOutputLen on all workload paths + engine auto-fill (#621) * feat(sim): add MaxOutputLen auto-fill in EnqueueRequest (BC-1..BC-4) - Auto-fill MaxOutputLen = maxModelLen - len(InputTokens) when client omits budget (MaxOutputLen==0) and maxModelLen > 0 - Mirrors vLLM input_processor.py:554 safety cap - No auto-fill when client sets budget (BC-2), unlimited mode (BC-3), or input exceeds context (BC-4) Refs: #572 Co-Authored-By: Claude <[email protected]> * feat(workload): set MaxOutputLen on all request construction sites (BC-5..BC-7) - generator.go: MaxOutputLen = len(outputTokens) (synthetic/multimodal) - replay.go: MaxOutputLen = len(outputTokens) (trace v2 replay) - reasoning.go: MaxOutputLen = len(outputTokens) (multi-turn reasoning) - Matches inference-perf pattern: max_tokens = sampled output length Fixes #572 Co-Authored-By: Claude <[email protected]> * docs(sim): update EnqueueRequest doc comment for auto-fill preprocessing Co-Authored-By: Claude <[email protected]> * docs(test): update stale MaxOutputLen=0 comments for auto-fill semantics - Three tests referenced 'input-only check' for MaxOutputLen=0 - After auto-fill, MaxOutputLen is set to maxModelLen - input - Tests still pass numerically; comments now reflect actual behavior Co-Authored-By: Claude <[email protected]> --------- Co-authored-by: Claude <[email protected]> * docs: switch default example model to public Qwen/Qwen3-14B (#608) * docs: switch default example model to public qwen/qwen2.5-7b-instruct Replace gated meta-llama/llama-3.1-8b-instruct with publicly available qwen/qwen2.5-7b-instruct in all user-facing docs (README, quickstart, 
tutorial, guides, reference, CLAUDE.md, CONTRIBUTING.md). Roofline/crossmodel examples now work without HF authentication. Set qwen default TP=1 in defaults.yaml so examples use the default without explicit --tp flags. Update KV block count, coefficient examples, and prose references to match TP=1 values. Fixes #545 Co-Authored-By: Claude Opus 4.6 <[email protected]> * chore(defaults): update vllm version to v0.11.0 for 4 models (H100 TP=1) Update default and trained-coefficient vllm_version for qwen2.5-7b-instruct, qwen3-14b, llama-3.1-8b-instruct, and qwen2.5-3b-instruct to vllm/vllm-openai:v0.11.0. Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: switch default example model from qwen2.5-7b to qwen3-14b Qwen3-14B (Qwen/Qwen3-14B) is a newer, publicly available model with pre-trained coefficients already in defaults.yaml. Update all documentation examples and references accordingly. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix: address review comments — stale refs, tutorial throughput - Fix "LLaMA 3.1 8B" comment in experimentation.md (issue #3) - Update stale llama-3.1-8b/132,139 refs in configuration.md (issue #4) - Recalibrate tutorial for qwen3-14b throughput: ~2.5 req/s per instance, target 20 req/s (was 57 req/s / 500 req/s for llama) - Scale experimentation.md example to match (20 req/s, not 400) Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: add HF_TOKEN tip to quickstart and README for gated models Roofline/trained-roofline/crossmodel modes auto-fetch from HuggingFace, which fails for gated models without authentication. Add a lightweight tip after the first roofline example in both files recommending HF_TOKEN for gated model access and rate limit avoidance. 
Co-Authored-By: Claude Opus 4.6 <[email protected]> --------- Co-authored-by: Claude Opus 4.6 <[email protected]> --------- Signed-off-by: Jing Chen <[email protected]> Signed-off-by: Srinivasan Parthasarathy <[email protected]> Co-authored-by: Srinivasan Parthasarathy <[email protected]> Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> Co-authored-by: Dipanwita Guhathakurta <[email protected]> Co-authored-by: Jing Chen <[email protected]>
Summary
- Adds a trained-roofline latency backend (--latency-model trained-roofline) that applies learned correction factors (β₁-β₇) to analytical roofline basis functions, fitted via NNLS from 137K real vLLM requests across 4 architectures
- Adds the PostDecodeFixedOverhead() method to the LatencyModel interface so that α₁ (post-decode fixed overhead) is modeled correctly; existing backends return 0 (backward compatible)
- Zero heap allocations in StepTime (verified via testing.AllocsPerRun)
Behavioral Contracts
- IsValidLatencyBackend("trained-roofline") returns true
- NewLatencyModel with Backend="trained-roofline" returns valid model
- … trained_roofline_defaults in defaults.yaml
Test plan
- go test ./...: all 9 packages pass
- golangci-lint run ./...: 0 issues
- … training/basis_functions.py
- … training/output/fit/coefficients.json to full float64 precision
- … TestTrainedRoofline_StepTime_ZeroAllocs via testing.AllocsPerRun
Discovered Issues
🤖 Generated with Claude Code