
feat(latency): add trained-roofline backend with roofline basis functions × learned corrections (#616)

Merged
sriumcp merged 11 commits into inference-sim:main from sriumcp:trained-roofline-backend
Mar 11, 2026
Conversation

Collaborator

@sriumcp commented Mar 11, 2026

Summary

  • Adds a 4th latency model backend (--latency-model trained-roofline) that applies learned correction factors (β₁–β₇) to analytical roofline basis functions, fitted to 137K real vLLM requests across 4 model architectures via non-negative least squares (NNLS)
  • Adds PostDecodeFixedOverhead() method to the LatencyModel interface for correct α₁ (post-decode fixed overhead) modeling — existing backends return 0 (backward compatible)
  • 7% MAPE on GPU combined step time (test split) across Llama-2-7b, Llama-2-70b, Mixtral-8x7B, CodeLlama-34b
  • Zero heap allocations in StepTime (19ns/op, 0 allocs/op verified by testing.AllocsPerRun)

Behavioral Contracts

  • BC-1: IsValidLatencyBackend("trained-roofline") returns true
  • BC-2: NewLatencyModel with Backend="trained-roofline" returns valid model
  • BC-3: StepTime = β₁·max(T_pf_compute, T_pf_kv) + β₂·max(T_dc_compute, T_dc_kv) + β₃·T_weight + β₄·T_tp + β₅·L + β₆·batchSize + β₇
  • BC-4/5: Prefill/decode monotonicity (more tokens → non-decreasing step time)
  • BC-6: StepTime ≥ 1 for all inputs (INV-3)
  • BC-7: QueueingTime = α₀ (constant API processing overhead)
  • BC-8: OutputTokenProcessingTime = α₂ (per-token detokenization)
  • BC-9: MoE weight loading uses min(N, max(k, B·k)) effective experts
  • BC-10: Coefficients loaded from trained_roofline_defaults in defaults.yaml
  • BC-11: No MFU scaling (β₁/β₂ ARE the corrections)
  • BC-12: Existing backends byte-identical (backward compatible)
  • BC-13/14: Coefficient length and config validation with descriptive errors
  • BC-15: PostDecodeFixedOverhead = α₁ (fixed per-request post-decode overhead)
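
BC-3, BC-6, and BC-9 are the formula-shaped contracts; a minimal Go sketch of them follows. All function names and the treatment of basis terms as precomputed inputs are illustrative assumptions, not the repository's actual code.

```go
package main

import "fmt"

// maxf returns the larger of two float64 values.
func maxf(a, b float64) float64 {
	if a > b {
		return a
	}
	return b
}

// stepTimeMicros sketches BC-3/BC-6: seven learned corrections (beta) applied
// to roofline basis terms, with the result clamped to >= 1 (INV-3). The real
// backend derives the basis terms from model and hardware config.
func stepTimeMicros(beta [7]float64, tPfCompute, tPfKV, tDcCompute, tDcKV, tWeight, tTP float64, seqLen, batchSize int64) int64 {
	t := beta[0]*maxf(tPfCompute, tPfKV) +
		beta[1]*maxf(tDcCompute, tDcKV) +
		beta[2]*tWeight +
		beta[3]*tTP +
		beta[4]*float64(seqLen) +
		beta[5]*float64(batchSize) +
		beta[6]
	if t < 1 { // BC-6: StepTime >= 1 for all inputs
		return 1
	}
	return int64(t)
}

// effectiveExperts sketches BC-9: min(N, max(k, B*k)) experts' weights are
// loaded per step for an MoE model with N experts, top-k routing, batch B.
func effectiveExperts(n, k, batch int64) int64 {
	e := k * batch
	if e < k {
		e = k
	}
	if e > n {
		e = n
	}
	return e
}

func main() {
	unit := [7]float64{1, 1, 1, 1, 0, 0, 0}
	fmt.Println(stepTimeMicros(unit, 10, 5, 3, 8, 2, 1, 0, 0)) // 10 + 8 + 2 + 1
	fmt.Println(effectiveExperts(8, 2, 16))                    // large batch saturates all experts
}
```

With unit betas the sketch reduces to the plain roofline sum, which is the sense in which β₁/β₂ "are the corrections" (BC-11): there is no separate MFU factor to scale.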

Test plan

  • go test ./... — all 9 packages pass
  • golangci-lint run ./... — 0 issues
  • Feature fidelity: all 6 basis functions verified term-by-term against training/basis_functions.py
  • Coefficient values match training/output/fit/coefficients.json to full float64 precision
  • Zero-allocation enforcement: TestTrainedRoofline_StepTime_ZeroAllocs via testing.AllocsPerRun
  • Benchmark: 19ns/op, 0 allocs/op
  • Plan convergence: 4 rounds
  • Code convergence: 2 rounds (10 perspectives each)
  • Pre-commit self-audit: all 10 dimensions
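
The zero-allocation check can be reproduced with testing.AllocsPerRun, which is callable outside the test framework. This sketch substitutes an allocation-free stand-in for the real model.StepTime hot path:

```go
package main

import (
	"fmt"
	"testing"
)

// stepStandIn is an allocation-free stand-in for model.StepTime(...): pure
// integer arithmetic on the stack, no heap traffic.
func stepStandIn() int64 {
	var s int64
	for i := int64(1); i <= 8; i++ {
		s += i * i
	}
	return s
}

func main() {
	// AllocsPerRun reports the average number of heap allocations per call
	// of the supplied function. The PR's contract is exactly 0 for StepTime.
	avg := testing.AllocsPerRun(100, func() {
		_ = stepStandIn()
	})
	fmt.Printf("%.0f allocs/op\n", avg)
}
```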

Discovered Issues

🤖 Generated with Claude Code

sriumcp and others added 11 commits March 10, 2026 23:20
feat(latency): register trained-roofline backend name (BC-1)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

feat(latency): add PostDecodeFixedOverhead to interface + implement TrainedRooflineLatencyModel (BC-3,6,7,8,9,11,15)

- Add PostDecodeFixedOverhead() int64 to LatencyModel interface
- Existing backends (blackbox, roofline, crossmodel) return 0
- Simulator recordRequestCompletion adds PostDecodeFixedOverhead to E2E
- TrainedRooflineLatencyModel: 6 roofline basis functions with learned corrections
- Zero heap allocations in StepTime (19ns/op, 0 allocs/op)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
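
The interface extension this commit describes can be pictured as follows. The method name comes from the PR; the struct names and the rest of the shapes are assumptions for illustration:

```go
package main

import "fmt"

// LatencyModel shows only the method this commit adds; the real interface
// has further methods (StepTime, QueueingTime, ...).
type LatencyModel interface {
	PostDecodeFixedOverhead() int64 // alpha_1: fixed per-request post-decode cost (µs)
}

// Existing backends return 0, so their end-to-end numbers stay byte-identical.
type rooflineModel struct{}

func (rooflineModel) PostDecodeFixedOverhead() int64 { return 0 }

// The trained backend returns its fitted alpha_1.
type trainedRooflineModel struct{ alpha1 int64 }

func (m trainedRooflineModel) PostDecodeFixedOverhead() int64 { return m.alpha1 }

func main() {
	for _, m := range []LatencyModel{rooflineModel{}, trainedRooflineModel{alpha1: 350}} {
		fmt.Println(m.PostDecodeFixedOverhead())
	}
}
```

Returning 0 from the pre-existing backends is what makes the interface change backward compatible: the simulator can add the overhead unconditionally at completion time.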
feat(latency): wire trained-roofline factory in NewLatencyModel (BC-2,13,14)

- Full validation: TP, NumLayers, NumHeads, HiddenDim, IntermediateDim,
  TFlopsPeak, BwPeakTBs, NumHeads%TP, NumKVHeads%TP, 7 beta coefficients
- Derives architecture features at construction: headDim, dKV, dFF, kEff
- Table-driven error tests for all validation paths

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
test(latency): add monotonicity behavioral tests for trained-roofline (BC-4,5)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
feat(latency): add trained-roofline defaults + CLI loading (BC-10,12)

- Add trained_roofline_defaults section to defaults.yaml with 7 betas + 3 alphas
- Add TrainedRooflineDefaults struct to cmd/default_config.go
- CLI handling: 4 sites in cmd/root.go (loading block, zero-coefficients guard,
  HFConfig parsing, help text)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
docs: add trained-roofline to latency model documentation

- CLAUDE.md: "Four modes", file tree, Key Data Flow
- sim/config.go: Backend field comment
- sim/latency/latency.go: package doc
- docs/concepts/core-engine.md: "four latency model backends"
- docs/concepts/glossary.md: "Four modes" + trained-roofline description
- Plan committed alongside implementation

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
fix(sim): guard PostDecodeFixedOverhead for zero-output requests + fix ITL contamination

- PostDecodeFixedOverhead only applied when len(OutputTokens) > 0
- RequestITLs computed from itlSum directly (not lat-FirstTokenTime) to
  avoid contaminating per-token average ITL with fixed overhead
- Add zero-alpha warning for trained-roofline CLI path

Caught by code review Step 4.5.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
docs(guide): comprehensive trained-roofline section in latency models guide

- Add trained-roofline section with formula, alpha model, accuracy caveats
- Update comparison table to 4 backends
- Update recommendation: trained-roofline is now the default for new models
- Update pluggable architecture to show 4 interface methods
- Fix cross-model description accuracy

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
fix: convergence Round 1 fixes — extension recipe, slice copy, zero-alloc test, config ref

- Extension recipe: 3→4 methods, added bundle.go + CLI wiring touch points,
  added trained-roofline as 4th example
- Factory: defensive copy of beta/alpha slices to enforce "frozen" contract
- Test: add TestTrainedRoofline_StepTime_ZeroAllocs using testing.AllocsPerRun
- Configuration reference: add trained-roofline to --latency-model flag description

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
docs: add trained-roofline to quickstart + document non-blocking overhead pattern

- Quickstart: add trained-roofline example (recommended for new models)
- recordRequestCompletion: document that E2E includes non-blocking
  PostDecodeFixedOverhead and OutputTokenProcessingTime, explaining why
  RequestCompletionTimes exceeds RequestLeftEvent timestamp

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
docs: self-audit — update models.md, roofline.md, tutorial.md for trained-roofline

All documentation working copies now mention trained-roofline consistently.
Source-of-truth map: 12/12 working copies updated.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
sriumcp force-pushed the trained-roofline-backend branch from b10ad79 to 88c361a on March 11, 2026 03:24
sriumcp merged commit 36c8325 into inference-sim:main on Mar 11, 2026
4 checks passed
namasl added a commit to namasl/inference-sim that referenced this pull request Mar 12, 2026
Merges 11 commits from main into pd, including:
- trained-roofline latency backend (inference-sim#616)
- MaxOutputLen population + engine auto-fill (inference-sim#621)
- MaxModelLen int64 + rope_scaling extraction (inference-sim#606)
- MoE-aware roofline latency (inference-sim#561)
- MaxModelLen enforcement + oracle knowledge boundary (inference-sim#579, inference-sim#587)
- Dead HardwareCalib fields removal (inference-sim#596)
- Default example model switch to Qwen3-14B (inference-sim#608)
- CI/CD updates (inference-sim#600, inference-sim#601, inference-sim#607)

Conflict resolutions:
- CLAUDE.md: kept both INV-9 (main) and INV-PD-* (pd) invariants
- cmd/root.go: merged LengthCappedRequests counter + DroppedKVAllocations
- sim/bundle.go: added trained-roofline backend + kept disaggregation deciders
- sim/cluster/metrics.go: added LengthCappedRequests field
- docs: merged invariants and results documentation from both branches
- sim/cluster/disaggregation_test.go: added MaxModelLen param to NewModelHardwareConfig calls
- sim/cluster/cluster_event.go: rewrote comment to avoid INV-9 test false positive

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
namasl added a commit that referenced this pull request Mar 12, 2026
* feat(sim): MaxModelLen enforcement and MaxOutputLen budget (#567) (#579)

Add vLLM-equivalent max_model_len enforcement at three layers:

1. Startup validation: ceil(MaxModelLen/BlockSize) <= TotalKVBlocks
2. Enqueue guard: input >= MaxModelLen rejected (matching vLLM serving.py:1542);
   input + MaxOutputLen > MaxModelLen rejected when client declares budget
3. Runtime stop: force-complete at ProgressIndex >= MaxModelLen (defense-in-depth)

Key design decisions:
- Oracle Knowledge Boundary (INV-9): control plane never reads OutputTokens.
  Uses MaxOutputLen (client budget) or input-only check. Runtime stop handles
  output growth. Verified by behavioral + structural grep tests.
- Auto-derive from HF max_position_embeddings for roofline/crossmodel backends,
  with rope_scaling blacklist (excludes su/longrope/llama3 per vLLM), yarn
  special-case using original_max_position_embeddings, and KV-feasible capping.
- Overflow-safe ceiling division in startup validation (R11).
- R3 validation at CLI (logrus.Fatalf) and constructor (panic).

New tests (12): BC-1 through BC-5, BC-7 conservation with drops, boundary tests
(input==MaxModelLen, exact fit), R3 constructor panic, INV-9 structural enforcement.

Partially addresses #529 (reasoning workload livelock) for roofline/crossmodel.
Blackbox gap tracked in #578.

Closes: #567

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
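
The overflow-safe ceiling division mentioned for the startup validation (R11) can be sketched as below. The function name and the example limits are illustrative, not the repository's code:

```go
package main

import "fmt"

// ceilDiv computes ceil(a/b) for non-negative a and positive b without the
// overflow hazard of the usual (a + b - 1) / b idiom near math.MaxInt64,
// which is the kind of guard the startup check
// ceil(MaxModelLen/BlockSize) <= TotalKVBlocks needs for large inputs.
func ceilDiv(a, b int64) int64 {
	q := a / b
	if a%b != 0 {
		q++
	}
	return q
}

func main() {
	maxModelLen, blockSize, totalKVBlocks := int64(131073), int64(16), int64(9000)
	need := ceilDiv(maxModelLen, blockSize)
	fmt.Println(need, need <= totalKVBlocks)
}
```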

* feat(latency): MoE-aware roofline latency model (#559) (#561)

* feat(sim): add MoEExpertFFNDim and SharedExpertFFNDim to ModelConfig

Two new fields for MoE-aware roofline: per-routed-expert FFN dimension
and total shared-expert FFN dimension. Both default to 0 (dense model).
Zero-value safe for all existing construction sites (R4 audit: all
dense model configs use zero-valued MoE fields).

Part of #559

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): parse MoE per-expert and shared-expert dims from HF config

Extends GetModelConfigFromHF to parse moe_intermediate_size,
shared_expert_intermediate_size, and n_shared_experts. Expert count
resolution chain extended to include num_routed_experts (DeepSeek-V3).

Implements BC-15 through BC-18 from the MoE roofline design.
Part of #559

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): add MoE consistency validation to ValidateRooflineConfig

Validates: experts>0 requires active>0, active<=total, non-negative
MoE dimensions. Catches inconsistent MoE configs at construction time.

Implements BC-12, BC-13, BC-14 from MoE roofline design.
Part of #559

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(sim): address convergence review findings (I-1, I-2)

I-1: Align SharedExpertFFNDim JSON tag to shared_expert_intermediate_size
     (matches HF config field name convention, consistent with other tags).
I-2: Add negative NumLocalExperts validation in ValidateRooflineConfig
     (R3 compliance — all numeric parameters validated).

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): MoE-aware FLOPs, active weight bandwidth, and smoke tests

MoE FLOPs (Task 4): calculateTransformerFlops now computes routed
(top_k), shared, and gate MLP FLOPs for MoE models. Dense models
use unchanged code path (NumLocalExperts=0 guard).

Active weights (Task 5): calculateMemoryAccessBytes uses top_k
(active experts) for per-step weight bandwidth, matching vLLM's
fused_moe kernel behavior. Includes shared expert and gate weights.

Smoke tests (Task 7): Mixtral-8x7B and DeepSeek-V3 step time smoke
tests plus a dense regression anchor (TP=1: 12151µs, TP=2: 6820µs).

Implements BC-1 through BC-6, BC-10 from MoE roofline design.
Part of #559

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): use per-expert FFN dim for MoE KV capacity weight estimation

Fixes the critical bug where DeepSeek-V3's general intermediate_size
(18432) was used as per-expert dim (should be 2048), overestimating
MLP weights by ~9× and returning zero usable KV blocks.

Changes:
- KVCapacityParams gains MoEExpertFFNDim and SharedExpertFFNDim fields
- NewKVCapacityParams gains 2 new positional args (R4 enforced)
- computeModelWeightBytes uses per-expert dim when nonzero, falls back
  to IntermediateDim (Mixtral convention)
- ExtractKVCapacityParams propagates new fields, extends expert count
  chain to include num_routed_experts (parity with GetModelConfigFromHF)

Implements BC-7 (per-expert dim fix), BC-9 (param cross-validation),
BC-11 (dense unchanged). Part of #559

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): convergence review round 2 — R23 parity, documentation, R15

I-1: Align expert count resolution threshold between GetModelConfigFromHF
and ExtractKVCapacityParams. Both now use >1 threshold (single-expert
models are dense-equivalent). Fixes R23 code path parity violation.

I-2: Add precondition comments to calculateTransformerFlops and
calculateMemoryAccessBytes documenting ValidateRooflineConfig requirement.

I-3: Document SharedExpertFFNDim "total dim" semantics — correct due to
SwiGLU linearity (N × (3 × d × e) == 3 × d × (N × e)).

I-4: Add R15 staleness notes to hardening-validation-cleanup-plan.md and
pr2-kv-capacity-auto-calculate-plan.md (NewKVCapacityParams now 6-arg).

I-5: Document active vs total weight distinction in calculateMemoryAccessBytes
to prevent future R23 regression.

Part of #559

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): align MoE threshold to > 1 across all consumption paths (R23)

Parsing layer already used > 1 (single-expert models are
dense-equivalent). Consumption paths (calculateTransformerFlops,
calculateMemoryAccessBytes, crossmodel isMoE, ValidateRooflineConfig,
computeModelWeightBytes) now use > 1 as well, matching the documented
design intent and resolving the R23 code path parity violation.

Also fixes stale doc comment in ExtractKVCapacityParams ("> 0" → "> 1").

Round 3 convergence review fixes.
Part of #559

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): update stale MoE warning, gofmt alignment, R15 crossmodel plan

- cmd/root.go: Replace misleading "assumes dense transformers" warning
  with accurate MoE info message (roofline now models per-expert FLOPs)
- sim/model_hardware_config.go: Run gofmt to fix struct field alignment
- docs/plans/pr472b-crossmodel-backend-plan.md: Add R15 staleness note
  for threshold change (> 0 → > 1)

Round 4 convergence review fixes.
Part of #559

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* refactor(latency): port llm-optimizer single-crossover roofline physics

Replace dual-ceiling model (GEMM + vector ceilings) with single-crossover:
  step_time = max(total_flops / (peak * MFU), total_bytes / peak_bandwidth)

Remove bandwidth haircut (BwEffConstant no longer used in step time).
Remove all overhead terms (TOverheadMicros, PerLayerOverhead, AllReduceLatency).

Keeps BLIS's superior model-awareness: actual IntermediateDim, SwiGLU
3-matrix MLP, MoE support, FlashAttention-aware memory model.

Motivation: BLIS roofline has 215% ITL MAPE vs llm-optimizer's 36.5%.
The dual ceiling + bandwidth haircut + overhead stacking caused ~3x
systematic over-prediction for memory-bound decode steps.

Design: docs/plans/2026-03-09-roofline-llm-optimizer-port-design.md

Co-Authored-By: Claude Opus 4.6 <[email protected]>
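
The ported formula can be sketched as a single max over the compute and memory ceilings. Units and parameter names here are assumptions for illustration:

```go
package main

import "fmt"

// singleCrossoverMicros sketches the llm-optimizer formulation:
//   step_time = max(total_flops / (peak * MFU), total_bytes / peak_bandwidth)
// peakTflops is in TFLOP/s, peakBwTBs in TB/s; the result is microseconds.
func singleCrossoverMicros(totalFlops, totalBytes, peakTflops, mfu, peakBwTBs float64) float64 {
	compute := totalFlops / (peakTflops * 1e12 * mfu) * 1e6 // compute ceiling, µs
	memory := totalBytes / (peakBwTBs * 1e12) * 1e6         // bandwidth ceiling, µs
	if compute > memory {
		return compute
	}
	return memory
}

func main() {
	// Memory-bound decode-style example: few FLOPs, full weight read
	// (H100-like peaks, decode MFU 0.30).
	fmt.Printf("%.0f µs\n", singleCrossoverMicros(1e12, 14e9, 989, 0.30, 3.35))
}
```

Because there is a single crossover, a step is priced by whichever ceiling binds; the removed bandwidth haircut and overhead terms no longer stack on top of it.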

* config: update MFU values to llm-optimizer defaults (0.45/0.30)

MfuPrefill: 0.65 → 0.45, MfuDecode: 0.12 → 0.30 for all GPU entries.
These values match llm-optimizer's defaults which achieve 36.5% ITL MAPE
on the sim-to-real evaluation (discussion #522).

Other HardwareCalib fields (BwEffConstant, overheads) remain unchanged
for backward compatibility — they are no longer used by rooflineStepTime()
but may be consumed by other callers.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: add roofline llm-optimizer port design and implementation plan

Design doc: decision record for porting llm-optimizer's single-crossover
roofline physics into BLIS.
Implementation plan: 3 tasks (physics rewrite, MFU update, verification).

Motivation: discussion #522 sim-to-real accuracy validation.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): load weights once per step in roofline (unified forward pass)

vLLM chunked prefill processes all tokens (prefill + decode) in a single
forward pass — weights are loaded from HBM once per step, not once per
phase. The previous implementation loaded weights independently for
prefill and decode phases, doubling the memory-bound term for mixed
batches (~2x over-prediction).

Sources: vLLM V1 blog ("all selected requests are flattened and
concatenated into one long super-sequence for that single forward pass"),
Sarathi-Serve OSDI'24 ("cost of loading model weights from HBM is
amortized across all prompts in a batch").

Adds TestRooflineStepTime_MixedBatch_WeightsLoadedOnce which verifies
the overhead of adding prefill to a decode step is much less than a
full weight load (7µs vs 4166µs).

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): use 2-matrix MLP in roofline FLOPs and weight calculation

Change MLP factor from 3 (SwiGLU gate+up+down) to 2 (up+down) in both
calculateTransformerFlops and calculateMemoryAccessBytes, matching
llm-optimizer's formulation.

For models like Llama-2-70B where IntermediateDim=28672, the 3-matrix
formula produced 31% more MLP weight bytes than llm-optimizer's
2-matrix formula, directly inflating memory-bound decode predictions.

Applies to both dense and MoE paths (routed + shared expert FLOPs/weights).

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* config: bump MFU values to 0.55/0.35 to reduce roofline over-prediction

MfuPrefill: 0.45 → 0.55 (reduces compute-bound prefill/TTFT predictions ~18%)
MfuDecode: 0.30 → 0.35 (reduces near-crossover decode predictions ~14%)

Motivation: after porting llm-optimizer single-crossover physics, BLIS
roofline still over-predicts by ~50% MAPE. Higher MFU reflects observed
H100 tensor core utilization for large prefill GEMMs and batched decode.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): restore SwiGLU 3-matrix MLP, revert MFU bump

Revert MFU values to llm-optimizer defaults (0.45/0.30) — the bump
to 0.55/0.35 went the wrong direction (both models under-predict).

Restore 3-matrix MLP (gate + up + down) for SwiGLU, replacing the
2-matrix formula copied from llm-optimizer. SwiGLU actually has 3
weight matrices that all need HBM loading: this is the physically
correct formula and increases weight bytes by ~37%, which reduces
the under-prediction from ~50% toward the target.

Dense and MoE paths both updated consistently (R23).

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): conditional SwiGLU detection via HiddenAct field

Add mlpMatrixCount() helper that returns 3 for SwiGLU (silu/swiglu/geglu)
or 2 for standard (gelu/relu) MLP. Parsed from HF config's hidden_act
field. Empty defaults to SwiGLU since most modern LLMs use it.

Both calculateTransformerFlops and calculateMemoryAccessBytes now use
nMat instead of hardcoded 3, correctly handling non-SwiGLU models.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
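
The helper this commit describes can be sketched as below. Note that a later commit in this thread reverts the roofline to a fixed 2-matrix convention; this shows the conditional variant as described here, with the activation-name sets taken from the commit message:

```go
package main

import "fmt"

// mlpMatrixCount returns the number of MLP weight matrices implied by the HF
// config's hidden_act field: gated activations (SwiGLU family) use
// gate+up+down (3); plain activations use up+down (2). Empty defaults to the
// gated convention, since most modern LLMs use SwiGLU.
func mlpMatrixCount(hiddenAct string) int {
	switch hiddenAct {
	case "silu", "swiglu", "geglu", "":
		return 3
	case "gelu", "relu":
		return 2
	default:
		return 3 // unknown activations fall back to the common gated case
	}
}

func main() {
	fmt.Println(mlpMatrixCount("silu"), mlpMatrixCount("gelu"), mlpMatrixCount(""))
}
```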

* fix(latency): revert to 2-matrix MLP convention matching llm-optimizer

3-matrix with raw intermediate_size over-predicts for models like
Llama2-70B whose intermediate_size (28672) exceeds the standard SwiGLU
(2/3 × 4d) convention. Using nMat=2 matches llm-optimizer's approach
where 2 × d × intermediate ≈ physical weight count for most models.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): remove MoE-specific branches from roofline step time

Roofline now treats MoE models identically to dense (matching
llm-optimizer which has no MoE-specific handling). MoE fields
(NumLocalExperts, MoEExpertFFNDim, SharedExpertFFNDim) are still
used by KV capacity (kv_capacity.go) for GPU memory budgeting.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): MoE roofline scales weights by E, FLOPs by top_k

Mixtral was under-predicted by ~10x because the dense treatment loaded
1 expert's MLP weights instead of all 8. Fix:
- Weight bandwidth: E × MLP weights (all experts loaded from HBM per step)
- FLOPs: top_k × MLP FLOPs (only active experts compute per token)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
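
The asymmetry this fix describes, all E experts' weights read per step but only top_k computing per token, can be sketched as follows. The function names and the 2-matrix MLP convention are assumptions for illustration:

```go
package main

import "fmt"

// moeMlpWeightBytes: per step, every expert's MLP weights (2 matrices of
// hiddenDim x expertFFNDim each) are loaded from HBM.
func moeMlpWeightBytes(numExperts, hiddenDim, expertFFNDim, bytesPerParam int64) int64 {
	return numExperts * 2 * hiddenDim * expertFFNDim * bytesPerParam
}

// moeMlpFlopsPerToken: per token, only the top_k routed experts compute
// (2 matrices, 2 FLOPs per multiply-accumulate).
func moeMlpFlopsPerToken(topK, hiddenDim, expertFFNDim int64) int64 {
	return topK * 2 * 2 * hiddenDim * expertFFNDim
}

func main() {
	// Mixtral-8x7B-like shape: E=8, top_k=2, d=4096, e=14336, fp16 weights.
	fmt.Println(moeMlpWeightBytes(8, 4096, 14336, 2))
	fmt.Println(moeMlpFlopsPerToken(2, 4096, 14336))
}
```

The ~10x under-prediction follows directly from this split: treating Mixtral as dense charged 1/8 of the true per-step weight bandwidth.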

* fix(latency): use MoEExpertFFNDim in roofline when set

For DeepSeek-V3 style models where intermediate_size (18432) differs
from per-expert dim (2048), use MoEExpertFFNDim for MoE weight and
FLOP calculations. Falls back to IntermediateDim when unset (Mixtral).

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR #561 review — revert crossmodel scope, fix docs

- Revert crossmodel MoE threshold from > 1 back to > 0 (scope violation:
  crossmodel behavioral change doesn't belong in a roofline PR)
- Fix design doc table and CLI comment claiming roofline models shared
  experts and gate FLOPs (it doesn't — only KV capacity does)
- Fix HiddenAct comments that incorrectly claim it selects 3-matrix vs
  2-matrix MLP (mlpMatrixCount always returns 2)
- Document intentional 2-matrix (roofline) vs 3-matrix (KV capacity)
  design choice with cross-references in both files

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Srinivasan Parthasarathy <[email protected]>

* fix(sim): PR #567 follow-up — validation gaps, LengthCappedRequests counter, INV-9 extension (#580) (#587)

- Add negative MaxModelLen validation in NewSimulator (BC-1: defense-in-depth for struct literal bypass)
- Add LengthCappedRequests metric counter across 5-file pattern (BC-2, BC-3, BC-4)
- Add end-to-end sim.Run() test for BC-5 runtime length cap path
- Extend INV-9 structural test to scan sim/cluster/ control-plane files (BC-6)
- Add negative MaxOutputLen validation in EnqueueRequest (BC-7: R3 gap)
- Add gemma3 model_type exclusion for rope_scaling (BC-9: matches vLLM)
- Add rope_scaling parse-failure warnings for malformed HF configs (BC-8)
- Fix kvFeasibleMax comment accuracy (blockSizeTokens is configurable, not 16)

Fixes #580

Co-authored-by: Claude <[email protected]>

* refactor(sim): remove dead HardwareCalib fields — BwEffConstant, TOverheadMicros, PerLayerOverhead, AllReduceLatency (#596)

These fields became dead code after the roofline physics port (llm-optimizer
single-crossover model). No runtime code path reads them; ValidateRooflineConfig
enforced BwEffConstant > 0 on a value nothing consumed. Removing them eliminates
config-file clutter and prevents future contributors from assuming they're active.

Fixes #590

Co-authored-by: Claude Opus 4.6 <[email protected]>

* Configure claude on  GH Actions (#600)

Signed-off-by: Jing Chen <[email protected]>

* Enable claude on PRs (#601)

Signed-off-by: Jing Chen <[email protected]>

* ignore training and actions runner (#607)

Signed-off-by: Srinivasan Parthasarathy <[email protected]>

* fix(sim): PR #580 deferred items — rope_scaling extraction, MaxModelLen int64, tests, docs (#606)

Complete 7 deferred hardening items from issue #580 (PR #587 handoff):

1. Extract applyRopeScaling as a pure function with 26 table-driven test
   cases covering blacklist (su/longrope/llama3), mrope fall-through,
   gemma3 substring match (handles text_config pivot), yarn original base,
   overflow guards, NaN/Inf defense, degenerate inputs.

2. Change MaxModelLen from int to int64 for consistency with ProgressIndex,
   TotalKVBlocks, BlockSizeTokens. Updates 6 type sites, removes redundant
   int64() casts, adds int64() widening at EnqueueRequest comparison sites.

3. Add cluster-mode MaxModelLen drop test (BC-6): Guard 1a (input >= limit)
   and Guard 1b (input + budget > limit), INV-1 conservation, inFlightRequests
   drain, Metrics.Requests map cleanup.

4. Add chunked prefill + MaxModelLen interaction test (BC-7): verifies no
   spurious force-completion during multi-chunk prefill (TotalOutputTokens=49,
   LengthCappedRequests=0, TTFT recorded).

5. Add glossary entries for MaxModelLen and Oracle Knowledge Boundary (INV-9).

6. Refine rope_scaling documentation with explicit blacklist details.

7. Fix pre-existing gemma3 bug: ParseHFConfig's text_config pivot overwrites
   model_type from "gemma3" to "gemma3_text", making the exact-match check
   dead code. Changed to strings.Contains to match vLLM's substring semantics.

Related to #580. Discovered issues: #602, #603, #604, #605.

Co-authored-by: Claude <[email protected]>

* feat(latency): add trained-roofline backend with roofline basis functions × learned corrections (#616)

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

* feat(sim): populate MaxOutputLen on all workload paths + engine auto-fill (#621)

* feat(sim): add MaxOutputLen auto-fill in EnqueueRequest (BC-1..BC-4)

- Auto-fill MaxOutputLen = maxModelLen - len(InputTokens) when client
  omits budget (MaxOutputLen==0) and maxModelLen > 0
- Mirrors vLLM input_processor.py:554 safety cap
- No auto-fill when client sets budget (BC-2), unlimited mode (BC-3),
  or input exceeds context (BC-4)

Refs: #572

Co-Authored-By: Claude <[email protected]>
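
The enqueue-time rule in BC-1..BC-4 can be sketched as a pure function. The name and signature are illustrative, not the repository's code:

```go
package main

import "fmt"

// autoFillMaxOutputLen derives a missing client budget (0) from the context
// limit: maxModelLen - inputLen. It leaves the value untouched when the
// client set a budget, when running in unlimited mode (maxModelLen == 0),
// or when the input already exceeds the context (the enqueue guard rejects
// that case separately).
func autoFillMaxOutputLen(maxOutputLen, maxModelLen, inputLen int64) int64 {
	if maxOutputLen == 0 && maxModelLen > 0 && inputLen < maxModelLen {
		return maxModelLen - inputLen
	}
	return maxOutputLen
}

func main() {
	fmt.Println(autoFillMaxOutputLen(0, 4096, 1000))   // omitted budget: auto-filled
	fmt.Println(autoFillMaxOutputLen(256, 4096, 1000)) // client budget kept
	fmt.Println(autoFillMaxOutputLen(0, 0, 1000))      // unlimited mode: no fill
}
```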

* feat(workload): set MaxOutputLen on all request construction sites (BC-5..BC-7)

- generator.go: MaxOutputLen = len(outputTokens) (synthetic/multimodal)
- replay.go: MaxOutputLen = len(outputTokens) (trace v2 replay)
- reasoning.go: MaxOutputLen = len(outputTokens) (multi-turn reasoning)
- Matches inference-perf pattern: max_tokens = sampled output length

Fixes #572

Co-Authored-By: Claude <[email protected]>

* docs(sim): update EnqueueRequest doc comment for auto-fill preprocessing

Co-Authored-By: Claude <[email protected]>

* docs(test): update stale MaxOutputLen=0 comments for auto-fill semantics

- Three tests referenced 'input-only check' for MaxOutputLen=0
- After auto-fill, MaxOutputLen is set to maxModelLen - input
- Tests still pass numerically; comments now reflect actual behavior

Co-Authored-By: Claude <[email protected]>

---------

Co-authored-by: Claude <[email protected]>

* docs: switch default example model to public Qwen/Qwen3-14B (#608)

* docs: switch default example model to public qwen/qwen2.5-7b-instruct

Replace gated meta-llama/llama-3.1-8b-instruct with publicly available
qwen/qwen2.5-7b-instruct in all user-facing docs (README, quickstart,
tutorial, guides, reference, CLAUDE.md, CONTRIBUTING.md). Roofline/crossmodel
examples now work without HF authentication.

Set qwen default TP=1 in defaults.yaml so examples use the default without
explicit --tp flags. Update KV block count, coefficient examples, and prose
references to match TP=1 values.

Fixes #545

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* chore(defaults): update vllm version to v0.11.0 for 4 models (H100 TP=1)

Update default and trained-coefficient vllm_version for
qwen2.5-7b-instruct, qwen3-14b, llama-3.1-8b-instruct, and
qwen2.5-3b-instruct to vllm/vllm-openai:v0.11.0.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: switch default example model from qwen2.5-7b to qwen3-14b

Qwen3-14B (Qwen/Qwen3-14B) is a newer, publicly available model with
pre-trained coefficients already in defaults.yaml. Update all
documentation examples and references accordingly.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address review comments — stale refs, tutorial throughput

- Fix "LLaMA 3.1 8B" comment in experimentation.md (issue #3)
- Update stale llama-3.1-8b/132,139 refs in configuration.md (issue #4)
- Recalibrate tutorial for qwen3-14b throughput: ~2.5 req/s per
  instance, target 20 req/s (was 57 req/s / 500 req/s for llama)
- Scale experimentation.md example to match (20 req/s, not 400)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: add HF_TOKEN tip to quickstart and README for gated models

Roofline/trained-roofline/crossmodel modes auto-fetch from HuggingFace,
which fails for gated models without authentication. Add a lightweight
tip after the first roofline example in both files recommending HF_TOKEN
for gated model access and rate limit avoidance.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>

---------

Signed-off-by: Jing Chen <[email protected]>
Signed-off-by: Srinivasan Parthasarathy <[email protected]>
Co-authored-by: Srinivasan Parthasarathy <[email protected]>
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: Dipanwita Guhathakurta <[email protected]>
Co-authored-by: Jing Chen <[email protected]>
sriumcp deleted the trained-roofline-backend branch March 19, 2026 14:24