Merged
Signed-off-by: Srinivasan Parthasarathy <[email protected]>
susiejojo added a commit to susiejojo/inference-sim that referenced this pull request on Mar 11, 2026:
- Fix "LLaMA 3.1 8B" comment in experimentation.md (issue inference-sim#3)
- Update stale llama-3.1-8b/132,139 refs in configuration.md (issue inference-sim#4)
- Recalibrate tutorial for qwen3-14b throughput: ~2.5 req/s per instance, target 20 req/s (was 57 req/s / 500 req/s for llama)
- Scale experimentation.md example to match (20 req/s, not 400)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
sriumcp pushed a commit that referenced this pull request on Mar 11, 2026:
* docs: switch default example model to public qwen/qwen2.5-7b-instruct

  Replace gated meta-llama/llama-3.1-8b-instruct with publicly available qwen/qwen2.5-7b-instruct in all user-facing docs (README, quickstart, tutorial, guides, reference, CLAUDE.md, CONTRIBUTING.md). Roofline/crossmodel examples now work without HF authentication. Set qwen default TP=1 in defaults.yaml so examples use the default without explicit --tp flags. Update KV block count, coefficient examples, and prose references to match TP=1 values. Fixes #545

  Co-Authored-By: Claude Opus 4.6 <[email protected]>

* chore(defaults): update vllm version to v0.11.0 for 4 models (H100 TP=1)

  Update default and trained-coefficient vllm_version for qwen2.5-7b-instruct, qwen3-14b, llama-3.1-8b-instruct, and qwen2.5-3b-instruct to vllm/vllm-openai:v0.11.0.

* docs: switch default example model from qwen2.5-7b to qwen3-14b

  Qwen3-14B (Qwen/Qwen3-14B) is a newer, publicly available model with pre-trained coefficients already in defaults.yaml. Update all documentation examples and references accordingly.

* fix: address review comments — stale refs, tutorial throughput
  - Fix "LLaMA 3.1 8B" comment in experimentation.md (issue #3)
  - Update stale llama-3.1-8b/132,139 refs in configuration.md (issue #4)
  - Recalibrate tutorial for qwen3-14b throughput: ~2.5 req/s per instance, target 20 req/s (was 57 req/s / 500 req/s for llama)
  - Scale experimentation.md example to match (20 req/s, not 400)

* docs: add HF_TOKEN tip to quickstart and README for gated models

  Roofline/trained-roofline/crossmodel modes auto-fetch from HuggingFace, which fails for gated models without authentication. Add a lightweight tip after the first roofline example in both files recommending HF_TOKEN for gated model access and rate-limit avoidance.

---------
Co-authored-by: Claude Opus 4.6 <[email protected]>
namasl added a commit that referenced this pull request on Mar 12, 2026:
* feat(sim): MaxModelLen enforcement and MaxOutputLen budget (#567) (#579)

  Add vLLM-equivalent max_model_len enforcement at three layers:
  1. Startup validation: ceil(MaxModelLen/BlockSize) <= TotalKVBlocks
  2. Enqueue guard: input >= MaxModelLen rejected (matching vLLM serving.py:1542); input + MaxOutputLen > MaxModelLen rejected when the client declares a budget
  3. Runtime stop: force-complete at ProgressIndex >= MaxModelLen (defense-in-depth)

  Key design decisions:
  - Oracle Knowledge Boundary (INV-9): the control plane never reads OutputTokens. It uses MaxOutputLen (client budget) or the input-only check; the runtime stop handles output growth. Verified by behavioral + structural grep tests.
  - Auto-derive from HF max_position_embeddings for roofline/crossmodel backends, with a rope_scaling blacklist (excludes su/longrope/llama3 per vLLM), a yarn special case using original_max_position_embeddings, and KV-feasible capping.
  - Overflow-safe ceiling division in startup validation (R11).
  - R3 validation at the CLI (logrus.Fatalf) and constructor (panic).

  New tests (12): BC-1 through BC-5, BC-7 conservation with drops, boundary tests (input == MaxModelLen, exact fit), R3 constructor panic, INV-9 structural enforcement.

  Partially addresses #529 (reasoning workload livelock) for roofline/crossmodel. Blackbox gap tracked in #578. Closes: #567

  Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

* feat(latency): MoE-aware roofline latency model (#559) (#561)

  * feat(sim): add MoEExpertFFNDim and SharedExpertFFNDim to ModelConfig — two new fields for MoE-aware roofline: per-routed-expert FFN dimension and total shared-expert FFN dimension. Both default to 0 (dense model). Zero-value safe for all existing construction sites (R4 audit: all dense model configs use zero-valued MoE fields). Part of #559.

  * feat(latency): parse MoE per-expert and shared-expert dims from HF config — extends GetModelConfigFromHF to parse moe_intermediate_size, shared_expert_intermediate_size, and n_shared_experts. The expert count resolution chain is extended to include num_routed_experts (DeepSeek-V3). Implements BC-15 through BC-18 from the MoE roofline design.

  * feat(latency): add MoE consistency validation to ValidateRooflineConfig — validates: experts > 0 requires active > 0, active <= total, non-negative MoE dimensions. Catches inconsistent MoE configs at construction time. Implements BC-12, BC-13, BC-14.

  * fix(sim): address convergence review findings (I-1, I-2) — I-1: align SharedExpertFFNDim JSON tag to shared_expert_intermediate_size (matches the HF config field-name convention, consistent with other tags). I-2: add negative NumLocalExperts validation in ValidateRooflineConfig (R3 compliance — all numeric parameters validated).

  * feat(latency): MoE-aware FLOPs, active weight bandwidth, and smoke tests — MoE FLOPs (Task 4): calculateTransformerFlops now computes routed (top_k), shared, and gate MLP FLOPs for MoE models; dense models use the unchanged code path (NumLocalExperts=0 guard). Active weights (Task 5): calculateMemoryAccessBytes uses top_k (active experts) for per-step weight bandwidth, matching vLLM's fused_moe kernel behavior, and includes shared expert and gate weights. Smoke tests (Task 7): Mixtral-8x7B and DeepSeek-V3 step time smoke tests plus a dense regression anchor (TP=1: 12151µs, TP=2: 6820µs). Implements BC-1 through BC-6 and BC-10.

  * fix(latency): use per-expert FFN dim for MoE KV capacity weight estimation — fixes the critical bug where DeepSeek-V3's general intermediate_size (18432) was used as the per-expert dim (should be 2048), overestimating MLP weights by ~9× and returning zero usable KV blocks. Changes:
    - KVCapacityParams gains MoEExpertFFNDim and SharedExpertFFNDim fields
    - NewKVCapacityParams gains 2 new positional args (R4 enforced)
    - computeModelWeightBytes uses the per-expert dim when nonzero, falling back to IntermediateDim (Mixtral convention)
    - ExtractKVCapacityParams propagates the new fields and extends the expert count chain to include num_routed_experts (parity with GetModelConfigFromHF)
    Implements BC-7 (per-expert dim fix), BC-9 (param cross-validation), BC-11 (dense unchanged).

  * fix(latency): convergence review round 2 — R23 parity, documentation, R15
    - I-1: align the expert count resolution threshold between GetModelConfigFromHF and ExtractKVCapacityParams; both now use the > 1 threshold (single-expert models are dense-equivalent). Fixes the R23 code path parity violation.
    - I-2: add precondition comments to calculateTransformerFlops and calculateMemoryAccessBytes documenting the ValidateRooflineConfig requirement.
    - I-3: document SharedExpertFFNDim "total dim" semantics — correct due to SwiGLU linearity (N × (3 × d × e) == 3 × d × (N × e)).
    - I-4: add R15 staleness notes to hardening-validation-cleanup-plan.md and pr2-kv-capacity-auto-calculate-plan.md (NewKVCapacityParams is now 6-arg).
    - I-5: document the active vs. total weight distinction in calculateMemoryAccessBytes to prevent a future R23 regression.

  * fix(latency): align MoE threshold to > 1 across all consumption paths (R23) — the parsing layer already used > 1 (single-expert models are dense-equivalent). Consumption paths (calculateTransformerFlops, calculateMemoryAccessBytes, crossmodel isMoE, ValidateRooflineConfig, computeModelWeightBytes) now use > 1 as well, matching the documented design intent and resolving the R23 code path parity violation. Also fixes a stale doc comment in ExtractKVCapacityParams ("> 0" → "> 1"). Round 3 convergence review fixes.

  * fix(cmd): update stale MoE warning, gofmt alignment, R15 crossmodel plan — round 4 convergence review fixes.
    - cmd/root.go: replace the misleading "assumes dense transformers" warning with an accurate MoE info message (roofline now models per-expert FLOPs)
    - sim/model_hardware_config.go: run gofmt to fix struct field alignment
    - docs/plans/pr472b-crossmodel-backend-plan.md: add an R15 staleness note for the threshold change (> 0 → > 1)

  * refactor(latency): port llm-optimizer single-crossover roofline physics — replace the dual-ceiling model (GEMM + vector ceilings) with a single crossover: step_time = max(total_flops / (peak * MFU), total_bytes / peak_bandwidth). Remove the bandwidth haircut (BwEffConstant is no longer used in step time) and all overhead terms (TOverheadMicros, PerLayerOverhead, AllReduceLatency). Keeps BLIS's superior model-awareness: actual IntermediateDim, SwiGLU 3-matrix MLP, MoE support, FlashAttention-aware memory model. Motivation: BLIS roofline has 215% ITL MAPE vs. llm-optimizer's 36.5%; the dual ceiling + bandwidth haircut + overhead stacking caused ~3x systematic over-prediction for memory-bound decode steps. Design: docs/plans/2026-03-09-roofline-llm-optimizer-port-design.md

  * config: update MFU values to llm-optimizer defaults (0.45/0.30) — MfuPrefill: 0.65 → 0.45, MfuDecode: 0.12 → 0.30 for all GPU entries. These values match llm-optimizer's defaults, which achieve 36.5% ITL MAPE on the sim-to-real evaluation (discussion #522).
  Other HardwareCalib fields (BwEffConstant, overheads) remain unchanged for backward compatibility — they are no longer used by rooflineStepTime() but may be consumed by other callers.

  * docs: add roofline llm-optimizer port design and implementation plan — design doc: decision record for porting llm-optimizer's single-crossover roofline physics into BLIS. Implementation plan: 3 tasks (physics rewrite, MFU update, verification). Motivation: discussion #522 sim-to-real accuracy validation.

  * fix(latency): load weights once per step in roofline (unified forward pass) — vLLM chunked prefill processes all tokens (prefill + decode) in a single forward pass, so weights are loaded from HBM once per step, not once per phase. The previous implementation loaded weights independently for the prefill and decode phases, doubling the memory-bound term for mixed batches (~2x over-prediction). Sources: vLLM V1 blog ("all selected requests are flattened and concatenated into one long super-sequence for that single forward pass"), Sarathi-Serve OSDI'24 ("cost of loading model weights from HBM is amortized across all prompts in a batch"). Adds TestRooflineStepTime_MixedBatch_WeightsLoadedOnce, which verifies that the overhead of adding prefill to a decode step is much less than a full weight load (7µs vs. 4166µs).

  * fix(latency): use 2-matrix MLP in roofline FLOPs and weight calculation — change the MLP factor from 3 (SwiGLU gate+up+down) to 2 (up+down) in both calculateTransformerFlops and calculateMemoryAccessBytes, matching llm-optimizer's formulation. For models like Llama-2-70B, where IntermediateDim=28672, the 3-matrix formula produced 31% more MLP weight bytes than llm-optimizer's 2-matrix formula, directly inflating memory-bound decode predictions. Applies to both dense and MoE paths (routed + shared expert FLOPs/weights).

  * config: bump MFU values to 0.55/0.35 to reduce roofline over-prediction — MfuPrefill: 0.45 → 0.55 (reduces compute-bound prefill/TTFT predictions ~18%); MfuDecode: 0.30 → 0.35 (reduces near-crossover decode predictions ~14%). Motivation: after porting llm-optimizer single-crossover physics, BLIS roofline still over-predicted by ~50% MAPE. Higher MFU reflects observed H100 tensor core utilization for large prefill GEMMs and batched decode.

  * fix(latency): restore SwiGLU 3-matrix MLP, revert MFU bump — revert MFU values to llm-optimizer defaults (0.45/0.30); the bump to 0.55/0.35 went the wrong direction (both models under-predict). Restore the 3-matrix MLP (gate + up + down) for SwiGLU, replacing the 2-matrix formula copied from llm-optimizer. SwiGLU actually has 3 weight matrices that all need HBM loading: this is the physically correct formula and increases weight bytes by ~37%, which reduces the under-prediction from ~50% toward the target. Dense and MoE paths both updated consistently (R23).

  * feat(latency): conditional SwiGLU detection via HiddenAct field — add an mlpMatrixCount() helper that returns 3 for SwiGLU (silu/swiglu/geglu) or 2 for standard (gelu/relu) MLP, parsed from the HF config's hidden_act field. Empty defaults to SwiGLU, since most modern LLMs use it. Both calculateTransformerFlops and calculateMemoryAccessBytes now use nMat instead of a hardcoded 3, correctly handling non-SwiGLU models.

  * fix(latency): revert to 2-matrix MLP convention matching llm-optimizer — 3-matrix with raw intermediate_size over-predicts for models like Llama2-70B, whose intermediate_size (28672) exceeds the standard SwiGLU (2/3 × 4d) convention. Using nMat=2 matches llm-optimizer's approach, where 2 × d × intermediate ≈ the physical weight count for most models.

  * fix(latency): remove MoE-specific branches from roofline step time — roofline now treats MoE models identically to dense (matching llm-optimizer, which has no MoE-specific handling). MoE fields (NumLocalExperts, MoEExpertFFNDim, SharedExpertFFNDim) are still used by KV capacity (kv_capacity.go) for GPU memory budgeting.

  * fix(latency): MoE roofline scales weights by E, FLOPs by top_k — Mixtral was under-predicted by ~10x because the dense treatment loaded 1 expert's MLP weights instead of all 8. Fix: weight bandwidth uses E × MLP weights (all experts loaded from HBM per step); FLOPs use top_k × MLP FLOPs (only active experts compute per token).

  * fix(latency): use MoEExpertFFNDim in roofline when set — for DeepSeek-V3-style models where intermediate_size (18432) differs from the per-expert dim (2048), use MoEExpertFFNDim for MoE weight and FLOP calculations. Falls back to IntermediateDim when unset (Mixtral).

  * fix(latency): address PR #561 review — revert crossmodel scope, fix docs
    - Revert the crossmodel MoE threshold from > 1 back to > 0 (scope violation: a crossmodel behavioral change doesn't belong in a roofline PR)
    - Fix the design doc table and CLI comment claiming roofline models shared experts and gate FLOPs (it doesn't — only KV capacity does)
    - Fix HiddenAct comments that incorrectly claim it selects 3-matrix vs. 2-matrix MLP (mlpMatrixCount always returns 2)
    - Document the intentional 2-matrix (roofline) vs. 3-matrix (KV capacity) design choice with cross-references in both files

  ---------
  Co-authored-by: Claude Opus 4.6 <[email protected]>
  Co-authored-by: Srinivasan Parthasarathy <[email protected]>

* fix(sim): PR #567 follow-up — validation gaps, LengthCappedRequests counter, INV-9 extension (#580) (#587)
  - Add negative MaxModelLen validation in NewSimulator (BC-1: defense-in-depth for struct literal bypass)
  - Add a LengthCappedRequests metric counter across the 5-file pattern (BC-2, BC-3, BC-4)
  - Add an end-to-end sim.Run() test for the BC-5 runtime length cap path
  - Extend the INV-9 structural test to scan sim/cluster/ control-plane files (BC-6)
  - Add negative MaxOutputLen validation in EnqueueRequest (BC-7: R3 gap)
  - Add a gemma3 model_type exclusion for rope_scaling (BC-9: matches vLLM)
  - Add rope_scaling parse-failure warnings for malformed HF configs (BC-8)
  - Fix kvFeasibleMax comment accuracy (blockSizeTokens is configurable, not 16)

  Fixes #580
  Co-authored-by: Claude <[email protected]>

* refactor(sim): remove dead HardwareCalib fields — BwEffConstant, TOverheadMicros, PerLayerOverhead, AllReduceLatency (#596)

  These fields became dead code after the roofline physics port (llm-optimizer single-crossover model). No runtime code path reads them; ValidateRooflineConfig enforced BwEffConstant > 0 on a value nothing consumed. Removing them eliminates config-file clutter and prevents future contributors from assuming they're active. Fixes #590

* Configure claude on GH Actions (#600) — Signed-off-by: Jing Chen <[email protected]>

* Enable claude on PRs (#601) — Signed-off-by: Jing Chen <[email protected]>

* ignore training and actions runner (#607) — Signed-off-by: Srinivasan Parthasarathy <[email protected]>

* fix(sim): PR #580 deferred items — rope_scaling extraction, MaxModelLen int64, tests, docs (#606)

  Complete 7 deferred hardening items from issue #580 (PR #587 handoff):
  1. Extract applyRopeScaling as a pure function with 26 table-driven test cases covering the blacklist (su/longrope/llama3), mrope fall-through, gemma3 substring match (handles the text_config pivot), yarn original base, overflow guards, NaN/Inf defense, and degenerate inputs.
  2. Change MaxModelLen from int to int64 for consistency with ProgressIndex, TotalKVBlocks, BlockSizeTokens. Updates 6 type sites, removes redundant int64() casts, adds int64() widening at EnqueueRequest comparison sites.
  3. Add a cluster-mode MaxModelLen drop test (BC-6): Guard 1a (input >= limit) and Guard 1b (input + budget > limit), INV-1 conservation, inFlightRequests drain, Metrics.Requests map cleanup.
  4. Add a chunked prefill + MaxModelLen interaction test (BC-7): verifies no spurious force-completion during multi-chunk prefill (TotalOutputTokens=49, LengthCappedRequests=0, TTFT recorded).
  5. Add glossary entries for MaxModelLen and Oracle Knowledge Boundary (INV-9).
  6. Refine rope_scaling documentation with explicit blacklist details.
  7. Fix a pre-existing gemma3 bug: ParseHFConfig's text_config pivot overwrites model_type from "gemma3" to "gemma3_text", making the exact-match check dead code. Changed to strings.Contains to match vLLM's substring semantics.

  Related to #580. Discovered issues: #602, #603, #604, #605.
  Co-authored-by: Claude <[email protected]>

* feat(latency): add trained-roofline backend with roofline basis functions × learned corrections (#616)

  * feat(latency): register trained-roofline backend name (BC-1)

  * feat(latency): add PostDecodeFixedOverhead to interface + implement TrainedRooflineLatencyModel (BC-3,6,7,8,9,11,15)
    - Add PostDecodeFixedOverhead() int64 to the LatencyModel interface
    - Existing backends (blackbox, roofline, crossmodel) return 0
    - Simulator recordRequestCompletion adds PostDecodeFixedOverhead to E2E
    - TrainedRooflineLatencyModel: 6 roofline basis functions with learned corrections
    - Zero heap allocations in StepTime (19ns/op, 0 allocs/op)

  * feat(latency): wire trained-roofline factory in NewLatencyModel (BC-2,13,14)
    - Full validation: TP, NumLayers, NumHeads, HiddenDim, IntermediateDim, TFlopsPeak, BwPeakTBs, NumHeads%TP, NumKVHeads%TP, 7 beta coefficients
    - Derives architecture features at construction: headDim, dKV, dFF, kEff
    - Table-driven error tests for all validation paths

  * test(latency): add monotonicity behavioral tests for trained-roofline (BC-4,5)

  * feat(latency): add trained-roofline defaults + CLI loading (BC-10,12)
    - Add a trained_roofline_defaults section to defaults.yaml with 7 betas + 3 alphas
    - Add a TrainedRooflineDefaults struct to cmd/default_config.go
    - CLI handling: 4 sites in cmd/root.go (loading block, zero-coefficients guard, HFConfig parsing, help text)

  * docs: add trained-roofline to latency model documentation
    - CLAUDE.md: "Four modes", file tree, Key Data Flow
    - sim/config.go: Backend field comment
    - sim/latency/latency.go: package doc
    - docs/concepts/core-engine.md: "four latency model backends"
    - docs/concepts/glossary.md: "Four modes" + trained-roofline description
    - Plan committed alongside implementation

  * fix(sim): guard PostDecodeFixedOverhead for zero-output requests + fix ITL contamination
    - PostDecodeFixedOverhead is only applied when len(OutputTokens) > 0
    - RequestITLs computed from itlSum directly (not lat-FirstTokenTime) to avoid contaminating the per-token average ITL with fixed overhead
    - Add a zero-alpha warning for the trained-roofline CLI path
    Caught by code review Step 4.5.

  * docs(guide): comprehensive trained-roofline section in latency models guide
    - Add a trained-roofline section with the formula, alpha model, and accuracy caveats
    - Update the comparison table to 4 backends
    - Update the recommendation: trained-roofline is now the default for new models
    - Update the pluggable architecture to show 4 interface methods
    - Fix cross-model description accuracy

  * fix: convergence Round 1 fixes — extension recipe, slice copy, zero-alloc test, config ref
    - Extension recipe: 3→4 methods, added bundle.go + CLI wiring touch points, added trained-roofline as a 4th example
    - Factory: defensive copy of beta/alpha slices to enforce the "frozen" contract
    - Test: add TestTrainedRoofline_StepTime_ZeroAllocs using testing.AllocsPerRun
    - Configuration reference: add trained-roofline to the --latency-model flag description

  * docs: add trained-roofline to quickstart + document non-blocking overhead pattern
    - Quickstart: add a trained-roofline example (recommended for new models)
    - recordRequestCompletion: document that E2E includes non-blocking PostDecodeFixedOverhead and OutputTokenProcessingTime, explaining why RequestCompletionTimes exceeds the RequestLeftEvent timestamp

  * docs: self-audit — update models.md, roofline.md, tutorial.md for trained-roofline
    All documentation working copies now mention trained-roofline consistently. Source-of-truth map: 12/12 working copies updated.

  Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

* feat(sim): populate MaxOutputLen on all workload paths + engine auto-fill (#621)

  * feat(sim): add MaxOutputLen auto-fill in EnqueueRequest (BC-1..BC-4)
    - Auto-fill MaxOutputLen = maxModelLen - len(InputTokens) when the client omits a budget (MaxOutputLen==0) and maxModelLen > 0
    - Mirrors the vLLM input_processor.py:554 safety cap
    - No auto-fill when the client sets a budget (BC-2), in unlimited mode (BC-3), or when input exceeds the context (BC-4)
    Refs: #572

  * feat(workload): set MaxOutputLen on all request construction sites (BC-5..BC-7)
    - generator.go: MaxOutputLen = len(outputTokens) (synthetic/multimodal)
    - replay.go: MaxOutputLen = len(outputTokens) (trace v2 replay)
    - reasoning.go: MaxOutputLen = len(outputTokens) (multi-turn reasoning)
    - Matches the inference-perf pattern: max_tokens = sampled output length
    Fixes #572

  * docs(sim): update the EnqueueRequest doc comment for auto-fill preprocessing

  * docs(test): update stale MaxOutputLen=0 comments for auto-fill semantics
    - Three tests referenced the 'input-only check' for MaxOutputLen=0
    - After auto-fill, MaxOutputLen is set to maxModelLen - input
    - Tests still pass numerically; comments now reflect actual behavior

  Co-authored-by: Claude <[email protected]>

* docs: switch default example model to public Qwen/Qwen3-14B (#608)

  * docs: switch default example model to public qwen/qwen2.5-7b-instruct — replace gated meta-llama/llama-3.1-8b-instruct with the publicly available qwen/qwen2.5-7b-instruct in all user-facing docs (README, quickstart, tutorial, guides, reference, CLAUDE.md, CONTRIBUTING.md). Roofline/crossmodel examples now work without HF authentication. Set the qwen default TP=1 in defaults.yaml so examples use the default without explicit --tp flags. Update the KV block count, coefficient examples, and prose references to match TP=1 values. Fixes #545

  * chore(defaults): update vllm version to v0.11.0 for 4 models (H100 TP=1) — update the default and trained-coefficient vllm_version for qwen2.5-7b-instruct, qwen3-14b, llama-3.1-8b-instruct, and qwen2.5-3b-instruct to vllm/vllm-openai:v0.11.0.

  * docs: switch default example model from qwen2.5-7b to qwen3-14b — Qwen3-14B (Qwen/Qwen3-14B) is a newer, publicly available model with pre-trained coefficients already in defaults.yaml. Update all documentation examples and references accordingly.

  * fix: address review comments — stale refs, tutorial throughput
    - Fix the "LLaMA 3.1 8B" comment in experimentation.md (issue #3)
    - Update stale llama-3.1-8b/132,139 refs in configuration.md (issue #4)
    - Recalibrate the tutorial for qwen3-14b throughput: ~2.5 req/s per instance, target 20 req/s (was 57 req/s / 500 req/s for llama)
    - Scale the experimentation.md example to match (20 req/s, not 400)

  * docs: add HF_TOKEN tip to quickstart and README for gated models — roofline/trained-roofline/crossmodel modes auto-fetch from HuggingFace, which fails for gated models without authentication. Add a lightweight tip after the first roofline example in both files recommending HF_TOKEN for gated model access and rate-limit avoidance.

---------
Signed-off-by: Jing Chen <[email protected]>
Signed-off-by: Srinivasan Parthasarathy <[email protected]>
Co-authored-by: Srinivasan Parthasarathy <[email protected]>
Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: Dipanwita Guhathakurta <[email protected]>
Co-authored-by: Jing Chen <[email protected]>
susiejojo added a commit that referenced this pull request on Mar 19, 2026:
…prove code quality

- Extract quantization parsing into a parseQuantizationConfig helper (Issue #4)
- Consolidate the duplicated trained-roofline warning into a warnTrainedRooflineQuantization helper (Issue #1)
- Add a validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add a comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.
sriumcp pushed a commit that referenced this pull request on Mar 19, 2026:
…#698)

* feat(latency): decouple quantized weight precision from compute dtype (#443)

Roofline and KV capacity calculations now correctly use quantized weight precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): convergence R2 — early flag validation, backend-specific logs

Move --weight-bytes-per-param validation before backend switch to avoid Fatalf inside the best-effort blackbox block (preserves fall-through contract). Make weight precision log backend-specific: roofline uses it for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt struct/var alignment after WeightBytesPerParam addition

Convergence R3: gofmt -w to re-align ModelConfig struct fields and cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt const/map alignment in kv_capacity files

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: document MFU calibration approximation for quantized models

Add code comment in rooflineStepTime noting that MFU values were calibrated against FP16 measurements. For quantized models (W4A16), reduced weight bandwidth shifts the roofline crossover, producing conservative estimates — safe for capacity planning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close the gap where w8a8 models silently fell back to torch_dtype. Add model name convention detection (w4a16, fp8) as a second-tier fallback. Remove the --weight-bytes-per-param CLI flag — all three issue #443 options are now covered by auto-detection, making the manual override unnecessary. Detection chain: quantization_config → model name → torch_dtype fallback.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go) and add model name fallback to cmd/replay.go roofline path (was missing). Three call sites (root.go blackbox, root.go roofline, replay.go) now share identical fallback + logging logic.
C2: Sort config_groups keys before iteration in compressed-tensors parsing (INV-6 determinism).
I2: Add string-to-int coercion for quantization_config.bits field (handles "bits": "4" from some HF configs).
I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode end-to-end test verifying quantized model produces lower decode step time than FP16 in memory-bound regime.
I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError covering +Inf validation gap.
Minor: Remove double blank line in root.go init().

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and recommended blackbox mode, contradicting the three-tier auto-detection merged in this branch. Addresses PR #698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training pipeline, so WeightBytesPerParam only affects KV capacity, not step time. Previously the CLI logged "quantized model detected" without mentioning this limitation, which was misleading. Now emits an explicit warning in both run and replay paths. Addresses review item I4 from PR #698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address cosmetic review comments

- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

* Address PR #698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
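The three-tier detection chain these commits describe (quantization_config → model name → torch_dtype fallback) can be sketched in a few lines of Go. Names and signatures below are illustrative, not the actual inference-sim API — the real logic lives in helpers like parseQuantizationConfig and applyWeightPrecisionFallback — but the numeric conventions (4 bits → 0.5 bytes/param, bfloat16 → 2.0 bytes/param) follow the commit messages.

```go
package main

import (
	"fmt"
	"strings"
)

const bitsPerByte = 8.0

// detectWeightBytesPerParam is a hypothetical sketch of the detection chain:
// explicit quantization_config bits first, model-name conventions second,
// compute dtype (torch_dtype) as the final fallback.
func detectWeightBytesPerParam(quantBits int, modelName string, torchDtypeBytes float64) float64 {
	// Tier 1: explicit bits from quantization_config (GPTQ/AWQ "bits": 4, FP8, etc.).
	if quantBits > 0 {
		return float64(quantBits) / bitsPerByte
	}
	// Tier 2: model-name conventions such as "w4a16" or "fp8".
	name := strings.ToLower(modelName)
	switch {
	case strings.Contains(name, "w4a16"):
		return 0.5
	case strings.Contains(name, "fp8"):
		return 1.0
	}
	// Tier 3: fall back to the compute dtype (e.g. 2.0 for bfloat16).
	return torchDtypeBytes
}

func main() {
	fmt.Println(detectWeightBytesPerParam(4, "", 2.0))              // explicit bits win
	fmt.Println(detectWeightBytesPerParam(0, "llama-w4a16", 2.0))   // name convention
	fmt.Println(detectWeightBytesPerParam(0, "plain-model", 2.0))   // dtype fallback
}
```

Note how the zero value of quantBits acts as the "not detected" sentinel, mirroring the zero-value sentinel the commits describe for WeightBytesPerParam.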
sriumcp
pushed a commit
that referenced
this pull request
Mar 20, 2026
* feat(latency): decouple quantized weight precision from compute dtype (#443)

Roofline and KV capacity calculations now correctly use quantized weight precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): convergence R2 — early flag validation, backend-specific logs

Move --weight-bytes-per-param validation before backend switch to avoid Fatalf inside the best-effort blackbox block (preserves fall-through contract). Make weight precision log backend-specific: roofline uses it for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt struct/var alignment after WeightBytesPerParam addition

Convergence R3: gofmt -w to re-align ModelConfig struct fields and cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt const/map alignment in kv_capacity files

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: document MFU calibration approximation for quantized models

Add code comment in rooflineStepTime noting that MFU values were calibrated against FP16 measurements. For quantized models (W4A16), reduced weight bandwidth shifts the roofline crossover, producing conservative estimates — safe for capacity planning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close the gap where w8a8 models silently fell back to torch_dtype. Add model name convention detection (w4a16, fp8) as a second-tier fallback. Remove the --weight-bytes-per-param CLI flag — all three issue #443 options are now covered by auto-detection, making the manual override unnecessary. Detection chain: quantization_config → model name → torch_dtype fallback.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go) and add model name fallback to cmd/replay.go roofline path (was missing). Three call sites (root.go blackbox, root.go roofline, replay.go) now share identical fallback + logging logic.
C2: Sort config_groups keys before iteration in compressed-tensors parsing (INV-6 determinism).
I2: Add string-to-int coercion for quantization_config.bits field (handles "bits": "4" from some HF configs).
I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode end-to-end test verifying quantized model produces lower decode step time than FP16 in memory-bound regime.
I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError covering +Inf validation gap.
Minor: Remove double blank line in root.go init().

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and recommended blackbox mode, contradicting the three-tier auto-detection merged in this branch. Addresses PR #698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training pipeline, so WeightBytesPerParam only affects KV capacity, not step time. Previously the CLI logged "quantized model detected" without mentioning this limitation, which was misleading. Now emits an explicit warning in both run and replay paths. Addresses review item I4 from PR #698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address cosmetic review comments

- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

* Address PR #698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

* feat(sim): add L40S GPU and FP8 compute support

- Add L40S hardware configuration (362.05 TFLOPS, 48GB, 0.864 TB/s)
- Add TFlopsFP8 field to HardwareCalib for native FP8 tensor core support
- Update H100 with TFlopsFP8=1979.0 (2× FP16 rate) and adjusted MFU values
- Update A100-SXM and A100-80 with TFlopsFP8=0 (no native FP8 support)
- Implement FP8 compute selection in roofline model based on weight precision
- Add comprehensive tests for FP8 compute selection logic

Fixes #762

Co-Authored-By: Claude <[email protected]>

* docs: add MFU justification and validation tests

- Add inline documentation to hardware_config.json linking to Discussion #589
- Add comprehensive MFU validation tests in sim/config_test.go
- Validates MFU ranges (0 < MFU < 1)
- Validates MfuDecode < MfuPrefill relationship
- Tests all GPU types (H100, A100, L40S)
- Update docs/reference/models.md with MFU calibration info box
- All tests pass

Addresses review findings from quick-review

* feat(latency): add TFlopsFP8 validation (R3)

Add validation for HardwareCalib.TFlopsFP8 in ValidateRooflineConfig, following the same optional-field pattern as MemoryGiB: check if non-zero, then reject NaN/Inf/negative values. Includes test coverage for NaN and negative TFlopsFP8 cases.

* fix: address review nits

- roofline.go:217: Fix comment wording from 'upcasted' to 'dequantized to FP16 during GEMM'
- hardware_config.json:8: Restore H100 mfuDecode to 0.30 for consistency with other entries

* docs(roofline): clarify FP8 compute rate selection logic

Improve comment to explicitly document that == 1.0 is intentional:
- FP8 models (exactly 1.0 byte/param) use FP8 compute rate on H100
- Sub-FP8 formats (e.g., W4A16 at 0.5 bytes/param) dequantize to FP16 during GEMM

This addresses the review comment about == 1.0 exact equality. The behavior is correct: only true FP8 models use FP8 tensor cores. W4A16 and other sub-FP8 formats use FP16 compute after dequantization, as validated by TestRooflineStepTime_FP8ComputeSelection_EdgeCases.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
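The FP8 compute-rate selection rule documented in the last commit (exact == 1.0 equality, plus a non-zero TFlopsFP8) can be illustrated with a minimal Go sketch. The function and parameter names are hypothetical, and the 989.5 FP16 TFLOPS figure is only inferred from the commit's statement that H100's TFlopsFP8=1979.0 is "2× FP16 rate".

```go
package main

import "fmt"

// computeTFlops sketches the selection rule: only a model whose weights are
// exactly 1.0 byte/param (true FP8) uses the FP8 tensor-core rate, and only
// when the GPU advertises one (tflopsFP8 > 0, the optional-field sentinel).
// Sub-FP8 formats like W4A16 (0.5 bytes/param) dequantize to FP16 during
// GEMM, so they keep the FP16 rate.
func computeTFlops(weightBytesPerParam, tflopsFP16, tflopsFP8 float64) float64 {
	if weightBytesPerParam == 1.0 && tflopsFP8 > 0 {
		return tflopsFP8
	}
	return tflopsFP16
}

func main() {
	// Illustrative H100-like numbers (FP8 rate assumed 2x the FP16 rate).
	fmt.Println(computeTFlops(1.0, 989.5, 1979.0)) // FP8 model, native FP8 GPU
	fmt.Println(computeTFlops(0.5, 989.5, 1979.0)) // W4A16: FP16 rate after dequantization
	fmt.Println(computeTFlops(1.0, 989.5, 0))      // FP8 model on a GPU without native FP8
}
```

The zero value of tflopsFP8 doubles as the "no native FP8" marker (as in A100 entries), so no separate capability flag is needed.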
sriumcp
added a commit
that referenced
this pull request
Mar 20, 2026
* feat(infra): Phase 1A — nodes, GPUs, and instance lifecycle

Add node/GPU placement, instance lifecycle management, multi-model routing, and per-model metrics to the cluster simulator.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* chore: add AGENTS.md to .gitignore

- AGENTS.md is a generated file for agent context
- Should not be committed to the repository

* fix(cluster): prevent double-counting in drain redirect policy

Fixes INV-1 conservation violation when DrainRedirect policy re-injects queued requests. Previously, redirected requests were counted twice in CompletedRequests: once when initially injected and again when completed after redirection.

Changes:
- Add Request.Redirected field to track re-injected requests
- Mark requests as redirected in drainRedirect.Drain() before re-injection
- Skip CompletedRequests increment in recordRequestCompletion() for redirected requests
- Add TestInstanceLifecycle_RedirectDrainPreservesConservation to verify fix

The fix ensures INV-1 (request conservation) holds: injected = completed + queued + running + dropped + timed_out

Addresses PR #697 review feedback on drain redirect conservation.

* fix: address PR #697 review comments

- Document DrainPolicy as Phase 1C infrastructure
- Fix warm-up request overcounting by checking warmUpRemaining
- Fix drain callback memory leak in MarkNodeTerminated
- Add named constants for lifecycle event priorities
- Clarify drainWait GPU release and swap-remove logic
- Add .bob/ to .gitignore

* Fix warm-up TTFT penalty implementation

- Initialize warmUpRemaining for all instances in backward-compat mode
- Fix indentation in cluster_event.go
- Update warm-up recording to track first N requests
- Adjust test expectations to account for queueing effects

Fixes build and test failures in PR.

* chore: update .gitignore to ignore .bob/notes instead of entire .bob/ directory

* feat(latency): decouple quantized weight precision from compute dtype (#698)

* feat(latency): decouple quantized weight precision from compute dtype (#443)

Roofline and KV capacity calculations now correctly use quantized weight precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): convergence R2 — early flag validation, backend-specific logs

Move --weight-bytes-per-param validation before backend switch to avoid Fatalf inside the best-effort blackbox block (preserves fall-through contract). Make weight precision log backend-specific: roofline uses it for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt struct/var alignment after WeightBytesPerParam addition

Convergence R3: gofmt -w to re-align ModelConfig struct fields and cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt const/map alignment in kv_capacity files

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: document MFU calibration approximation for quantized models

Add code comment in rooflineStepTime noting that MFU values were calibrated against FP16 measurements. For quantized models (W4A16), reduced weight bandwidth shifts the roofline crossover, producing conservative estimates — safe for capacity planning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close the gap where w8a8 models silently fell back to torch_dtype. Add model name convention detection (w4a16, fp8) as a second-tier fallback. Remove the --weight-bytes-per-param CLI flag — all three issue #443 options are now covered by auto-detection, making the manual override unnecessary. Detection chain: quantization_config → model name → torch_dtype fallback.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go) and add model name fallback to cmd/replay.go roofline path (was missing). Three call sites (root.go blackbox, root.go roofline, replay.go) now share identical fallback + logging logic.
C2: Sort config_groups keys before iteration in compressed-tensors parsing (INV-6 determinism).
I2: Add string-to-int coercion for quantization_config.bits field (handles "bits": "4" from some HF configs).
I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode end-to-end test verifying quantized model produces lower decode step time than FP16 in memory-bound regime.
I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError covering +Inf validation gap.
Minor: Remove double blank line in root.go init().

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and recommended blackbox mode, contradicting the three-tier auto-detection merged in this branch. Addresses PR #698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training pipeline, so WeightBytesPerParam only affects KV capacity, not step time. Previously the CLI logged "quantized model detected" without mentioning this limitation, which was misleading. Now emits an explicit warning in both run and replay paths. Addresses review item I4 from PR #698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address cosmetic review comments

- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

* Address PR #698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>

* feat(sim): add L40S GPU and FP8 compute support (#765)

* feat(latency): decouple quantized weight precision from compute dtype (#443)

Roofline and KV capacity calculations now correctly use quantized weight precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): convergence R2 — early flag validation, backend-specific logs

Move --weight-bytes-per-param validation before backend switch to avoid Fatalf inside the best-effort blackbox block (preserves fall-through contract). Make weight precision log backend-specific: roofline uses it for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt struct/var alignment after WeightBytesPerParam addition

Convergence R3: gofmt -w to re-align ModelConfig struct fields and cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt const/map alignment in kv_capacity files

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: document MFU calibration approximation for quantized models

Add code comment in rooflineStepTime noting that MFU values were calibrated against FP16 measurements. For quantized models (W4A16), reduced weight bandwidth shifts the roofline crossover, producing conservative estimates — safe for capacity planning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close the gap where w8a8 models silently fell back to torch_dtype. Add model name convention detection (w4a16, fp8) as a second-tier fallback. Remove the --weight-bytes-per-param CLI flag — all three issue #443 options are now covered by auto-detection, making the manual override unnecessary. Detection chain: quantization_config → model name → torch_dtype fallback.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go) and add model name fallback to cmd/replay.go roofline path (was missing). Three call sites (root.go blackbox, root.go roofline, replay.go) now share identical fallback + logging logic.
C2: Sort config_groups keys before iteration in compressed-tensors parsing (INV-6 determinism).
I2: Add string-to-int coercion for quantization_config.bits field (handles "bits": "4" from some HF configs).
I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode end-to-end test verifying quantized model produces lower decode step time than FP16 in memory-bound regime.
I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError covering +Inf validation gap.
Minor: Remove double blank line in root.go init().

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and recommended blackbox mode, contradicting the three-tier auto-detection merged in this branch. Addresses PR #698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training pipeline, so WeightBytesPerParam only affects KV capacity, not step time. Previously the CLI logged "quantized model detected" without mentioning this limitation, which was misleading. Now emits an explicit warning in both run and replay paths. Addresses review item I4 from PR #698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address cosmetic review comments

- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

* Address PR #698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

* feat(sim): add L40S GPU and FP8 compute support

- Add L40S hardware configuration (362.05 TFLOPS, 48GB, 0.864 TB/s)
- Add TFlopsFP8 field to HardwareCalib for native FP8 tensor core support
- Update H100 with TFlopsFP8=1979.0 (2× FP16 rate) and adjusted MFU values
- Update A100-SXM and A100-80 with TFlopsFP8=0 (no native FP8 support)
- Implement FP8 compute selection in roofline model based on weight precision
- Add comprehensive tests for FP8 compute selection logic

Fixes #762

Co-Authored-By: Claude <[email protected]>

* docs: add MFU justification and validation tests

- Add inline documentation to hardware_config.json linking to Discussion #589
- Add comprehensive MFU validation tests in sim/config_test.go
- Validates MFU ranges (0 < MFU < 1)
- Validates MfuDecode < MfuPrefill relationship
- Tests all GPU types (H100, A100, L40S)
- Update docs/reference/models.md with MFU calibration info box
- All tests pass

Addresses review findings from quick-review

* feat(latency): add TFlopsFP8 validation (R3)

Add validation for HardwareCalib.TFlopsFP8 in ValidateRooflineConfig, following the same optional-field pattern as MemoryGiB: check if non-zero, then reject NaN/Inf/negative values. Includes test coverage for NaN and negative TFlopsFP8 cases.

* fix: address review nits

- roofline.go:217: Fix comment wording from 'upcasted' to 'dequantized to FP16 during GEMM'
- hardware_config.json:8: Restore H100 mfuDecode to 0.30 for consistency with other entries

* docs(roofline): clarify FP8 compute rate selection logic

Improve comment to explicitly document that == 1.0 is intentional:
- FP8 models (exactly 1.0 byte/param) use FP8 compute rate on H100
- Sub-FP8 formats (e.g., W4A16 at 0.5 bytes/param) dequantize to FP16 during GEMM

This addresses the review comment about == 1.0 exact equality. The behavior is correct: only true FP8 models use FP8 tensor cores. W4A16 and other sub-FP8 formats use FP16 compute after dequantization, as validated by TestRooflineStepTime_FP8ComputeSelection_EdgeCases.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>

* fix(cluster): address sriumcp review — conservation, replay anomaly, non-variadic CollectRawMetrics

- sim/simulator.go: Remove incorrect `if !req.Redirected` guard on CompletedRequests++. The guard caused redirected requests to vanish from INV-1 accounting: source's InjectedRequests=0 (drained from WaitQ before completion), destination's InjectedRequests=0 (skipped CompletedRequests). Destination is the sole completion site so incrementing there preserves conservation.
- cmd/replay.go: Add `|| rawMetrics.RoutingRejections > 0` to anomaly condition. Clusters where all failures are routing rejections (no routable instances) silently omitted the anomaly summary block (I3 from sriumcp review).
- sim/cluster/metrics.go: Make CollectRawMetrics routingRejections parameter explicit (non-variadic). Prevents call sites from silently passing 0 and missing routing rejections. Updated all test call sites to pass 0 explicitly.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* revert(ci): remove skip-cache from golangci-lint step

Unrelated to Phase 1A changes. skip-cache: true was added during development but should not be merged to main.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* fix(cluster): address sriumcp Round 2 — test, stale comment, duplicate Metrics.Requests

- sim/cluster/instance_lifecycle_test.go: Rewrite conservation test to actually seed requests into inst0's WaitQ before drain. Previous version used a manual event loop on an empty queue — DrainWaitQueue() returned [] and no redirection ever occurred. New version uses inst0.sim.EnqueueRequest directly + empty workload so Run() doesn't push duplicate arrivals. Also adds pre/post assertions: QueueDepth==0, inFlightRequests==0, and clusterEvents non-empty after drain.
- sim/request.go: Update Redirected field comment to reflect actual behavior. Previous comment said "completion accounting is skipped" — opposite of what simulator.go:recordRequestCompletion now does.
- sim/cluster/infra_lifecycle_event.go: Delete stale Metrics.Requests entry for redirected requests before re-injection. Source registered the request at EnqueueRequest time; DrainWaitQueue empties WaitQ but left the entry. Destination re-registers on re-enqueue, causing a spurious "duplicate request ID" WARN in aggregateMetrics() for every redirected request.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* docs: address sriumcp documentation gaps (GAP 1-4)

GAP 1 — configuration.md: Add node_pools and instance_lifecycle YAML schema to the Policy Bundle section so users can discover and configure Phase 1A features. Both are YAML-only (no CLI flags). Add a note block explaining backward compatibility. Update DeploymentConfig row in the config-to-flag mapping table to note the YAML-only fields.

GAP 2 — results.md Anomaly Counters: Rename "Rejected Requests" to "Rejected Requests (Admission)" to match actual CLI output label. Add new "Rejected Requests (Routing)" row explaining when it fires (no routable instances — all Loading/Draining) and the remediation action.

GAP 3 — results.md Per-Model Metrics: Change mean= to p50= in the example output block to match printPerModelMetrics which uses m.TTFT.P50. Add tok/s to the Throughput example line to match actual output format.

GAP 4 — results.md per_model JSON: Add table documenting the per_model key in --results-path JSON output (omitted when no model tags present), with field-by-field description.

---------

Co-authored-by: tantawi <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Dipanwita Guhathakurta <[email protected]>
Co-authored-by: Srinivasan Parthasarathy <[email protected]>
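The INV-1 conservation equation these drain-redirect commits keep returning to (injected = completed + queued + running + dropped + timed_out) reduces to a one-line check. The struct below is an illustrative stand-in for the simulator's real metrics types, not the inference-sim API.

```go
package main

import "fmt"

// rawMetrics is a hypothetical flattened view of per-run request counters.
type rawMetrics struct {
	Injected, Completed, Queued, Running, Dropped, TimedOut int
}

// checkINV1 verifies request conservation: every injected request must be
// accounted for exactly once across the terminal and in-flight buckets.
func checkINV1(m rawMetrics) bool {
	return m.Injected == m.Completed+m.Queued+m.Running+m.Dropped+m.TimedOut
}

func main() {
	// Double-counting one redirected request (101 completions for 100
	// injections) breaks the equation; the fix above counts completion
	// only at the destination instance, the sole completion site.
	fmt.Println(checkINV1(rawMetrics{Injected: 100, Completed: 101}))
	fmt.Println(checkINV1(rawMetrics{Injected: 100, Completed: 97, Queued: 2, Dropped: 1}))
}
```

Both the drain-redirect bug and the later PD disaggregation double-count are violations of exactly this equation, which is why the tests assert it directly.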
namasl
added a commit
to namasl/inference-sim
that referenced
this pull request
Mar 23, 2026
…pagation, graceful rejection

Critical #1: buildPoolFilteredSnapshots now filters by IsRoutable() for parity with buildRouterState (R23). Prevents routing to Loading/Draining/Terminated instances when PD disaggregation combines with Phase 1A lifecycle.

Critical #2: Propagate Deadline from original request to both prefill and decode sub-requests. EnqueueDecodeSubRequest now schedules TimeoutEvent when Deadline is set (R23: parity with EnqueueRequest). Also copy MaxOutputLen and PrefixGroup to prefill sub-request for field completeness.

Important inference-sim#3: Log warning when pdInTransfer is negative, mirroring the existing inFlightRequests negative-check pattern. Surfaces bookkeeping bugs instead of silently swallowing conservation gaps.

Important inference-sim#4: Replace empty-pool panics in PrefillRoutingEvent and DecodeRoutingEvent with graceful rejection (warn + routingRejections++ or droppedAtDecodeKV++). Now that IsRoutable() filtering can produce empty pools at runtime, these are no longer pure programming errors.

Important inference-sim#7: Qualify INV-PD-3 statement for bounded horizons — initiated can exceed completed when transfers are in-flight at horizon.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
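The IsRoutable() filtering plus graceful-rejection pattern described in this commit can be sketched as follows. All types and names here are hypothetical stand-ins for the inference-sim implementation; the point is the shape of the fix: filter the pool by lifecycle state, and treat an empty result as a counted rejection rather than a panic.

```go
package main

import "fmt"

// instance is an illustrative stand-in for a simulated serving instance.
type instance struct {
	id    string
	state string // "Running", "Loading", "Draining", or "Terminated"
}

// IsRoutable reports whether the instance may receive new requests.
func (i instance) IsRoutable() bool { return i.state == "Running" }

// routePrefill filters the pool to routable instances and picks one.
// An empty filtered pool increments the rejection counter and returns
// gracefully instead of panicking, since lifecycle filtering makes
// empty pools a legitimate runtime condition.
func routePrefill(pool []instance, routingRejections *int) (string, bool) {
	var routable []instance
	for _, inst := range pool {
		if inst.IsRoutable() {
			routable = append(routable, inst)
		}
	}
	if len(routable) == 0 {
		*routingRejections++ // graceful rejection, not a panic
		return "", false
	}
	return routable[0].id, true // trivial pick; real code would score instances
}

func main() {
	rejections := 0
	_, ok := routePrefill([]instance{{"p0", "Draining"}, {"p1", "Loading"}}, &rejections)
	fmt.Println(ok, rejections)
}
```

A scorer-based selection would replace the trivial first-element pick, but the filter-then-reject-or-route structure is the part the review findings are about.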
sriumcp
pushed a commit
that referenced
this pull request
Mar 23, 2026
…st flow (PR2) (#805) * feat(sim/cluster): PD disaggregation — end-to-end disaggregated request flow (PR2) Working PD pipeline. After this PR, users can run: blis run --prefill-instances 2 --decode-instances 2 --pd-decider always Requests flow through the full prefill → KV transfer → decode lifecycle. When pools are not configured (default), behavior is identical to today. New files: - sim/cluster/pd_events.go: PrefillRoutingEvent (priority 4), KVTransferStartedEvent (5), KVTransferCompletedEvent (6), DecodeRoutingEvent (7). Simple bandwidth calculation. - sim/cluster/disaggregation_test.go: Integration tests — always-disaggregate E2E, pool conservation, prefill-to-decode lifecycle, phase causality, transfer conservation, determinism, backward compatibility, per-pool scorers. - examples/pd-disaggregation-demo.yaml: Annotated demo configuration. Modified files: - sim/cluster/deployment.go: PD config fields (transfer bandwidth, base latency, KV bytes per token, per-pool scorer configs). - sim/cluster/cluster.go: PD state (poolMembership, disaggregationDecider, parentRequests, pendingPrefillCompletions, pendingDecodeCompletions, transfer counters, per-pool routing policies). Updated constructor and Run() with prefill/decode completion detection. - sim/cluster/cluster_event.go: DisaggregationDecisionEvent (priority 3). Bifurcated AdmissionDecisionEvent for pool-configured clusters. - sim/cluster/instance.go: AllocateTransferredKV(), InjectDecodeOnline(). - sim/simulator.go: EnqueueDecodeSubRequest() — bypasses guards for pre-allocated KV, work-conserving (INV-8). - sim/batch_formation.go: Decode-only batch path in FormBatch() Phase 2. - cmd/root.go: CLI flags for PD configuration. - docs/contributing/standards/invariants.md: INV-PD-1 through INV-PD-4. Part of #793. Depends on PR1 (#794, merged). Closes #795. 
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(lint): remove redundant nil check before len() on map

staticcheck S1009: len() for nil maps is defined as zero.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(sim/cluster): address PR review — INV-1 conservation, R8, dead code, invariant doc

- C1 (Critical): Fix CompletedRequests double-counting in disaggregated mode. Each disaggregated request produces two sub-requests (prefill + decode) that complete on separate instances. aggregateMetrics() naively sums completions, yielding 2N for N requests. Track pdPrefillCompletedCount and subtract after aggregation to restore the correct user-visible completion count.
- C2 (Critical): Add TestDisaggregation_INV1Conservation to verify CompletedRequests == N and the full conservation equation in disaggregated mode.
- I1 (Important): ParentRequests() now returns a defensive copy of the map (R8: no exported mutable maps), matching the PoolMembership() pattern.
- I2 (Important): Replace a structurally dead INV-PD-1 defensive check with a comment documenting the structural guarantee. DecodeEnqueueTime and TransferCompleteTime are both set from the same event timestamp by construction (KVTransferCompletedEvent schedules DecodeRoutingEvent at e.time), so the inequality can never fire.
- I3 (Important): Add INV-PD-5 (Pool Stability) to invariants.md — it was referenced in a test but absent from the canonical invariant document.
- I4 (Minor): Remove the unused eventTime parameter from InjectDecodeOnline.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(sim/cluster): address automated review — 5 issues (conservation, R8, INV-6, tests, MaxOutputLen)

1. (Critical 93) INV-1 at bounded horizon: track pdDecodeCompletedCount and compute pdInFlight = prefillCompleted - decodeCompleted - droppedAtDecodeKV. Add pdInFlight to StillRunning so requests mid-transfer aren't lost from conservation accounting. Added TestDisaggregation_INV1Conservation_BoundedHorizon.
2.
(Critical 92) R8 deep copy: ParentRequests() now copies each ParentRequest struct (`cp := *v`) so callers cannot mutate internal lifecycle timestamps.
3. (Important 85) R2/INV-6 determinism: detectPrefillCompletions and detectDecodeCompletions now collect completed IDs into a sorted slice before processing, ensuring deterministic nextSeqID() assignment regardless of Go's random map iteration order.
4. (Important 82) Decode-only batch KV pressure test: TestDisaggregation_DecodeOnlyBatchKVPressure exercises the decode-only batch path under tight KV cache (50 blocks), verifying INV-1 conservation holds when decode-only requests cannot allocate KV.
5. (Important 80) MaxOutputLen parity: the decode sub-request now copies MaxOutputLen from the original request, maintaining R23 parallel code path transformation parity with EnqueueRequest.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(sim/cluster): address re-review — pdInTransfer double-count, decode KV drop test, R8 comment

- F1 (Critical): Fix pdInFlight double-counting at bounded horizon. When a decode sub-request has been injected into an instance but not yet completed, it appears in the instance's StillQueued/StillRunning via Finalize(). The old formula (pdPrefillCompleted - pdDecodeCompleted - droppedAtDecodeKV) counted these requests as in-transfer, adding them to StillRunning a second time. Fix: subtract len(pendingDecodeCompletions), which tracks decode sub-requests already on instances but not yet completed.
- F2 (Important): Update the ParentRequests() comment to accurately note that OriginalRequest is a shared pointer. Callers must not mutate via it.
- F3 (Important): Add TestDisaggregation_DroppedAtDecodeKV that actually triggers the droppedAtDecodeKV path. Uses 1 decode instance with 3 KV blocks (48 tokens) and 20-token requests — the second concurrent transfer fails AllocateTransferredKV. Verifies droppedAtDecodeKV > 0 and INV-1 conservation holds.
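The F1 accounting fix reduces to simple arithmetic. A minimal sketch, with illustrative names (the real counters live on the cluster struct and are not named exactly this way):

```go
package main

import "fmt"

// pdInTransfer sketches the corrected bounded-horizon formula: requests
// that finished prefill but are not yet visible anywhere else at the
// horizon. Decode sub-requests already injected into an instance
// (pendingDecode) show up in that instance's own StillQueued/StillRunning
// via Finalize(), so they must be subtracted to avoid double-counting.
func pdInTransfer(prefillCompleted, decodeCompleted, droppedAtDecodeKV, pendingDecode int) int {
	n := prefillCompleted - decodeCompleted - droppedAtDecodeKV - pendingDecode
	if n < 0 {
		// Mirrors the negative-check pattern: a negative value means a
		// conservation bookkeeping bug and should be surfaced, not hidden.
		fmt.Println("warning: negative pdInTransfer")
	}
	return n
}

func main() {
	// 10 prefills done; 4 fully decoded, 1 dropped at decode KV,
	// 3 already sitting on decode instances: 2 remain in transfer.
	fmt.Println(pdInTransfer(10, 4, 1, 3))
}
```

Folding this value into StillRunning is what keeps the bounded-horizon conservation equation balanced when the simulation halts mid-transfer.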
- F5 (Minor): Replace the hardcoded count 5 with len(requests) in TestDisaggregation_TransferConservation.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(sim/cluster): address PR review — IsRoutable filter, Deadline propagation, graceful rejection

- Critical #1: buildPoolFilteredSnapshots now filters by IsRoutable() for parity with buildRouterState (R23). Prevents routing to Loading/Draining/Terminated instances when PD disaggregation combines with Phase 1A lifecycle.
- Critical #2: Propagate Deadline from the original request to both prefill and decode sub-requests. EnqueueDecodeSubRequest now schedules TimeoutEvent when Deadline is set (R23: parity with EnqueueRequest). Also copy MaxOutputLen and PrefixGroup to the prefill sub-request for field completeness.
- Important #3: Log a warning when pdInTransfer is negative, mirroring the existing inFlightRequests negative-check pattern. Surfaces bookkeeping bugs instead of silently swallowing conservation gaps.
- Important #4: Replace empty-pool panics in PrefillRoutingEvent and DecodeRoutingEvent with graceful rejection (warn + routingRejections++ or droppedAtDecodeKV++). Now that IsRoutable() filtering can produce empty pools at runtime, these are no longer pure programming errors.
- Important #7: Qualify the INV-PD-3 statement for bounded horizons — initiated can exceed completed when transfers are in-flight at the horizon.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(sim/cluster): address re-review NA-1/NA-2 — model filter comment, conservation equation

- NA-1: Update the buildPoolFilteredSnapshots comment to document that the model filter is intentionally omitted (all instances in DeploymentConfig share config.Model). Notes where to add it if multi-model PD clusters are introduced.
- NA-2: Fix INV-1 conservation assertions to include TimedOutRequests per the canonical INV-1 definition. Extract an assertINV1Conservation helper used by all 5 conservation tests.
Document in the pdInTransfer comment that timed-out prefill sub-requests are already counted in instance TimedOutRequests and need no correction.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
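The assertINV1Conservation helper itself lives in the test file; a hedged reading of the equation it checks (the canonical terms are defined in invariants.md; this form is an illustration, not a copy of the real helper) is:

```go
package main

import "fmt"

// conservationHolds illustrates an INV-1-style check: every injected
// request must be accounted for exactly once, whether completed,
// dropped, timed out, or still running at the horizon (with in-transfer
// PD requests folded into stillRunning, per the fixes above).
func conservationHolds(injected, completed, dropped, timedOut, stillRunning int) bool {
	return completed+dropped+timedOut+stillRunning == injected
}

func main() {
	fmt.Println(conservationHolds(100, 90, 3, 2, 5)) // balanced ledger
	fmt.Println(conservationHolds(100, 90, 3, 2, 6)) // one request counted twice or lost
}
```

Extracting one shared helper, as NA-2 does, means all five conservation tests assert the same equation and cannot silently drift apart (e.g. one test forgetting the TimedOutRequests term).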