Simplify kv cache (by removing matched variable) by sriumcp · Pull Request #5 · inference-sim/inference-sim

sriumcp · 2025-06-13T17:13:48Z

No description provided.

Signed-off-by: Srinivasan Parthasarathy <[email protected]>

sim/kvcache.go

sim/simulator.go


H2 (fixed additive scheduling overhead) refuted at InferSim values: TPOT worsened by 11.5pp.

The residual after H1 scales with model size (~2ms for 7B, ~3ms for 70B) rather than being constant. InferSim's 5ms/30ms values are calibrated to InferSim's own modeling gaps, not actual vLLM overhead.

Updates: H2 section with experiment results and revised recommendation, Tier 0 table, execution order, overfitting risk note, Section 3.3 theoretical grounding, and limitation #5.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Remove H2 findings injected into the TAM plan document.

The TAM doc should contain only the research plan; experiment results belong in hypotheses/h-roofline/h2-scheduling-overhead/FINDINGS.md.

Reverted 6 locations: H2 section header/body, mechanism evidence bullet, Tier 0 table row, overfitting risk note, Section 3.3, and limitation #5.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

…uctural tests

Critical: Add CLI validation that --pd-decider other than "never" requires --prefill-instances and --decode-instances to be set (R1: no silent config failure).

Important:
- Remove superseded disaggregation decider recipe (incorrect field names, wrong flag names, stale purity contract) from extension-recipes.md
- Add prefix-threshold to first PD flags table in configuration.md
- Replace internal field accesses (cs.parentRequests, cs.transfersInitiated, cs.trace, cs.droppedAtDecodeKV) with public accessors (cs.ParentRequests(), cs.Trace(), cs.DroppedKVAllocations()) in disaggregation_test.go for refactor survival (BDD/TDD principle inference-sim#5)

Also includes minor comment condensations from code-simplifier agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
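The "Critical" validation rule above can be sketched roughly as follows. This is a minimal illustration, not the code from cmd/root.go: the `pdFlags` struct and `validatePDFlags` function are hypothetical names, and treating a value of 0 as "unset" is an assumption.

```go
package main

import "fmt"

// pdFlags is a hypothetical holder for the three flag values; the real
// CLI wiring is not shown on this page.
type pdFlags struct {
	Decider          string // value of --pd-decider
	PrefillInstances int    // value of --prefill-instances (0 assumed to mean unset)
	DecodeInstances  int    // value of --decode-instances (0 assumed to mean unset)
}

// validatePDFlags returns an error instead of silently accepting a decider
// that cannot take effect because a pool size is missing.
func validatePDFlags(f pdFlags) error {
	if f.Decider == "never" {
		return nil // disaggregation disabled: pool sizes are not required
	}
	if f.PrefillInstances <= 0 || f.DecodeInstances <= 0 {
		return fmt.Errorf("--pd-decider=%q requires --prefill-instances and --decode-instances to be set", f.Decider)
	}
	return nil
}

func main() {
	fmt.Println(validatePDFlags(pdFlags{Decider: "never"}))
	fmt.Println(validatePDFlags(pdFlags{Decider: "prefix-threshold"}))
	fmt.Println(validatePDFlags(pdFlags{Decider: "prefix-threshold", PrefillInstances: 2, DecodeInstances: 2}))
}
```

Returning an error, rather than logging and continuing, is what keeps the misconfiguration from being silent.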

…ion (PR5) (#620)

* feat(sim): add PrefixThresholdDecider and DisaggregationObserver

Adds PrefixThresholdDecider: disaggregates when non-cached token count exceeds threshold. Maintains a router-side PrefixCacheIndex under a single globalVirtualInstance key to track cluster-wide prefix knowledge. Also adds DisaggregationObserver interface for stateful deciders that learn from routing decisions (ObserveRouting called synchronously by ClusterSimulator after each routing decision). Implements cachedHashes/cachedReqID pattern (mirrors routing_prefix_scorer.go) to avoid double-hashing between Decide() and ObserveRouting().

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* feat(sim): register prefix-threshold in validDisaggregationDeciders

Adds "prefix-threshold" to the valid disaggregation decider names in bundle.go, enabling IsValidDisaggregationDecider() and ValidDisaggregationDeciderNames() to recognize the new decider.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* feat(sim/cluster,cmd): add PDPrefixThreshold config field and CLI flag

Adds PDPrefixThreshold int to DeploymentConfig for the prefix-threshold decider's non-cached token threshold. Adds --pd-prefix-threshold CLI flag (default 512) with >= 0 validation. Updates --pd-decider description to include "prefix-threshold". Wires PDPrefixThreshold into the config construction in cmd/root.go.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* feat(sim/cluster): wire PrefixThresholdDecider and DisaggregationObserver

- cluster.go: branch on PDDecider=="prefix-threshold" to construct PrefixThresholdDecider(PDPrefixThreshold, BlockSizeTokens) directly; add notifyDisaggregationObserver helper
- cluster_event.go: call notifyDisaggregationObserver after RoutingDecisionEvent injection (standard routing path, BC-PD-28)
- pd_events.go: call notifyDisaggregationObserver after PrefillRoutingEvent injection (disaggregated path, BC-PD-28)
- disaggregation_test.go: add integration tests verifying wiring, high/low threshold behavior, observer call, and transfer conservation

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* chore: update CLAUDE.md for PrefixThresholdDecider and --pd-prefix-threshold

Documents DisaggregationObserver interface, PrefixThresholdDecider, and PDPrefixThreshold in the file organization table. Adds --pd-prefix-threshold to the CLI flags list and the disaggregated data flow CLI flags summary.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* fix(sim): convergence review fixes for prefix-threshold decider PR

Round 1 fixes:
- Update DisaggregationObserver docstring with R13 note and R17 signal freshness guarantee
- Add noopDisaggregationObserver to satisfy R13 (>=2 implementations)
- Add explicit comment in DecodeRoutingEvent explaining why observer is intentionally not called on the decode sub-path
- Add 'Adding New Disaggregation Deciders' section to extension-recipes.md covering both stateless and stateful patterns
- Fix Decide() docstring: clarify hash-reuse fires only on non-disaggregated path (disaggregated path receives prefill sub-request with different ID)

Round 2 fixes:
- Remove structural TestPrefixThreshold_DeciderWiredCorrectly; replace with compile-time interface assertion (behavioral coverage exists in ZeroThresholdAlwaysDisaggregates and HighThresholdNoDisaggregation)
- Add PDPrefixThreshold field to newTestDisaggDeploymentConfig (R4 construction site audit)
- Add PD disaggregation row to CLI Flag Summary table in configuration.md including --pd-prefix-threshold and all PD flags introduced in PR1-PR4

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* fix(sim): convergence review fixes for prefix-threshold decider PR (Round 3)

- Add logrus.Warn when --pd-prefix-threshold is explicitly set but --pd-decider is not "prefix-threshold" (silent flag ignored, R3 spirit)
- Add PD Disaggregation section to docs/reference/configuration.md explaining decider options, prefix-threshold semantics (non-cached tokens vs total tokens), default value meaning, and all PD flags

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* fix(sim/cluster): strengthen BC-PD-28 test to verify observer cache-warming effect

TestPrefixThreshold_ObserverWarmsCache replaces the previous TestPrefixThreshold_ObserverCalledAfterRouting which only checked causality invariants (TransferCompleteTime != 0) and did not verify that the DisaggregationObserver actually warmed the prefix cache. The new test uses two requests with a shared 192-token prefix (threshold=150): req1 disaggregates (192 > 150 with empty cache), observer records 12 blocks, req2 arrives later with the same prefix (58 non-cached tokens ≤ 150) and is routed locally, proving the observer was called and the cache was warmed.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* docs: fix extension-recipes step count and remove broken configuration.md cross-link

- extension-recipes.md: add missing step 4 (update configuration.md table) and fix touch-point count from 3 to 4 for stateless disaggregation deciders
- configuration.md: replace broken cross-link to non-existent architecture.md#pd-disaggregation with descriptive text documenting the pool topology constraint (prefill + decode <= num-instances)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* fix(sim): restore R1 compliance and state mutation order in DecodeRoutingEvent

- Restore droppedKVAllocations counter and DroppedKVAllocations() accessor (R1: count dropped requests, never silent)
- Move AssignedInstance/DecodeInstanceID/DecodeEnqueueTime assignment to after successful AllocateTransferredKV check to prevent inconsistent state on failure
- Restore DroppedKVAllocations() in cmd/root.go anomaly counter output
- Add INV-PD-1 through INV-PD-5 back to invariants.md for DRY compliance (these were in CLAUDE.md but not the canonical standards doc)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* fix: address review findings - silent config failure, stale docs, structural tests

Critical: Add CLI validation that --pd-decider other than "never" requires --prefill-instances and --decode-instances to be set (R1: no silent config failure).

Important:
- Remove superseded disaggregation decider recipe (incorrect field names, wrong flag names, stale purity contract) from extension-recipes.md
- Add prefix-threshold to first PD flags table in configuration.md
- Replace internal field accesses (cs.parentRequests, cs.transfersInitiated, cs.trace, cs.droppedAtDecodeKV) with public accessors (cs.ParentRequests(), cs.Trace(), cs.DroppedKVAllocations()) in disaggregation_test.go for refactor survival (BDD/TDD principle #5)

Also includes minor comment condensations from code-simplifier agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* docs(guide): document Dropped KV Allocations anomaly counter

Add the new PD-mode anomaly counter to the results guide so users understand what it means when decode KV allocation fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Sonnet 4.6 <[email protected]>
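The prefix-threshold rule described in this squash commit (disaggregate when the non-cached token count exceeds the threshold) can be illustrated with the numbers from the BC-PD-28 test. This is only a sketch: `nonCachedTokens` and `shouldDisaggregate` are hypothetical helpers, not the PrefixThresholdDecider API; the block size of 16 is inferred from "192-token prefix" and "12 blocks"; and the 250-token length of the second request is an assumption chosen to give the 58 non-cached tokens mentioned above.

```go
package main

import "fmt"

// nonCachedTokens returns how many input tokens are not covered by
// already-cached prefix blocks (hypothetical helper).
func nonCachedTokens(totalTokens, cachedBlocks, blockSize int) int {
	cached := cachedBlocks * blockSize
	if cached > totalTokens {
		cached = totalTokens
	}
	return totalTokens - cached
}

// shouldDisaggregate applies the rule from the commit message:
// disaggregate only when the non-cached count strictly exceeds the threshold.
func shouldDisaggregate(totalTokens, cachedBlocks, blockSize, threshold int) bool {
	return nonCachedTokens(totalTokens, cachedBlocks, blockSize) > threshold
}

func main() {
	const blockSize = 16  // inferred: 192 tokens / 12 blocks
	const threshold = 150 // from the test description

	// req1: 192 tokens, empty cache, so 192 non-cached tokens.
	fmt.Println(shouldDisaggregate(192, 0, blockSize, threshold)) // true

	// req2: assumed 250 tokens, 12 blocks cached by the observer,
	// so 58 non-cached tokens.
	fmt.Println(shouldDisaggregate(250, 12, blockSize, threshold)) // false
}
```

The second call returns false only because the observer recorded the first request's blocks, which is the cache-warming effect the strengthened test checks.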

…prove code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

…#698) * feat(latency): decouple quantized weight precision from compute dtype (#443) Roofline and KV capacity calculations now correctly use quantized weight precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations. - Add WeightBytesPerParam field to ModelConfig with zero-value sentinel - Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8) - Add --weight-bytes-per-param CLI flag for manual override - Update calculateMemoryAccessBytes() and computeModelWeightBytes() - Add validation in ValidateRooflineConfig - 18 new tests covering all behavioral contracts Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings Address 3 IMPORTANT findings from convergence review Round 1: 1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc path (R23: code path parity with analytical backend) 2. Add WeightBytesPerParam validation in CalculateKVBlocks public API 3. Warn when quantization_config is present but weight precision could not be auto-detected (both analytical and blackbox paths) Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(cmd): convergence R2 — early flag validation, backend-specific logs Move --weight-bytes-per-param validation before backend switch to avoid Fatalf inside the best-effort blackbox block (preserves fall-through contract). Make weight precision log backend-specific: roofline uses it for step time; trained-roofline/crossmodel only for KV capacity. Co-Authored-By: Claude Opus 4.6 <[email protected]> * style: gofmt struct/var alignment after WeightBytesPerParam addition Convergence R3: gofmt -w to re-align ModelConfig struct fields and cmd/root.go var block after longest field name changed. 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * style: gofmt const/map alignment in kv_capacity files Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: document MFU calibration approximation for quantized models Add code comment in rooflineStepTime noting that MFU values were calibrated against FP16 measurements. For quantized models (W4A16), reduced weight bandwidth shifts the roofline crossover, producing conservative estimates — safe for capacity planning. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment - Use strings.EqualFold for FP8 quant_method matching (case-insensitive) - Add test for FP8 with bits=8 present (verifies bits-first path agrees) - Document why WeightBytesPerParam > BytesPerParam is accepted in validation Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): complete three-tier quantized weight detection, remove CLI flag Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close the gap where w8a8 models silently fell back to torch_dtype. Add model name convention detection (w4a16, fp8) as a second-tier fallback. Remove the --weight-bytes-per-param CLI flag — all three issue #443 options are now covered by auto-detection, making the manual override unnecessary. Detection chain: quantization_config → model name → torch_dtype fallback. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go) and add model name fallback to cmd/replay.go roofline path (was missing). Three call sites (root.go blackbox, root.go roofline, replay.go) now share identical fallback + logging logic. C2: Sort config_groups keys before iteration in compressed-tensors parsing (INV-6 determinism). 
I2: Add string-to-int coercion for quantization_config.bits field (handles "bits": "4" from some HF configs). I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode end-to-end test verifying quantized model produces lower decode step time than FP16 in memory-bound regime. I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError covering +Inf validation gap. Minor: Remove double blank line in root.go init(). Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: update quantization support language across guides, references, and extension recipes Five documentation pages still claimed quantization was "not yet modeled" and recommended blackbox mode, contradicting the three-tier auto-detection merged in this branch. Addresses PR #698 review comments D1–D5. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(cmd): warn when trained-roofline ignores quantized weight precision (I4) Trained-roofline hardcodes FP16 bytesPerElement to match its training pipeline, so WeightBytesPerParam only affects KV capacity, not step time. Previously the CLI logged "quantized model detected" without mentioning this limitation, which was misleading. Now emits an explicit warning in both run and replay paths. Addresses review item I4 from PR #698 convergence review. 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix: address cosmetic review comments - Add bitsPerByte constant to replace magic number 8.0 (issue #6) - Improve roofline approximation comment with quantitative guidance (issue #8) * Address PR #698 review feedback: refactor quantization parsing and improve code quality - Extract quantization parsing into parseQuantizationConfig helper (Issue #4) - Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1) - Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2) - Add debug logging for malformed compressed-tensors config_groups (Issue #3) - Add error handling for invalid bits string-to-int coercion (Issue #5) - Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7) - Fix inconsistent spacing in string concatenation (Issue #9) All changes maintain backward compatibility and pass existing tests. --------- Co-authored-by: Claude Opus 4.6 <[email protected]>
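The detection chain named in this squash commit (quantization_config → model name → torch_dtype fallback) and the bits-to-bytes conversion behind the `bitsPerByte` constant can be sketched as below. `resolveWeightBytesPerParam` is a hypothetical helper, not the repository's function; the plain substring match on the model name is an assumption, and only the two name conventions mentioned above (w4a16, fp8) are handled.

```go
package main

import (
	"fmt"
	"strings"
)

// bitsPerByte mirrors the named constant the commit introduces in place
// of the magic number 8.0.
const bitsPerByte = 8.0

// resolveWeightBytesPerParam walks the three tiers in order.
// quantBits is the bit width parsed from quantization_config (0 = not
// detected), modelName is checked for naming conventions, and dtypeBytes
// is the torch_dtype width used as the last resort.
func resolveWeightBytesPerParam(quantBits int, modelName string, dtypeBytes float64) float64 {
	// Tier 1: explicit quantization_config bit width.
	if quantBits > 0 {
		return float64(quantBits) / bitsPerByte
	}
	// Tier 2: model name convention (assumed case-insensitive substring match).
	name := strings.ToLower(modelName)
	switch {
	case strings.Contains(name, "w4a16"):
		return 4 / bitsPerByte
	case strings.Contains(name, "fp8"):
		return 8 / bitsPerByte
	}
	// Tier 3: fall back to the compute dtype width.
	return dtypeBytes
}

func main() {
	fmt.Println(resolveWeightBytesPerParam(4, "some-model", 2.0))       // tier 1
	fmt.Println(resolveWeightBytesPerParam(0, "some-model-W4A16", 2.0)) // tier 2
	fmt.Println(resolveWeightBytesPerParam(0, "some-model", 2.0))       // tier 3
}
```

With 4-bit weights this yields the 0.5 bytes/param quoted for W4A16, while the compute dtype (2.0 for bfloat16) stays in use for KV cache and activations.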

* feat(latency): decouple quantized weight precision from compute dtype (#443) Roofline and KV capacity calculations now correctly use quantized weight precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations. - Add WeightBytesPerParam field to ModelConfig with zero-value sentinel - Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8) - Add --weight-bytes-per-param CLI flag for manual override - Update calculateMemoryAccessBytes() and computeModelWeightBytes() - Add validation in ValidateRooflineConfig - 18 new tests covering all behavioral contracts Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings Address 3 IMPORTANT findings from convergence review Round 1: 1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc path (R23: code path parity with analytical backend) 2. Add WeightBytesPerParam validation in CalculateKVBlocks public API 3. Warn when quantization_config is present but weight precision could not be auto-detected (both analytical and blackbox paths) Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(cmd): convergence R2 — early flag validation, backend-specific logs Move --weight-bytes-per-param validation before backend switch to avoid Fatalf inside the best-effort blackbox block (preserves fall-through contract). Make weight precision log backend-specific: roofline uses it for step time; trained-roofline/crossmodel only for KV capacity. Co-Authored-By: Claude Opus 4.6 <[email protected]> * style: gofmt struct/var alignment after WeightBytesPerParam addition Convergence R3: gofmt -w to re-align ModelConfig struct fields and cmd/root.go var block after longest field name changed. 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * style: gofmt const/map alignment in kv_capacity files Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: document MFU calibration approximation for quantized models Add code comment in rooflineStepTime noting that MFU values were calibrated against FP16 measurements. For quantized models (W4A16), reduced weight bandwidth shifts the roofline crossover, producing conservative estimates — safe for capacity planning. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment - Use strings.EqualFold for FP8 quant_method matching (case-insensitive) - Add test for FP8 with bits=8 present (verifies bits-first path agrees) - Document why WeightBytesPerParam > BytesPerParam is accepted in validation Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): complete three-tier quantized weight detection, remove CLI flag Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close the gap where w8a8 models silently fell back to torch_dtype. Add model name convention detection (w4a16, fp8) as a second-tier fallback. Remove the --weight-bytes-per-param CLI flag — all three issue #443 options are now covered by auto-detection, making the manual override unnecessary. Detection chain: quantization_config → model name → torch_dtype fallback. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go) and add model name fallback to cmd/replay.go roofline path (was missing). Three call sites (root.go blackbox, root.go roofline, replay.go) now share identical fallback + logging logic. C2: Sort config_groups keys before iteration in compressed-tensors parsing (INV-6 determinism). 
I2: Add string-to-int coercion for quantization_config.bits field (handles "bits": "4" from some HF configs). I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode end-to-end test verifying quantized model produces lower decode step time than FP16 in memory-bound regime. I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError covering +Inf validation gap. Minor: Remove double blank line in root.go init(). Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: update quantization support language across guides, references, and extension recipes Five documentation pages still claimed quantization was "not yet modeled" and recommended blackbox mode, contradicting the three-tier auto-detection merged in this branch. Addresses PR #698 review comments D1–D5. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(cmd): warn when trained-roofline ignores quantized weight precision (I4) Trained-roofline hardcodes FP16 bytesPerElement to match its training pipeline, so WeightBytesPerParam only affects KV capacity, not step time. Previously the CLI logged "quantized model detected" without mentioning this limitation, which was misleading. Now emits an explicit warning in both run and replay paths. Addresses review item I4 from PR #698 convergence review. 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix: address cosmetic review comments - Add bitsPerByte constant to replace magic number 8.0 (issue #6) - Improve roofline approximation comment with quantitative guidance (issue #8) * Address PR #698 review feedback: refactor quantization parsing and improve code quality - Extract quantization parsing into parseQuantizationConfig helper (Issue #4) - Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1) - Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2) - Add debug logging for malformed compressed-tensors config_groups (Issue #3) - Add error handling for invalid bits string-to-int coercion (Issue #5) - Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7) - Fix inconsistent spacing in string concatenation (Issue #9) All changes maintain backward compatibility and pass existing tests. * feat(sim): add L40S GPU and FP8 compute support - Add L40S hardware configuration (362.05 TFLOPS, 48GB, 0.864 TB/s) - Add TFlopsFP8 field to HardwareCalib for native FP8 tensor core support - Update H100 with TFlopsFP8=1979.0 (2× FP16 rate) and adjusted MFU values - Update A100-SXM and A100-80 with TFlopsFP8=0 (no native FP8 support) - Implement FP8 compute selection in roofline model based on weight precision - Add comprehensive tests for FP8 compute selection logic Fixes #762 Co-Authored-By: Claude <[email protected]> * docs: add MFU justification and validation tests - Add inline documentation to hardware_config.json linking to Discussion #589 - Add comprehensive MFU validation tests in sim/config_test.go - Validates MFU ranges (0 < MFU < 1) - Validates MfuDecode < MfuPrefill relationship - Tests all GPU types (H100, A100, L40S) - Update docs/reference/models.md with MFU calibration info box - All tests pass Addresses review findings from quick-review * feat(latency): add TFlopsFP8 validation (R3) Add validation for 
HardwareCalib.TFlopsFP8 in ValidateRooflineConfig, following the same optional-field pattern as MemoryGiB: check if non-zero, then reject NaN/Inf/negative values. Includes test coverage for NaN and negative TFlopsFP8 cases. * fix: address review nits - roofline.go:217: Fix comment wording from 'upcasted' to 'dequantized to FP16 during GEMM' - hardware_config.json:8: Restore H100 mfuDecode to 0.30 for consistency with other entries * docs(roofline): clarify FP8 compute rate selection logic Improve comment to explicitly document that == 1.0 is intentional: - FP8 models (exactly 1.0 byte/param) use FP8 compute rate on H100 - Sub-FP8 formats (e.g., W4A16 at 0.5 bytes/param) dequantize to FP16 during GEMM This addresses the review comment about == 1.0 exact equality. The behavior is correct: only true FP8 models use FP8 tensor cores. W4A16 and other sub-FP8 formats use FP16 compute after dequantization, as validated by TestRooflineStepTime_FP8ComputeSelection_EdgeCases. --------- Co-authored-by: Claude Opus 4.6 <[email protected]>
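The FP8 compute-rate selection described in the last two commits can be sketched as follows. This is a hedged illustration, not the code in roofline.go: `hardwareCalib` and `computeTFlops` are hypothetical names, `TFlopsFP16` is an assumed field name, the H100 FP16 figure is derived from the "2× FP16 rate" statement, and the A100 FP16 figure is an illustrative placeholder.

```go
package main

import "fmt"

// hardwareCalib is a cut-down, hypothetical stand-in for HardwareCalib.
type hardwareCalib struct {
	TFlopsFP16 float64 // assumed field name for the FP16 peak rate
	TFlopsFP8  float64 // 0 means no native FP8 tensor core support
}

// computeTFlops picks the peak compute rate for the roofline model.
// Only weights at exactly 1.0 byte/param use the FP8 rate; sub-FP8
// formats such as W4A16 (0.5) are dequantized to FP16 during GEMM.
func computeTFlops(hw hardwareCalib, weightBytesPerParam float64) float64 {
	if weightBytesPerParam == 1.0 && hw.TFlopsFP8 > 0 {
		return hw.TFlopsFP8
	}
	return hw.TFlopsFP16
}

func main() {
	h100 := hardwareCalib{TFlopsFP16: 989.5, TFlopsFP8: 1979.0} // FP16 derived as half of FP8
	a100 := hardwareCalib{TFlopsFP16: 312.0, TFlopsFP8: 0}      // FP16 value is illustrative

	fmt.Println(computeTFlops(h100, 1.0)) // FP8 model on H100: FP8 rate
	fmt.Println(computeTFlops(h100, 0.5)) // W4A16 on H100: FP16 rate
	fmt.Println(computeTFlops(a100, 1.0)) // FP8 model on A100: falls back to FP16
}
```

The exact `== 1.0` comparison is intentional per the commit message, and it is safe here because 1.0 is produced by an exact division (8 bits / 8).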

* feat(infra): Phase 1A — nodes, GPUs, and instance lifecycle Add node/GPU placement, instance lifecycle management, multi-model routing, and per-model metrics to the cluster simulator. Co-Authored-By: Claude Opus 4.6 <[email protected]> * chore: add AGENTS.md to .gitignore - AGENTS.md is a generated file for agent context - Should not be committed to the repository * fix(cluster): prevent double-counting in drain redirect policy Fixes INV-1 conservation violation when DrainRedirect policy re-injects queued requests. Previously, redirected requests were counted twice in CompletedRequests: once when initially injected and again when completed after redirection. Changes: - Add Request.Redirected field to track re-injected requests - Mark requests as redirected in drainRedirect.Drain() before re-injection - Skip CompletedRequests increment in recordRequestCompletion() for redirected requests - Add TestInstanceLifecycle_RedirectDrainPreservesConservation to verify fix The fix ensures INV-1 (request conservation) holds: injected = completed + queued + running + dropped + timed_out Addresses PR #697 review feedback on drain redirect conservation. * fix: address PR #697 review comments - Document DrainPolicy as Phase 1C infrastructure - Fix warm-up request overcounting by checking warmUpRemaining - Fix drain callback memory leak in MarkNodeTerminated - Add named constants for lifecycle event priorities - Clarify drainWait GPU release and swap-remove logic - Add .bob/ to .gitignore * Fix warm-up TTFT penalty implementation - Initialize warmUpRemaining for all instances in backward-compat mode - Fix indentation in cluster_event.go - Update warm-up recording to track first N requests - Adjust test expectations to account for queueing effects Fixes build and test failures in PR. 
* chore: update .gitignore to ignore .bob/notes instead of entire .bob/ directory * feat(latency): decouple quantized weight precision from compute dtype (#698) * feat(latency): decouple quantized weight precision from compute dtype (#443) Roofline and KV capacity calculations now correctly use quantized weight precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations. - Add WeightBytesPerParam field to ModelConfig with zero-value sentinel - Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8) - Add --weight-bytes-per-param CLI flag for manual override - Update calculateMemoryAccessBytes() and computeModelWeightBytes() - Add validation in ValidateRooflineConfig - 18 new tests covering all behavioral contracts Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings Address 3 IMPORTANT findings from convergence review Round 1: 1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc path (R23: code path parity with analytical backend) 2. Add WeightBytesPerParam validation in CalculateKVBlocks public API 3. Warn when quantization_config is present but weight precision could not be auto-detected (both analytical and blackbox paths) Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(cmd): convergence R2 — early flag validation, backend-specific logs Move --weight-bytes-per-param validation before backend switch to avoid Fatalf inside the best-effort blackbox block (preserves fall-through contract). Make weight precision log backend-specific: roofline uses it for step time; trained-roofline/crossmodel only for KV capacity. Co-Authored-By: Claude Opus 4.6 <[email protected]> * style: gofmt struct/var alignment after WeightBytesPerParam addition Convergence R3: gofmt -w to re-align ModelConfig struct fields and cmd/root.go var block after longest field name changed. 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * style: gofmt const/map alignment in kv_capacity files Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: document MFU calibration approximation for quantized models Add code comment in rooflineStepTime noting that MFU values were calibrated against FP16 measurements. For quantized models (W4A16), reduced weight bandwidth shifts the roofline crossover, producing conservative estimates — safe for capacity planning. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment - Use strings.EqualFold for FP8 quant_method matching (case-insensitive) - Add test for FP8 with bits=8 present (verifies bits-first path agrees) - Document why WeightBytesPerParam > BytesPerParam is accepted in validation Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): complete three-tier quantized weight detection, remove CLI flag Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close the gap where w8a8 models silently fell back to torch_dtype. Add model name convention detection (w4a16, fp8) as a second-tier fallback. Remove the --weight-bytes-per-param CLI flag — all three issue #443 options are now covered by auto-detection, making the manual override unnecessary. Detection chain: quantization_config → model name → torch_dtype fallback. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go) and add model name fallback to cmd/replay.go roofline path (was missing). Three call sites (root.go blackbox, root.go roofline, replay.go) now share identical fallback + logging logic. C2: Sort config_groups keys before iteration in compressed-tensors parsing (INV-6 determinism). 
I2: Add string-to-int coercion for quantization_config.bits field (handles "bits": "4" from some HF configs). I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode end-to-end test verifying quantized model produces lower decode step time than FP16 in memory-bound regime. I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError covering +Inf validation gap. Minor: Remove double blank line in root.go init(). Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: update quantization support language across guides, references, and extension recipes Five documentation pages still claimed quantization was "not yet modeled" and recommended blackbox mode, contradicting the three-tier auto-detection merged in this branch. Addresses PR #698 review comments D1–D5. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(cmd): warn when trained-roofline ignores quantized weight precision (I4) Trained-roofline hardcodes FP16 bytesPerElement to match its training pipeline, so WeightBytesPerParam only affects KV capacity, not step time. Previously the CLI logged "quantized model detected" without mentioning this limitation, which was misleading. Now emits an explicit warning in both run and replay paths. Addresses review item I4 from PR #698 convergence review. 
Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address cosmetic review comments

- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

* Address PR #698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>

* feat(sim): add L40S GPU and FP8 compute support (#765)

* feat(latency): decouple quantized weight precision from compute dtype (#443)

Roofline and KV capacity calculations now correctly use quantized weight precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): convergence R2 — early flag validation, backend-specific logs

Move --weight-bytes-per-param validation before backend switch to avoid Fatalf inside the best-effort blackbox block (preserves fall-through contract). Make weight precision log backend-specific: roofline uses it for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt struct/var alignment after WeightBytesPerParam addition

Convergence R3: gofmt -w to re-align ModelConfig struct fields and cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt const/map alignment in kv_capacity files

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: document MFU calibration approximation for quantized models

Add code comment in rooflineStepTime noting that MFU values were calibrated against FP16 measurements. For quantized models (W4A16), reduced weight bandwidth shifts the roofline crossover, producing conservative estimates — safe for capacity planning.
Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close the gap where w8a8 models silently fell back to torch_dtype. Add model name convention detection (w4a16, fp8) as a second-tier fallback. Remove the --weight-bytes-per-param CLI flag — all three issue #443 options are now covered by auto-detection, making the manual override unnecessary.

Detection chain: quantization_config → model name → torch_dtype fallback.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go) and add model name fallback to cmd/replay.go roofline path (was missing). Three call sites (root.go blackbox, root.go roofline, replay.go) now share identical fallback + logging logic.

C2: Sort config_groups keys before iteration in compressed-tensors parsing (INV-6 determinism).

I2: Add string-to-int coercion for quantization_config.bits field (handles "bits": "4" from some HF configs).

I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode end-to-end test verifying quantized model produces lower decode step time than FP16 in memory-bound regime.

I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError covering +Inf validation gap.

Minor: Remove double blank line in root.go init().
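The detection chain named in the commit notes above (quantization_config → model name → torch_dtype fallback), together with the `bits` string-to-int coercion and the `bitsPerByte` constant, can be illustrated with a minimal Go sketch. This is not the repository's implementation: `coerceBits`, `detectWeightBytes`, the returned tier labels, and the torch_dtype byte mapping are hypothetical names and assumptions chosen for illustration, and compressed-tensors `config_groups` parsing is omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// bitsPerByte converts a weight bit width into bytes per parameter.
const bitsPerByte = 8.0

// coerceBits (hypothetical helper) accepts a JSON number or a numeric
// string such as "4", since some configs store bits as a string.
func coerceBits(v any) (int, error) {
	switch b := v.(type) {
	case float64: // encoding/json decodes JSON numbers into float64
		return int(b), nil
	case string:
		return strconv.Atoi(strings.TrimSpace(b))
	default:
		return 0, fmt.Errorf("unsupported bits type %T", v)
	}
}

// detectWeightBytes (hypothetical) returns bytes per weight parameter and
// the name of the tier that produced the answer.
func detectWeightBytes(configJSON []byte, modelName string) (float64, string, error) {
	var cfg struct {
		TorchDtype string         `json:"torch_dtype"`
		Quant      map[string]any `json:"quantization_config"`
	}
	if err := json.Unmarshal(configJSON, &cfg); err != nil {
		return 0, "", err
	}

	// Tier 1: explicit quantization_config, bits first, then FP8 by method.
	if cfg.Quant != nil {
		if raw, ok := cfg.Quant["bits"]; ok {
			bits, err := coerceBits(raw)
			if err != nil {
				return 0, "", fmt.Errorf("invalid bits: %w", err)
			}
			if bits <= 0 {
				return 0, "", fmt.Errorf("invalid bits: %d", bits)
			}
			return float64(bits) / bitsPerByte, "quantization_config", nil
		}
		if m, _ := cfg.Quant["quant_method"].(string); strings.EqualFold(m, "fp8") {
			return 1.0, "quantization_config", nil
		}
	}

	// Tier 2: model name convention (w4a16, fp8).
	lower := strings.ToLower(modelName)
	switch {
	case strings.Contains(lower, "w4a16"):
		return 0.5, "model name", nil
	case strings.Contains(lower, "fp8"):
		return 1.0, "model name", nil
	}

	// Tier 3: torch_dtype fallback. Treating anything other than float32
	// as a 2-byte type is an assumption made for this sketch.
	if cfg.TorchDtype == "float32" {
		return 4.0, "torch_dtype", nil
	}
	return 2.0, "torch_dtype", nil
}

func main() {
	cfg := []byte(`{"torch_dtype":"bfloat16","quantization_config":{"quant_method":"gptq","bits":"4"}}`)
	b, tier, err := detectWeightBytes(cfg, "example/model")
	fmt.Println(b, tier, err)
}
```

Returning the tier alongside the value mirrors the commit notes' emphasis on logging which fallback fired; a caller could use it to decide whether to emit a warning.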
* feat(sim): add L40S GPU and FP8 compute support

- Add L40S hardware configuration (362.05 TFLOPS, 48GB, 0.864 TB/s)
- Add TFlopsFP8 field to HardwareCalib for native FP8 tensor core support
- Update H100 with TFlopsFP8=1979.0 (2× FP16 rate) and adjusted MFU values
- Update A100-SXM and A100-80 with TFlopsFP8=0 (no native FP8 support)
- Implement FP8 compute selection in roofline model based on weight precision
- Add comprehensive tests for FP8 compute selection logic

Fixes #762

Co-Authored-By: Claude <[email protected]>

* docs: add MFU justification and validation tests

- Add inline documentation to hardware_config.json linking to Discussion #589
- Add comprehensive MFU validation tests in sim/config_test.go
  - Validates MFU ranges (0 < MFU < 1)
  - Validates MfuDecode < MfuPrefill relationship
  - Tests all GPU types (H100, A100, L40S)
- Update docs/reference/models.md with MFU calibration info box
- All tests pass

Addresses review findings from quick-review

* feat(latency): add TFlopsFP8 validation (R3)

Add validation for HardwareCalib.TFlopsFP8 in ValidateRooflineConfig, following the same optional-field pattern as MemoryGiB: check if non-zero, then reject NaN/Inf/negative values. Includes test coverage for NaN and negative TFlopsFP8 cases.

* fix: address review nits

- roofline.go:217: Fix comment wording from 'upcasted' to 'dequantized to FP16 during GEMM'
- hardware_config.json:8: Restore H100 mfuDecode to 0.30 for consistency with other entries

* docs(roofline): clarify FP8 compute rate selection logic

Improve comment to explicitly document that == 1.0 is intentional:

- FP8 models (exactly 1.0 byte/param) use FP8 compute rate on H100
- Sub-FP8 formats (e.g., W4A16 at 0.5 bytes/param) dequantize to FP16 during GEMM

This addresses the review comment about == 1.0 exact equality. The behavior is correct: only true FP8 models use FP8 tensor cores. W4A16 and other sub-FP8 formats use FP16 compute after dequantization, as validated by TestRooflineStepTime_FP8ComputeSelection_EdgeCases.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>

* fix(cluster): address sriumcp review — conservation, replay anomaly, non-variadic CollectRawMetrics

- sim/simulator.go: Remove incorrect `if !req.Redirected` guard on CompletedRequests++. The guard caused redirected requests to vanish from INV-1 accounting: source's InjectedRequests=0 (drained from WaitQ before completion), destination's InjectedRequests=0 (skipped CompletedRequests). Destination is the sole completion site so incrementing there preserves conservation.
- cmd/replay.go: Add `|| rawMetrics.RoutingRejections > 0` to anomaly condition. Clusters where all failures are routing rejections (no routable instances) silently omitted the anomaly summary block (I3 from sriumcp review).
- sim/cluster/metrics.go: Make CollectRawMetrics routingRejections parameter explicit (non-variadic). Prevents call sites from silently passing 0 and missing routing rejections. Updated all test call sites to pass 0 explicitly.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* revert(ci): remove skip-cache from golangci-lint step

Unrelated to Phase 1A changes. skip-cache: true was added during development but should not be merged to main.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* fix(cluster): address sriumcp Round 2 — test, stale comment, duplicate Metrics.Requests

- sim/cluster/instance_lifecycle_test.go: Rewrite conservation test to actually seed requests into inst0's WaitQ before drain. Previous version used a manual event loop on an empty queue — DrainWaitQueue() returned [] and no redirection ever occurred. New version uses inst0.sim.EnqueueRequest directly + empty workload so Run() doesn't push duplicate arrivals. Also adds pre/post assertions: QueueDepth==0, inFlightRequests==0, and clusterEvents non-empty after drain.
- sim/request.go: Update Redirected field comment to reflect actual behavior. Previous comment said "completion accounting is skipped" — opposite of what simulator.go:recordRequestCompletion now does.
- sim/cluster/infra_lifecycle_event.go: Delete stale Metrics.Requests entry for redirected requests before re-injection. Source registered the request at EnqueueRequest time; DrainWaitQueue empties WaitQ but left the entry. Destination re-registers on re-enqueue, causing a spurious "duplicate request ID" WARN in aggregateMetrics() for every redirected request.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* docs: address sriumcp documentation gaps (GAP 1-4)

GAP 1 — configuration.md: Add node_pools and instance_lifecycle YAML schema to the Policy Bundle section so users can discover and configure Phase 1A features. Both are YAML-only (no CLI flags). Add a note block explaining backward compatibility. Update DeploymentConfig row in the config-to-flag mapping table to note the YAML-only fields.

GAP 2 — results.md Anomaly Counters: Rename "Rejected Requests" to "Rejected Requests (Admission)" to match actual CLI output label. Add new "Rejected Requests (Routing)" row explaining when it fires (no routable instances — all Loading/Draining) and the remediation action.

GAP 3 — results.md Per-Model Metrics: Change mean= to p50= in the example output block to match printPerModelMetrics which uses m.TTFT.P50. Add tok/s to the Throughput example line to match actual output format.

GAP 4 — results.md per_model JSON: Add table documenting the per_model key in --results-path JSON output (omitted when no model tags present), with field-by-field description.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

---------

Co-authored-by: tantawi <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Dipanwita Guhathakurta <[email protected]>
Co-authored-by: Srinivasan Parthasarathy <[email protected]>
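The FP8 compute-rate rule and the optional-field validation for TFlopsFP8 described in the commit notes above can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the code in roofline.go: the trimmed-down `HardwareCalib` struct, the `TFlopsFP16` field name, and the functions `validateTFlopsFP8` and `computeTFlops` are hypothetical. The H100 FP16 figure is derived from the notes' statement that 1979.0 is twice the FP16 rate, and the A100 FP16 figure is purely illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// HardwareCalib is a hypothetical, trimmed-down stand-in for the
// simulator's hardware calibration record.
type HardwareCalib struct {
	TFlopsFP16 float64 // assumed field name for the FP16 peak rate
	TFlopsFP8  float64 // 0 means no native FP8 tensor core support
}

// validateTFlopsFP8 follows the optional-field pattern from the notes:
// zero is allowed (field unset); any non-zero value must be finite and
// non-negative.
func validateTFlopsFP8(hw HardwareCalib) error {
	if hw.TFlopsFP8 == 0 {
		return nil
	}
	if math.IsNaN(hw.TFlopsFP8) || math.IsInf(hw.TFlopsFP8, 0) || hw.TFlopsFP8 < 0 {
		return fmt.Errorf("invalid TFlopsFP8: %v", hw.TFlopsFP8)
	}
	return nil
}

// computeTFlops picks the peak compute rate. Only weights at exactly
// 1.0 byte/param use the FP8 rate, and only when the GPU reports one.
// Sub-FP8 formats such as W4A16 (0.5 bytes/param) dequantize to FP16
// during GEMM, so they use the FP16 rate.
func computeTFlops(hw HardwareCalib, weightBytesPerParam float64) float64 {
	if weightBytesPerParam == 1.0 && hw.TFlopsFP8 > 0 {
		return hw.TFlopsFP8
	}
	return hw.TFlopsFP16
}

func main() {
	h100 := HardwareCalib{TFlopsFP16: 989.5, TFlopsFP8: 1979.0} // FP16 derived from the 2x statement
	a100 := HardwareCalib{TFlopsFP16: 312.0}                    // illustrative FP16 value, no FP8

	fmt.Println(computeTFlops(h100, 1.0)) // FP8 weights on H100
	fmt.Println(computeTFlops(h100, 0.5)) // W4A16 on H100
	fmt.Println(computeTFlops(a100, 1.0)) // FP8 weights on a GPU without FP8
	fmt.Println(validateTFlopsFP8(HardwareCalib{TFlopsFP8: math.Inf(1)}))
}
```

The exact `== 1.0` comparison is deliberate here, matching the notes: a looser threshold such as `<= 1.0` would wrongly route 4-bit weights to the FP8 rate.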

sriumcp added 3 commits June 11, 2025 21:24

getting started instructions

684e7ec

Signed-off-by: Srinivasan Parthasarathy <[email protected]>

remove matched var

130cf3a

Signed-off-by: Srinivasan Parthasarathy <[email protected]>

Merge remote-tracking branch 'upstream/main' into simplifykvcache

274a25d

sriumcp requested a review from susiejojo June 13, 2025 17:16

sriumcp self-assigned this Jun 13, 2025

susiejojo reviewed Jun 13, 2025

sim/kvcache.go (outdated, resolved)

sim/kvcache.go (outdated, resolved)

sim/simulator.go (outdated, resolved)

replace bid with blockId

a7293e3

Signed-off-by: Srinivasan Parthasarathy <[email protected]>

sriumcp requested a review from susiejojo June 13, 2025 20:19

sriumcp added 2 commits June 13, 2025 16:28

fix duplicate code

87dcccd

Signed-off-by: Srinivasan Parthasarathy <[email protected]>

cachestatefor instead of cachestate

86b624a

Signed-off-by: Srinivasan Parthasarathy <[email protected]>

susiejojo approved these changes Jun 13, 2025

sriumcp merged commit 4de0812 into inference-sim:main Jun 13, 2025

kalantar mentioned this pull request Feb 25, 2026

fix(skill): redesign convergence-review loop with two-phase execution #431

Closed

6 tasks

susiejojo mentioned this pull request Mar 12, 2026

Docs: comprehensive first-time user review — hardcoded metrics mismatches, tutorial narrative gaps, and UX friction #550

Closed

9 tasks

claude bot mentioned this pull request Mar 12, 2026

docs: fix hardcoded metrics mismatches and tutorial narrative gaps #632

Merged

4 tasks

susiejojo mentioned this pull request Mar 16, 2026

feat: Add --quantization flag for on-the-fly quantization #629

Open

7 tasks

sriumcp mentioned this pull request Mar 19, 2026

feat(latency): decouple quantized weight precision from compute dtype #698

Merged

8 tasks

sriumcp merged 6 commits into inference-sim:main from sriumcp:simplifykvcache

sriumcp commented Jun 13, 2025

2 participants
