feat(latency): decouple quantized weight precision from compute dtype#698
Conversation
Force-push: 1871f69 to e242160
@claude Review this PR
Claude finished @susiejojo's task in 7m 16s — View job
PR Review: feat(latency): decouple quantized weight precision from compute dtype
Summary
This is a well-structured and well-tested PR. The design is sound — using a zero-sentinel for
🔴 Issue: FP8
@claude Have your review concerns from earlier been fixed now?
Claude finished @susiejojo's task in 51s — View job
Review follow-up: all three concerns addressed ✅
Yes, all three concerns from the previous review have been resolved in the latest commit. 🔴 FP8 case-insensitive match — Fixed:

```go
} else if strings.EqualFold(quantMethod, "fp8") {
	weightBytesPerParam = 1.0
}
```

And a new test 🟡
Force-push: d24dec5 to bbf6b2c
Follow-up: complete three-tier detection, remove
sriumcp
left a comment
PR Review: Quantized Weight Precision Detection
The PR correctly addresses all explicit concerns from issue #443 — the three-tier detection chain covers all 4 quantization formats, roofline weight bandwidth is corrected, and KV capacity accounts for reduced weight memory. Test quality is excellent (BC-1 through BC-10 contracts, monotonicity/conservation invariant tests, regression anchors).
However, 2 critical issues and 4 important issues should be resolved before merge.
Critical Issues
C1. cmd/replay.go not updated — R23 code path parity violation
cmd/replay.go:345-358 has a parallel code path that loads HF config and constructs modelConfig for roofline/crossmodel/trained-roofline backends. This path is not updated with:
- The InferWeightBytesFromModelName fallback
- The new quantization detection info/warning logging
- The stale warning at line 357 still reads "possible quantization" instead of the new diagnostic messages
Impact: blis replay with a quantized model (e.g. --model redhatai/llama-3.3-70b-instruct-quantized.w4a16) will still overestimate weight memory by up to 4x.
Fix: Extract a shared helper function (e.g. applyWeightPrecisionFallback(mc *sim.ModelConfig, modelName string, hfCfg *latency.HFConfig)) that performs model-name inference, logs quantization info, and warns on undetermined precision. Call it from all three sites:
- cmd/root.go blackbox auto-fetch path (~line 608)
- cmd/root.go roofline/crossmodel/trained-roofline path (~line 665)
- cmd/replay.go (~line 350)
This also resolves the DRY violation (see I1 below).
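The proposed helper could look roughly like this. This is a structural sketch only: the real sim.ModelConfig and latency.HFConfig live in the repo, so minimal stand-in types are used here, and the infer parameter stands in for the repo's InferWeightBytesFromModelName.

```go
package main

import "fmt"

// Stand-ins for sim.ModelConfig and latency.HFConfig; only the fields
// this sketch needs are modeled.
type ModelConfig struct{ WeightBytesPerParam float64 }
type HFConfig struct{ HasQuantizationConfig bool }

// applyWeightPrecisionFallback sketches the shared helper proposed in C1:
// run the model-name fallback when config parsing left the zero sentinel,
// then either log the detected precision or warn that it is undetermined.
func applyWeightPrecisionFallback(mc *ModelConfig, modelName string, hf *HFConfig, infer func(string) float64) {
	if mc.WeightBytesPerParam == 0 { // zero sentinel: config parsing found nothing
		mc.WeightBytesPerParam = infer(modelName)
	}
	switch {
	case mc.WeightBytesPerParam != 0:
		fmt.Printf("quantized model detected: %g bytes/param\n", mc.WeightBytesPerParam)
	case hf != nil && hf.HasQuantizationConfig:
		fmt.Println("warning: quantization_config present but weight precision undetermined")
	}
}

func main() {
	mc := &ModelConfig{}
	applyWeightPrecisionFallback(mc, "org/model.w4a16",
		&HFConfig{HasQuantizationConfig: true},
		func(string) float64 { return 0.5 }) // hypothetical inferrer
	fmt.Println(mc.WeightBytesPerParam)
}
```

Calling this single function from all three CLI sites would make the fallback and logging behavior identical by construction.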
C2. Non-deterministic map iteration in compressed-tensors parsing — INV-6 violation
sim/latency/config.go — the for _, g := range cg loop over config_groups breaks on the first group with weights.num_bits > 0. If a compressed-tensors config has heterogeneous groups (e.g. group_0 at 4-bit, group_1 at 8-bit), Go's non-deterministic map iteration means different runs could pick different groups, producing different WeightBytesPerParam values.
Impact: Violates INV-6 (determinism: same seed produces byte-identical output).
Fix: Sort the map keys before iterating:
```go
if cg, ok := qc["config_groups"].(map[string]any); ok {
	keys := make([]string, 0, len(cg))
	for k := range cg {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		g := cg[k]
		// ... existing nested type assertions with break ...
	}
}
```
Important Issues
I1. DRY violation: duplicated model-name-fallback in cmd/root.go
The InferWeightBytesFromModelName call + quantization_config warning logic is copy-pasted in two code paths within cmd/root.go:
- Blackbox auto-fetch path (~lines 608-618): missing the "quantized model detected" info log
- Roofline path (~lines 665-684): has the full logging
Fix: Resolved by C1's shared helper extraction.
I2. Silent type coercion failure in bits field
sim/latency/config.go — if a HF config has "bits": "4" (string instead of float64), the type assertion qc["bits"].(float64) silently fails, bits stays 0, and GPTQ/AWQ models fall through without detecting weight precision. The model name fallback in cmd/root.go would likely catch it, but parsing should be more robust.
Fix: Add string-to-int coercion and log unexpected types:
```go
if bitsRaw, ok := qc["bits"].(float64); ok {
	bits = int(bitsRaw)
} else if bitsStr, ok := qc["bits"].(string); ok {
	if parsed, err := strconv.Atoi(bitsStr); err == nil {
		bits = parsed
	}
} else if qc["bits"] != nil {
	logrus.Warnf("quantization_config.bits has unexpected type %T; skipping", qc["bits"])
}
```
I3. No rooflineStepTime end-to-end test for quantized models
Tests verify calculateMemoryAccessBytes produces correct weight memory reduction (0.25x for W4A16), but no test calls rooflineStepTime with a quantized config and verifies lower step latency. A bug in how weight bytes interact with the roofline max(compute, memory) crossover would not be caught.
Fix: Add a test in sim/latency/roofline_test.go:
```go
func TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode(t *testing.T) {
	// Memory-bound decode: smaller weight bandwidth produces faster step time
	fp16 := testModelConfig()  // BytesPerParam=2.0, WeightBytesPerParam=0
	w4a16 := testW4A16Config() // BytesPerParam=2.0, WeightBytesPerParam=0.5
	hw := testHWConfig()
	step := StepConfig{NumTokens: 1, TotalKV: 64, IncludeKVCache: true} // decode step
	fp16Time := rooflineStepTime(fp16, hw, step, 1)
	w4Time := rooflineStepTime(w4a16, hw, step, 1)
	if w4Time >= fp16Time {
		t.Errorf("W4A16 decode should be faster (less weight bandwidth): W4=%d >= FP16=%d", w4Time, fp16Time)
	}
}
```
I4. Missing Inf test for ValidateRooflineConfig.WeightBytesPerParam
The validation checks for math.IsInf but there's no corresponding test in config_test.go (only negative and NaN are tested). CalculateKVBlocks tests cover Inf separately, but ValidateRooflineConfig is a separate entry point.
Fix: Add in sim/latency/config_test.go:
```go
func TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError(t *testing.T) {
	mc := sim.ModelConfig{NumHeads: 32, NumLayers: 32, HiddenDim: 4096, BytesPerParam: 2, WeightBytesPerParam: math.Inf(1)}
	hc := sim.HardwareCalib{TFlopsPeak: 1000, BwPeakTBs: 3.35, MfuPrefill: 0.5, MfuDecode: 0.3}
	err := latency.ValidateRooflineConfig(mc, hc)
	if err == nil {
		t.Fatal("expected error for Inf WeightBytesPerParam")
	}
	if !strings.Contains(err.Error(), "WeightBytesPerParam") {
		t.Errorf("error should mention WeightBytesPerParam, got: %v", err)
	}
}
```
Minor
- Double blank line in cmd/root.go init() (~line 1266): remove one of the two consecutive blank lines after the workload type flag.
- Style consistency: new tests use t.Errorf with manual if-checks while existing tests in sim/config_test.go use testify assert. Not a blocker but worth aligning.
Documentation Gaps
The PR updated D1.
Review fixes addressed (commit 88c548b)
Critical
Important
Minor
All tests pass
… and extension recipes Five documentation pages still claimed quantization was "not yet modeled" and recommended blackbox mode, contradicting the three-tier auto-detection merged in this branch. Addresses PR #698 review comments D1–D5. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Force-push: 1604c72 to 3df8b8b
Force-push: 3df8b8b to 1a60346
Convergence Review — Round 1 (pr-code gate, sonnet model)
Branch:
Result: NOT CONVERGED — 0 CRITICAL, 14 IMPORTANT, ~28 SUGGESTION
IMPORTANT Findings (14)
I1: Non-deterministic map iteration in compressed-tensors parsing (INV-6 violation). Fix: Sort config group keys before iterating.
I2: Stale documentation — 3 pages say "quantization not yet modeled". Fix: Update all three to reflect auto-detection feature.
I3: DRY violation — duplicate model-name fallback + warning in two CLI paths. Fix: Extract shared helper.
I4: Trained-roofline backend silently ignores
…on (I4) Trained-roofline hardcodes FP16 bytesPerElement to match its training pipeline, so WeightBytesPerParam only affects KV capacity, not step time. Previously the CLI logged "quantized model detected" without mentioning this limitation, which was misleading. Now emits an explicit warning in both run and replay paths. Addresses review item I4 from PR #698 convergence review. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Force-push: e5dec14 to 93844c8
Review Comments Addressed
From #issuecomment-4078259809 (D1–D5: Documentation gaps)
All five addressed in
From #issuecomment-4083542833 (Convergence review round 1)
Fixed:
Intentional design (no change):
…#443) Roofline and KV capacity calculations now correctly use quantized weight precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.
- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts
Co-Authored-By: Claude Opus 4.6 <[email protected]>
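The sentinel semantics from this commit message can be illustrated with a minimal sketch (the function name is illustrative, not the repo's actual helper):

```go
package main

import "fmt"

// weightBytes illustrates the decoupling: weight memory uses
// WeightBytesPerParam, with the zero value acting as a sentinel that
// falls back to the compute dtype's BytesPerParam. KV cache and
// activations keep BytesPerParam regardless.
func weightBytes(numParams, bytesPerParam, weightBytesPerParam float64) float64 {
	if weightBytesPerParam == 0 { // sentinel: no quantization detected
		weightBytesPerParam = bytesPerParam
	}
	return numParams * weightBytesPerParam
}

func main() {
	// 70B W4A16 model: 0.5 bytes/param for weights, bfloat16 compute dtype.
	fmt.Println(weightBytes(70e9, 2.0, 0.5)) // quantized weight footprint
	fmt.Println(weightBytes(70e9, 2.0, 0))   // sentinel: full bfloat16 footprint
}
```

For the 70B example this is 35 GB of weights instead of 140 GB, which is the 4x overestimate the PR eliminates.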
…uant warnings Address 3 IMPORTANT findings from convergence review Round 1:
1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could not be auto-detected (both analytical and blackbox paths)
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Move --weight-bytes-per-param validation before backend switch to avoid Fatalf inside the best-effort blackbox block (preserves fall-through contract). Make weight precision log backend-specific: roofline uses it for step time; trained-roofline/crossmodel only for KV capacity. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Convergence R3: gofmt -w to re-align ModelConfig struct fields and cmd/root.go var block after longest field name changed. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add code comment in rooflineStepTime noting that MFU values were calibrated against FP16 measurements. For quantized models (W4A16), reduced weight bandwidth shifts the roofline crossover, producing conservative estimates — safe for capacity planning. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…verage, validation comment
- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation
Co-Authored-By: Claude Opus 4.6 <[email protected]>
… CLI flag
Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close the gap where w8a8 models silently fell back to torch_dtype. Add model name convention detection (w4a16, fp8) as a second-tier fallback. Remove the --weight-bytes-per-param CLI flag — all three issue #443 options are now covered by auto-detection, making the manual override unnecessary.
Detection chain: quantization_config → model name → torch_dtype fallback.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
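The three-tier detection chain named in this commit message can be sketched as follows. This is a simplified stand-in, not the repo's actual parser; the function name and the two naming conventions shown are taken from the commit text.

```go
package main

import (
	"fmt"
	"strings"
)

// detectWeightBytes sketches the chain:
// quantization_config → model-name convention → torch_dtype fallback.
func detectWeightBytes(configBytes float64, modelName string, torchDtypeBytes float64) float64 {
	if configBytes > 0 { // tier 1: parsed from quantization_config
		return configBytes
	}
	name := strings.ToLower(modelName) // tier 2: naming conventions
	switch {
	case strings.Contains(name, "w4a16"):
		return 0.5
	case strings.Contains(name, "fp8"):
		return 1.0
	}
	return torchDtypeBytes // tier 3: compute dtype (e.g. 2.0 for bfloat16)
}

func main() {
	fmt.Println(detectWeightBytes(0, "redhatai/llama-3.3-70b-instruct-quantized.w4a16", 2.0))
	fmt.Println(detectWeightBytes(0, "meta-llama/Llama-3.1-8B", 2.0))
}
```

Ordering matters: an explicit quantization_config always wins over a name match, and the torch_dtype tier guarantees a usable value for unquantized models.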
…verage, validation comment
C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go)
and add model name fallback to cmd/replay.go roofline path (was missing).
Three call sites (root.go blackbox, root.go roofline, replay.go) now
share identical fallback + logging logic.
C2: Sort config_groups keys before iteration in compressed-tensors
parsing (INV-6 determinism).
I2: Add string-to-int coercion for quantization_config.bits field
(handles "bits": "4" from some HF configs).
I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode
end-to-end test verifying quantized model produces lower decode
step time than FP16 in memory-bound regime.
I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError
covering +Inf validation gap.
Minor: Remove double blank line in root.go init().
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Force-push: 93844c8 to 4b51d0c
sriumcp
left a comment
Quick-Review Findings
Completed systematic review using 10 PR code perspectives (PC-1 through PC-10). Found 10 issues: 3 medium severity, 7 low severity.
Medium Severity Issues
1. Duplicated quantization warning logic across files (DRY violation)
- Location: cmd/root.go:655-662, cmd/replay.go:358-361
- Issue: The applyWeightPrecisionFallback function is called in three places with similar warning patterns. The trained-roofline warning appears twice with identical text.
- Suggestion: Consider consolidating the warning logic into applyWeightPrecisionFallback or a shared helper function to maintain DRY principle.
2. WeightBytesPerParam > BytesPerParam case not validated
- Location: sim/latency/config.go:385-390
- Issue: The validation allows WeightBytesPerParam > BytesPerParam without warning. While the comment acknowledges this is 'unusual but not invalid', this could indicate a configuration error (e.g., INT4 KV cache with FP32 weights).
- Suggestion: Add a warning log when WeightBytesPerParam > BytesPerParam to alert users of unusual configuration.
3. Compressed-tensors parsing silently fails on malformed config_groups
- Location: sim/latency/config.go:345-365
- Issue: If config_groups exists but has unexpected structure (e.g., weights is not a map), the code silently continues without logging. Users with malformed compressed-tensors configs won't know why quantization wasn't detected.
- Suggestion: Add debug logging when config_groups structure doesn't match expected schema.
Low Severity Issues
4. GetModelConfigFromHF function exceeds 50 lines (R14 violation)
- Location: sim/latency/config.go:195-377 (182 lines)
- Issue: Function handles multiple concerns: base config parsing, MoE detection, quantization config parsing (GPTQ/AWQ/FP8/compressed-tensors), and struct construction.
- Suggestion: Extract quantization parsing into a separate function: parseQuantizationConfig(qc map[string]any) float64
5. String-to-int coercion for bits field lacks error handling
- Location: sim/latency/config.go:329-335
- Issue: When bits is a string, strconv.Atoi error is silently ignored. Invalid string values like 'abc' would silently result in weightBytesPerParam=0 without warning.
- Suggestion: Log a warning when string-to-int coercion fails.
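A sketch of the suggested hardening. This is a generic stand-in: the repo logs through logrus, while plain fmt is used here to keep the example self-contained.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseBits coerces quantization_config.bits, which HF configs may carry
// as a JSON number or a string, into an int. A string that fails to parse
// (e.g. "abc") now warns instead of silently yielding the zero sentinel.
func parseBits(raw any) int {
	switch v := raw.(type) {
	case float64: // encoding/json decodes JSON numbers as float64
		return int(v)
	case string:
		if parsed, err := strconv.Atoi(v); err == nil {
			return parsed
		}
		fmt.Printf("warning: quantization_config.bits %q is not an integer; skipping\n", v)
	default:
		if raw != nil {
			fmt.Printf("warning: quantization_config.bits has unexpected type %T; skipping\n", raw)
		}
	}
	return 0 // zero sentinel: precision undetermined
}

func main() {
	fmt.Println(parseBits(float64(4)))
	fmt.Println(parseBits("8"))
	fmt.Println(parseBits("abc"))
}
```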
6. Magic number 8.0 for bits-to-bytes conversion
- Location: sim/latency/config.go:327, 342, 365
- Issue: Division by 8.0 appears multiple times without a named constant.
- Suggestion: Define const bitsPerByte = 8.0 at package level.
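The suggested constant in context, as a minimal sketch (the helper name is illustrative):

```go
package main

import "fmt"

// bitsPerByte names the 8.0 divisor repeated across the parsing code
// (review issue #6).
const bitsPerByte = 8.0

// weightBytesFromBits converts a quantization bit width into bytes per
// parameter: 4-bit (W4A16) gives 0.5, 8-bit (FP8/INT8) gives 1.0.
func weightBytesFromBits(bits int) float64 {
	return float64(bits) / bitsPerByte
}

func main() {
	fmt.Println(weightBytesFromBits(4), weightBytesFromBits(8))
}
```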
7. Map iteration order in compressed-tensors config_groups
- Location: sim/latency/config.go:349-364
- Issue: Code iterates over config_groups map and breaks on first valid num_bits. While keys are sorted (INV-6 compliance), the comment doesn't explain why first-match is correct.
- Suggestion: Add comment explaining first-match semantics or validate that all groups have same num_bits.
8. Roofline approximation comment could be more specific
- Location: sim/latency/roofline.go:200-203
- Issue: Comment describes a 'known approximation' for quantized models but doesn't quantify the expected error magnitude.
- Suggestion: Add quantitative guidance or reference a hypothesis/benchmark.
9. Inconsistent spacing in concatenated string
- Location: cmd/root.go:576
- Issue: Inconsistent spacing in string concatenation compared to line 659.
- Suggestion: Use consistent spacing pattern.
10. Inconsistent struct field alignment
- Location: sim/model_hardware_config.go:19
- Issue: New WeightBytesPerParam field breaks the previous alignment pattern.
- Suggestion: Align field with others for consistency.
All findings have been added to Bob Findings panel for detailed review.
sriumcp
left a comment
Follow-up Review: Remaining Issues
Thank you for addressing issues #6 (bitsPerByte constant) and #8 (roofline approximation comment). However, 8 issues remain unaddressed:
Medium Severity (Still Open)
Issue #1: Duplicated quantization warning logic (DRY violation)
- Status: Not addressed
- Location: cmd/root.go:655-662, cmd/replay.go:358-361
- Issue: The trained-roofline warning still appears twice with identical text
- Suggestion: Consolidate warning logic into applyWeightPrecisionFallback or shared helper
Issue #2: WeightBytesPerParam > BytesPerParam case not validated
- Status: Not addressed
- Location: sim/latency/config.go:383-393
- Issue: Comment acknowledges this is unusual but no warning is logged when it occurs
- Suggestion: Add logrus.Warnf when WeightBytesPerParam > BytesPerParam to alert users
Issue #3: Compressed-tensors parsing silently fails on malformed config_groups
- Status: Not addressed
- Location: sim/latency/config.go:345-365
- Issue: Malformed config_groups structure fails silently without logging
- Suggestion: Add logrus.Debugf when config_groups structure doesn't match expected schema
Low Severity (Still Open)
Issue #4: GetModelConfigFromHF function exceeds 50 lines (R14 violation)
- Status: Not addressed
- Location: sim/latency/config.go:195-377 (182 lines)
- Suggestion: Extract quantization parsing: parseQuantizationConfig(qc map[string]any) float64
Issue #5: String-to-int coercion for bits field lacks error handling
- Status: Not addressed
- Location: sim/latency/config.go:329-335
- Issue: Invalid string values like abc silently result in weightBytesPerParam=0
- Suggestion: Log warning when strconv.Atoi fails
Issue #7: Map iteration order in compressed-tensors config_groups
- Status: Not addressed
- Location: sim/latency/config.go:349-364
- Issue: First-match semantics not explained in comments
- Suggestion: Add comment explaining why first valid num_bits is used
Issue #9: Inconsistent spacing in concatenated string
- Status: Not addressed
- Location: cmd/root.go:576
- Issue: Space before closing quote differs from line 659 pattern
- Suggestion: Use consistent spacing pattern
Issue #10: Inconsistent struct field alignment
- Status: Not addressed
- Location: sim/model_hardware_config.go:19
- Issue: WeightBytesPerParam field breaks alignment pattern
- Suggestion: Align with other fields for consistency
Progress: 2/10 issues resolved. Please address remaining issues, prioritizing the 3 medium severity items.
…prove code quality
- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)
All changes maintain backward compatibility and pass existing tests.
sriumcp
left a comment
✅ All Review Issues Resolved
Excellent work addressing all 10 review findings! All issues have been successfully resolved:
Resolved Issues
✅ Issue #1 (Medium): Duplicated quantization warning logic - Consolidated into helper function
✅ Issue #2 (Medium): WeightBytesPerParam > BytesPerParam validation - Added warning at lines 415-418
✅ Issue #3 (Medium): Compressed-tensors parsing silent failures - Added debug logging for malformed config_groups structure
✅ Issue #4 (Low): GetModelConfigFromHF function length - Extracted helper function (61 lines), reducing main function complexity
✅ Issue #5 (Low): String-to-int coercion error handling - Added debug logging when bits string-to-int conversion fails
✅ Issue #6 (Low): Magic number 8.0 - Replaced with the bitsPerByte constant
✅ Issue #7 (Low): Map iteration semantics - Added comment explaining first-match semantics in compressed-tensors parsing
✅ Issue #8 (Low): Roofline approximation comment - Enhanced with quantitative guidance (~2× overestimate, references h-quantized-roofline)
✅ Issue #9 (Low): Inconsistent string spacing - Fixed spacing in cmd/root.go:576
✅ Issue #10 (Low): Struct field alignment - WeightBytesPerParam field properly aligned in model_hardware_config.go
Code Quality Improvements
- Better separation of concerns with helper functions
- Improved error handling and diagnostic logging
- Enhanced documentation with quantitative guidance
- Maintained backward compatibility
- All changes follow project standards (R-rules, INV-invariants)
Status: 10/10 issues resolved. Ready to merge! 🎉
* feat(latency): decouple quantized weight precision from compute dtype (#443) Roofline and KV capacity calculations now correctly use quantized weight precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations. - Add WeightBytesPerParam field to ModelConfig with zero-value sentinel - Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8) - Add --weight-bytes-per-param CLI flag for manual override - Update calculateMemoryAccessBytes() and computeModelWeightBytes() - Add validation in ValidateRooflineConfig - 18 new tests covering all behavioral contracts Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings Address 3 IMPORTANT findings from convergence review Round 1: 1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc path (R23: code path parity with analytical backend) 2. Add WeightBytesPerParam validation in CalculateKVBlocks public API 3. Warn when quantization_config is present but weight precision could not be auto-detected (both analytical and blackbox paths) Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(cmd): convergence R2 — early flag validation, backend-specific logs Move --weight-bytes-per-param validation before backend switch to avoid Fatalf inside the best-effort blackbox block (preserves fall-through contract). Make weight precision log backend-specific: roofline uses it for step time; trained-roofline/crossmodel only for KV capacity. Co-Authored-By: Claude Opus 4.6 <[email protected]> * style: gofmt struct/var alignment after WeightBytesPerParam addition Convergence R3: gofmt -w to re-align ModelConfig struct fields and cmd/root.go var block after longest field name changed. 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * style: gofmt const/map alignment in kv_capacity files Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: document MFU calibration approximation for quantized models Add code comment in rooflineStepTime noting that MFU values were calibrated against FP16 measurements. For quantized models (W4A16), reduced weight bandwidth shifts the roofline crossover, producing conservative estimates — safe for capacity planning. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment - Use strings.EqualFold for FP8 quant_method matching (case-insensitive) - Add test for FP8 with bits=8 present (verifies bits-first path agrees) - Document why WeightBytesPerParam > BytesPerParam is accepted in validation Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): complete three-tier quantized weight detection, remove CLI flag Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close the gap where w8a8 models silently fell back to torch_dtype. Add model name convention detection (w4a16, fp8) as a second-tier fallback. Remove the --weight-bytes-per-param CLI flag — all three issue #443 options are now covered by auto-detection, making the manual override unnecessary. Detection chain: quantization_config → model name → torch_dtype fallback. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go) and add model name fallback to cmd/replay.go roofline path (was missing). Three call sites (root.go blackbox, root.go roofline, replay.go) now share identical fallback + logging logic. C2: Sort config_groups keys before iteration in compressed-tensors parsing (INV-6 determinism). 
I2: Add string-to-int coercion for quantization_config.bits field (handles "bits": "4" from some HF configs).
I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode end-to-end test verifying quantized model produces lower decode step time than FP16 in memory-bound regime.
I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError covering +Inf validation gap.
Minor: Remove double blank line in root.go init().

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and recommended blackbox mode, contradicting the three-tier auto-detection merged in this branch. Addresses PR #698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training pipeline, so WeightBytesPerParam only affects KV capacity, not step time. Previously the CLI logged "quantized model detected" without mentioning this limitation, which was misleading. Now emits an explicit warning in both run and replay paths. Addresses review item I4 from PR #698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address cosmetic review comments

- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

* Address PR #698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

* feat(sim): add L40S GPU and FP8 compute support

- Add L40S hardware configuration (362.05 TFLOPS, 48GB, 0.864 TB/s)
- Add TFlopsFP8 field to HardwareCalib for native FP8 tensor core support
- Update H100 with TFlopsFP8=1979.0 (2× FP16 rate) and adjusted MFU values
- Update A100-SXM and A100-80 with TFlopsFP8=0 (no native FP8 support)
- Implement FP8 compute selection in roofline model based on weight precision
- Add comprehensive tests for FP8 compute selection logic

Fixes #762

Co-Authored-By: Claude <[email protected]>

* docs: add MFU justification and validation tests

- Add inline documentation to hardware_config.json linking to Discussion #589
- Add comprehensive MFU validation tests in sim/config_test.go
  - Validates MFU ranges (0 < MFU < 1)
  - Validates MfuDecode < MfuPrefill relationship
  - Tests all GPU types (H100, A100, L40S)
- Update docs/reference/models.md with MFU calibration info box
- All tests pass

Addresses review findings from quick-review

* feat(latency): add TFlopsFP8 validation (R3)

Add validation for HardwareCalib.TFlopsFP8 in ValidateRooflineConfig, following the same optional-field pattern as MemoryGiB: check if non-zero, then reject NaN/Inf/negative values. Includes test coverage for NaN and negative TFlopsFP8 cases.

* fix: address review nits

- roofline.go:217: Fix comment wording from 'upcasted' to 'dequantized to FP16 during GEMM'
- hardware_config.json:8: Restore H100 mfuDecode to 0.30 for consistency with other entries

* docs(roofline): clarify FP8 compute rate selection logic

Improve comment to explicitly document that == 1.0 is intentional:

- FP8 models (exactly 1.0 byte/param) use FP8 compute rate on H100
- Sub-FP8 formats (e.g., W4A16 at 0.5 bytes/param) dequantize to FP16 during GEMM

This addresses the review comment about == 1.0 exact equality. The behavior is correct: only true FP8 models use FP8 tensor cores. W4A16 and other sub-FP8 formats use FP16 compute after dequantization, as validated by TestRooflineStepTime_FP8ComputeSelection_EdgeCases.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
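The FP8 compute-rate rule these commits describe can be sketched in a few lines. This is a minimal illustration, not the simulator's actual code: the struct and function names are assumptions, and the H100 FP16 figure is inferred as half of the stated 1979.0 FP8 rate.

```go
package main

import "fmt"

// hardwareCalib mirrors the two compute-rate fields discussed above.
// Names and numbers are illustrative; the real values live in
// hardware_config.json.
type hardwareCalib struct {
	TFlopsFP16 float64
	TFlopsFP8  float64 // 0 means no native FP8 tensor cores (e.g. A100)
}

// selectComputeTFlops applies the rule the commits describe: only a true
// FP8 model (exactly 1.0 byte/param) on a GPU with a native FP8 rate uses
// TFlopsFP8; sub-FP8 formats such as W4A16 (0.5 bytes/param) dequantize to
// FP16 during GEMM and therefore use the FP16 rate.
func selectComputeTFlops(hw hardwareCalib, weightBytesPerParam float64) float64 {
	if weightBytesPerParam == 1.0 && hw.TFlopsFP8 > 0 {
		return hw.TFlopsFP8
	}
	return hw.TFlopsFP16
}

func main() {
	h100 := hardwareCalib{TFlopsFP16: 989.5, TFlopsFP8: 1979.0} // FP16 rate assumed as half of 1979.0
	a100 := hardwareCalib{TFlopsFP16: 312.0, TFlopsFP8: 0}
	fmt.Println(selectComputeTFlops(h100, 1.0)) // FP8 model on H100: FP8 rate
	fmt.Println(selectComputeTFlops(h100, 0.5)) // W4A16: FP16 rate despite native FP8
	fmt.Println(selectComputeTFlops(a100, 1.0)) // FP8 model, no native FP8: FP16 rate
}
```

The exact `== 1.0` comparison is deliberate, per the review discussion: the sentinel-free precision value is set from a discrete detection chain, not arithmetic, so float equality is safe here.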
* feat(infra): Phase 1A — nodes, GPUs, and instance lifecycle

Add node/GPU placement, instance lifecycle management, multi-model routing, and per-model metrics to the cluster simulator.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* chore: add AGENTS.md to .gitignore

- AGENTS.md is a generated file for agent context
- Should not be committed to the repository

* fix(cluster): prevent double-counting in drain redirect policy

Fixes INV-1 conservation violation when DrainRedirect policy re-injects queued requests. Previously, redirected requests were counted twice in CompletedRequests: once when initially injected and again when completed after redirection.

Changes:
- Add Request.Redirected field to track re-injected requests
- Mark requests as redirected in drainRedirect.Drain() before re-injection
- Skip CompletedRequests increment in recordRequestCompletion() for redirected requests
- Add TestInstanceLifecycle_RedirectDrainPreservesConservation to verify fix

The fix ensures INV-1 (request conservation) holds: injected = completed + queued + running + dropped + timed_out

Addresses PR #697 review feedback on drain redirect conservation.

* fix: address PR #697 review comments

- Document DrainPolicy as Phase 1C infrastructure
- Fix warm-up request overcounting by checking warmUpRemaining
- Fix drain callback memory leak in MarkNodeTerminated
- Add named constants for lifecycle event priorities
- Clarify drainWait GPU release and swap-remove logic
- Add .bob/ to .gitignore

* Fix warm-up TTFT penalty implementation

- Initialize warmUpRemaining for all instances in backward-compat mode
- Fix indentation in cluster_event.go
- Update warm-up recording to track first N requests
- Adjust test expectations to account for queueing effects

Fixes build and test failures in PR.
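The INV-1 invariant referenced above (injected = completed + queued + running + dropped + timed_out) can be expressed as a small check. Field and function names here are illustrative stand-ins, not the simulator's actual structs:

```go
package main

import "fmt"

// rawMetrics names the counters that appear in the INV-1 equation above.
// The field names are illustrative, not the simulator's real metrics struct.
type rawMetrics struct {
	Injected, Completed, Queued, Running, Dropped, TimedOut int
}

// conservationHolds checks INV-1: every injected request is accounted for
// exactly once. Double-counting a redirected completion (the bug described
// above) makes the right-hand side exceed Injected.
func conservationHolds(m rawMetrics) bool {
	return m.Injected == m.Completed+m.Queued+m.Running+m.Dropped+m.TimedOut
}

func main() {
	// One redirected request, counted once at its destination.
	m := rawMetrics{Injected: 10, Completed: 7, Queued: 1, Running: 1, Dropped: 1}
	fmt.Println(conservationHolds(m)) // true
	// The pre-fix bug: the same completion counted at source and destination.
	m.Completed++
	fmt.Println(conservationHolds(m)) // false
}
```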
* chore: update .gitignore to ignore .bob/notes instead of entire .bob/ directory

* fix(cluster): address sriumcp review — conservation, replay anomaly, non-variadic CollectRawMetrics

- sim/simulator.go: Remove incorrect `if !req.Redirected` guard on CompletedRequests++. The guard caused redirected requests to vanish from INV-1 accounting: source's InjectedRequests=0 (drained from WaitQ before completion), destination's InjectedRequests=0 (skipped CompletedRequests). Destination is the sole completion site so incrementing there preserves conservation.
- cmd/replay.go: Add `|| rawMetrics.RoutingRejections > 0` to anomaly condition. Clusters where all failures are routing rejections (no routable instances) silently omitted the anomaly summary block (I3 from sriumcp review).
- sim/cluster/metrics.go: Make CollectRawMetrics routingRejections parameter explicit (non-variadic). Prevents call sites from silently passing 0 and missing routing rejections. Updated all test call sites to pass 0 explicitly.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* revert(ci): remove skip-cache from golangci-lint step

Unrelated to Phase 1A changes. skip-cache: true was added during development but should not be merged to main.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* fix(cluster): address sriumcp Round 2 — test, stale comment, duplicate Metrics.Requests

- sim/cluster/instance_lifecycle_test.go: Rewrite conservation test to actually seed requests into inst0's WaitQ before drain. Previous version used a manual event loop on an empty queue — DrainWaitQueue() returned [] and no redirection ever occurred. New version uses inst0.sim.EnqueueRequest directly + empty workload so Run() doesn't push duplicate arrivals. Also adds pre/post assertions: QueueDepth==0, inFlightRequests==0, and clusterEvents non-empty after drain.
- sim/request.go: Update Redirected field comment to reflect actual behavior. Previous comment said "completion accounting is skipped" — opposite of what simulator.go:recordRequestCompletion now does. - sim/cluster/infra_lifecycle_event.go: Delete stale Metrics.Requests entry for redirected requests before re-injection. Source registered the request at EnqueueRequest time; DrainWaitQueue empties WaitQ but left the entry. Destination re-registers on re-enqueue, causing a spurious "duplicate request ID" WARN in aggregateMetrics() for every redirected request. Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * docs: address sriumcp documentation gaps (GAP 1-4) GAP 1 — configuration.md: Add node_pools and instance_lifecycle YAML schema to the Policy Bundle section so users can discover and configure Phase 1A features. Both are YAML-only (no CLI flags). Add a note block explaining backward compatibility. Update DeploymentConfig row in the config-to-flag mapping table to note the YAML-only fields. GAP 2 — results.md Anomaly Counters: Rename "Rejected Requests" to "Rejected Requests (Admission)" to match actual CLI output label. Add new "Rejected Requests (Routing)" row explaining when it fires (no routable instances — all Loading/Draining) and the remediation action. GAP 3 — results.md Per-Model Metrics: Change mean= to p50= in the example output block to match printPerModelMetrics which uses m.TTFT.P50. Add tok/s to the Throughput example line to match actual output format. GAP 4 — results.md per_model JSON: Add table documenting the per_model key in --results-path JSON output (omitted when no model tags present), with field-by-field description. Co-Authored-By: Claude Sonnet 4.6 <[email protected]> --------- Co-authored-by: tantawi <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> Co-authored-by: Dipanwita Guhathakurta <[email protected]> Co-authored-by: Srinivasan Parthasarathy <[email protected]>
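The `== 1.0` selection rule documented in the roofline commits above can be sketched as follows. This is a minimal illustration, not the repo's code: `pickComputeTFLOPS` and its parameter names are assumptions, and the TFLOPS figures are placeholders.

```go
package main

import "fmt"

// pickComputeTFLOPS sketches the FP8 compute-rate selection: only exact
// FP8 weights (1.0 byte/param) on hardware with a native FP8 rate
// (tflopsFP8 > 0) use FP8 tensor cores; sub-FP8 formats such as W4A16
// (0.5 bytes/param) dequantize to FP16 during GEMM and use the FP16 rate.
func pickComputeTFLOPS(weightBytesPerParam, tflopsFP16, tflopsFP8 float64) float64 {
	if weightBytesPerParam == 1.0 && tflopsFP8 > 0 {
		return tflopsFP8
	}
	return tflopsFP16
}

func main() {
	fmt.Println(pickComputeTFLOPS(1.0, 989.5, 1979.0)) // FP8 model on H100: FP8 rate
	fmt.Println(pickComputeTFLOPS(0.5, 989.5, 1979.0)) // W4A16: FP16 rate after dequant
	fmt.Println(pickComputeTFLOPS(1.0, 312.0, 0))      // FP8 model on A100: no native FP8
}
```

The exact-equality check is deliberate, per the commit message: TFlopsFP8=0 doubles as the "no native FP8 support" sentinel, so A100-class entries fall through to the FP16 rate.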
Summary
Closes #443 — Roofline latency model and KV capacity auto-calculation now correctly handle quantized models (GPTQ, AWQ, FP8).
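The three formats map onto distinct weight widths. A hedged Go sketch of the detection, assuming a bits-first-then-method order; the helper name `weightBytesFromQuantConfig` and exact precedence are illustrative, not the repo's code:

```go
package main

import (
	"fmt"
	"strings"
)

// weightBytesFromQuantConfig maps an HF quantization_config onto weight
// bytes per parameter. Bit-width formats (GPTQ, AWQ) carry a "bits" field;
// FP8 is matched on quant_method case-insensitively, since configs emit
// both "fp8" and "FP8". Returns 0 as a sentinel meaning "not detected,
// fall back to the compute dtype".
func weightBytesFromQuantConfig(quantMethod string, bits int) float64 {
	if bits > 0 {
		return float64(bits) / 8.0
	}
	if strings.EqualFold(quantMethod, "fp8") {
		return 1.0
	}
	return 0
}

func main() {
	fmt.Println(weightBytesFromQuantConfig("gptq", 4)) // W4A16-style GPTQ
	fmt.Println(weightBytesFromQuantConfig("awq", 4))  // AWQ, same width
	fmt.Println(weightBytesFromQuantConfig("FP8", 0))  // case-insensitive match
	fmt.Println(weightBytesFromQuantConfig("", 0))     // sentinel: no quantization
}
```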
`BytesPerParam` was derived from `torch_dtype` (the compute dtype), so quantized models like W4A16 reported 2.0 bytes/param instead of 0.5. This caused roofline to overestimate weight bandwidth by 4x and KV capacity to undercount available blocks.

- Added a `WeightBytesPerParam` field to `ModelConfig` (zero-sentinel = fallback to `BytesPerParam`), parsed from the HF `quantization_config`.
- Roofline `model_weights` and KV capacity `computeModelWeightBytes()` now use `EffectiveWeightBytesPerParam()`, while KV cache and activations continue using `BytesPerParam`.
- Added a `--weight-bytes-per-param` flag for manual override when auto-detection is insufficient.

Changes
| File | Change |
| --- | --- |
| `sim/model_hardware_config.go` | `WeightBytesPerParam` field + `EffectiveWeightBytesPerParam()` method |
| `sim/latency/config.go` | `quantization_config` + validation |
| `sim/latency/roofline.go` | `EffectiveWeightBytesPerParam()` for `model_weights` |
| `sim/latency/kv_capacity.go` | `EffectiveWeightBytesPerParam()` for weight memory |
| `cmd/root.go` | `--weight-bytes-per-param` flag + quantization logging |
| `CLAUDE.md` | |

Test plan
- `EffectiveWeightBytesPerParam()` (sentinel, set, both-zero)
- `quantization_config` parsing (GPTQ, FP8, AWQ, no-quant, bits=0, regression)
- `ValidateRooflineConfig` (negative, NaN, valid)
- `go build ./...` passes
- `go vet ./...` clean

🤖 Generated with Claude Code
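The zero-sentinel accessor exercised by the first test item above can be sketched as follows; the struct shape is a simplification for illustration, with field and method names taken from the PR description.

```go
package main

import "fmt"

// ModelConfig sketch: BytesPerParam is the compute-dtype width (from
// torch_dtype); WeightBytesPerParam is the quantized storage width, with
// zero meaning "unset, fall back to BytesPerParam".
type ModelConfig struct {
	BytesPerParam       float64 // e.g. 2.0 for FP16/BF16
	WeightBytesPerParam float64 // e.g. 0.5 for W4A16; 0 = sentinel
}

// EffectiveWeightBytesPerParam returns the width used for weight-resident
// memory terms; KV cache and activations keep reading BytesPerParam.
func (m ModelConfig) EffectiveWeightBytesPerParam() float64 {
	if m.WeightBytesPerParam > 0 {
		return m.WeightBytesPerParam
	}
	return m.BytesPerParam
}

func main() {
	w4a16 := ModelConfig{BytesPerParam: 2.0, WeightBytesPerParam: 0.5}
	fp16 := ModelConfig{BytesPerParam: 2.0}
	fmt.Println(w4a16.EffectiveWeightBytesPerParam()) // quantized width wins
	fmt.Println(fp16.EffectiveWeightBytesPerParam())  // sentinel falls back
}
```

The "both-zero" test case in the plan covers the degenerate config where neither field is set, in which case the fallback simply returns 0 and validation is expected to reject it upstream.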