
feat(latency): decouple quantized weight precision from compute dtype#698

Merged
sriumcp merged 13 commits into main from feat/443-quantized-weight-bytes-per-param on Mar 19, 2026

Conversation

@susiejojo
Contributor

@susiejojo commented Mar 16, 2026

Summary

Closes #443 — Roofline latency model and KV capacity auto-calculation now correctly handle quantized models (GPTQ, AWQ, FP8).

  • Problem: BytesPerParam was derived from torch_dtype (the compute dtype), so quantized models like W4A16 reported 2.0 bytes/param instead of 0.5. This caused roofline to overestimate weight bandwidth by 4x and KV capacity to undercount available blocks.
  • Fix: Added WeightBytesPerParam field to ModelConfig (zero-sentinel = fallback to BytesPerParam), parsed from HF quantization_config. Roofline model_weights and KV capacity computeModelWeightBytes() now use EffectiveWeightBytesPerParam(), while KV cache and activations continue using BytesPerParam.
  • CLI: New --weight-bytes-per-param flag for manual override when auto-detection is insufficient.
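The zero-sentinel accessor described above can be sketched as follows. This is a minimal illustration, not the PR's exact code: the struct is reduced to the two relevant fields, while the real `ModelConfig` in `sim/model_hardware_config.go` carries many more.

```go
package main

import "fmt"

// ModelConfig is simplified to the two fields relevant here.
type ModelConfig struct {
	BytesPerParam       float64 // compute dtype width, e.g. 2.0 for bfloat16
	WeightBytesPerParam float64 // quantized weight width; 0 = not set (sentinel)
}

// EffectiveWeightBytesPerParam returns the quantized weight precision when
// set, falling back to the compute dtype otherwise. All weight-memory
// callers go through this single accessor.
func (mc ModelConfig) EffectiveWeightBytesPerParam() float64 {
	if mc.WeightBytesPerParam != 0 {
		return mc.WeightBytesPerParam
	}
	return mc.BytesPerParam
}

func main() {
	w4a16 := ModelConfig{BytesPerParam: 2.0, WeightBytesPerParam: 0.5}
	fp16 := ModelConfig{BytesPerParam: 2.0}
	fmt.Println(w4a16.EffectiveWeightBytesPerParam()) // 0.5 (quantized weights)
	fmt.Println(fp16.EffectiveWeightBytesPerParam())  // 2 (falls back to BytesPerParam)
}
```

KV cache and activation sizing keep calling `BytesPerParam` directly, so only weight bandwidth and weight memory see the quantized value.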

Changes

| File | Change |
| --- | --- |
| `sim/model_hardware_config.go` | `WeightBytesPerParam` field + `EffectiveWeightBytesPerParam()` method |
| `sim/latency/config.go` | Parse `quantization_config` + validation |
| `sim/latency/roofline.go` | Use `EffectiveWeightBytesPerParam()` for `model_weights` |
| `sim/latency/kv_capacity.go` | Use `EffectiveWeightBytesPerParam()` for weight memory |
| `cmd/root.go` | `--weight-bytes-per-param` flag + quantization logging |
| `CLAUDE.md` | Documentation updates |

Test plan

  • 3 unit tests for EffectiveWeightBytesPerParam() (sentinel, set, both-zero)
  • 6 tests for quantization_config parsing (GPTQ, FP8, AWQ, no-quant, bits=0, regression)
  • 3 tests for ValidateRooflineConfig (negative, NaN, valid)
  • 3 tests for roofline memory split (W4A16 weights 1/4, KV unchanged, conservation)
  • 3 tests for KV capacity (W4A16 more blocks, monotonicity FP16<FP8<W4, regression)
  • All existing tests pass (zero regressions)
  • go build ./... passes
  • go vet ./... clean

🤖 Generated with Claude Code

@susiejojo force-pushed the feat/443-quantized-weight-bytes-per-param branch 2 times, most recently from 1871f69 to e242160 on March 17, 2026 16:14
@susiejojo
Contributor Author

@claude Review this PR

@claude

claude bot commented Mar 17, 2026

Claude finished @susiejojo's task in 7m 16s.


PR Review: feat(latency): decouple quantized weight precision from compute dtype

  • Gather context and understand the request
  • Read changed files
  • Check standards compliance (rules.md, invariants.md)
  • Post review feedback

Summary

This is a well-structured and well-tested PR. The design is sound — using a zero-sentinel for WeightBytesPerParam with EffectiveWeightBytesPerParam() as the single accessor is idiomatic and avoids polluting calling code with nil checks. I found one correctness issue, two minor concerns, and several positive observations.


🔴 Issue: FP8 quantization_config parsing doesn't check bits=8 — only falls back when bits <= 0

File: sim/latency/config.go:227–235

if bits > 0 {
    weightBytesPerParam = float64(bits) / 8.0
} else if quantMethod == "fp8" {
    weightBytesPerParam = 1.0
}

The bits-first path runs unconditionally for any bits > 0. For an FP8 model that also includes "bits": 8, both branches agree (8/8 = 1.0). However, if a future FP8 config sets "bits": 4 (e.g., a W4 model that accidentally also sets quant_method: "fp8"), the bits branch wins and the fp8 fallback is skipped, so the precedence is implicit rather than enforced. In addition, the fallback compares quant_method case-sensitively, so a config reporting "FP8" (uppercase) would miss the branch entirely. Consider making the comparison case-insensitive:

if bits > 0 {
    weightBytesPerParam = float64(bits) / 8.0
} else if strings.EqualFold(quantMethod, "fp8") {
    weightBytesPerParam = 1.0
}

The current test (TestGetModelConfig_FP8_ParsesQuantizationConfig) only exercises the bits-absent path, so it doesn't catch the interaction case. This is the only logic gap I could identify.


🟡 Minor: ValidateRooflineConfig doesn't validate WeightBytesPerParam > BytesPerParam

File: sim/latency/config.go:329–337

The function validates WeightBytesPerParam is positive when set, but doesn't warn or error when WeightBytesPerParam > BytesPerParam. A quantized weight precision that exceeds the compute dtype (e.g., WeightBytesPerParam=4.0 with BytesPerParam=2.0) would silently produce inflated roofline estimates. This isn't blocked by the issue spec but is a footgun worth a comment at minimum:

// WeightBytesPerParam > BytesPerParam is unusual but not invalid (e.g., INT4 KV cache
// with FP32 weights). No upper-bound check is enforced here.



🟡 Minor: JSON tag weight_bytes_per_param,omitempty on ModelConfig

File: sim/model_hardware_config.go:19

WeightBytesPerParam float64 `json:"weight_bytes_per_param,omitempty"`

The zero-sentinel pattern relies on 0 meaning "not set." But omitempty on float64 omits the field when the value is 0. This is internally consistent — the field won't round-trip through JSON with a value of 0 — but could confuse consumers of any serialized ModelConfig output who expect to see weight_bytes_per_param: 0 explicitly. This is a documentation/contract issue, not a functional bug, since no serialized ModelConfig is consumed at runtime in BLIS today.


✅ Positive observations

Design:

  • Zero-sentinel for WeightBytesPerParam with EffectiveWeightBytesPerParam() as the single accessor is clean. All callers go through one point, so changing semantics later requires changing one function.
  • The ValidateRooflineConfig and CalculateKVBlocks validation for WeightBytesPerParam is consistent: both use the same != 0 outer guard + < 0 || IsNaN || IsInf inner check (no duplication, no divergence).
  • computeModelWeightBytes correctly uses EffectiveWeightBytesPerParam() for weight memory, while perTokenKVBytesF correctly uses BytesPerParam directly — the separation of concerns is explicit.

CLI flag handling:

  • Early validation at cmd/root.go:600–603 catches invalid --weight-bytes-per-param before backend-specific paths. The comment (R3) correctly ties this to the antipattern rules.
  • The cmd.Flags().Changed() guard is correct — 0 is the default, and using Changed() avoids the ambiguity of 0 meaning both "not set" and "zero value."
  • The per-backend logging at cmd/root.go:678–681 correctly distinguishes roofline (where weight precision affects step time) from crossmodel/trained-roofline (where it only affects KV capacity).

Tests:

  • BDD-style GIVEN/WHEN/THEN naming is followed throughout.
  • Invariant tests (monotonicity, conservation, regression anchor) complement the value-matching tests well — this aligns with the principles.md "Test laws, not just values" rule.
  • TestCalculateKVBlocks_W4A16_PerTokenKVBytesUnchanged verifies the key separation: KV bytes use BytesPerParam while weight memory uses WeightBytesPerParam. This is the most important behavioral invariant of the feature.
  • The TestGetModelConfig_NonQuantized_BytesPerParamUnchanged regression anchor at config_test.go:913 correctly calls EffectiveWeightBytesPerParam() for the fallback path, not just the raw field.

Standards compliance:

  • R4 (canonical constructors): NewKVCapacityParams is not changed and the new field is only on ModelConfig, which already had struct-literal construction. No new construction sites introduced.
  • R8 (unexported mutable maps): swiGLUActivations stays unexported, no new exported maps introduced.
  • INV-9 (oracle knowledge boundary): WeightBytesPerParam is used only for weight bandwidth and memory estimation — not in request scheduling or routing. No INV-9 violation.

Net Assessment

Approve with minor items. The correctness issue with FP8 parsing is low-risk (the test coverage for the current behavior is correct), and the two minor items are documentation/footgun concerns. The core logic is sound, the test suite is comprehensive, and the feature correctly resolves the issue #443 root cause.

@susiejojo
Contributor Author

@claude Have your review concerns from earlier been fixed now?

@claude

claude bot commented Mar 17, 2026

Claude finished @susiejojo's task in 51s.


Review follow-up: all three concerns addressed ✅

  • Gather context and read current code
  • Check 🔴 FP8 case-insensitive match + test coverage
  • Check 🟡 WeightBytesPerParam > BytesPerParam validation comment
  • Check 🟡 omitempty JSON tag concern

Yes, all three concerns from the previous review have been resolved in the latest commit (d24dec5).

🔴 FP8 case-insensitive match — Fixed

config.go:232 now uses strings.EqualFold:

} else if strings.EqualFold(quantMethod, "fp8") {
    weightBytesPerParam = 1.0
}

And a new test TestGetModelConfig_FP8WithBits8_ParsesViaBitsPath (config_test.go:833) was added to cover the FP8+bits=8 interaction case, explicitly documenting that both code paths produce 1.0.

🟡 WeightBytesPerParam > BytesPerParam validation — Fixed

config.go:329–332 now has the requested comment:

// WeightBytesPerParam is optional (0 = not set, fall back to BytesPerParam).
// When set, it must be a valid positive number. No upper-bound check is enforced:
// WeightBytesPerParam > BytesPerParam is unusual but not invalid (e.g., INT4 KV cache
// with FP32 weights). Callers should not assume weight precision <= compute precision.

🟡 omitempty JSON tag — Acknowledged (no change needed)

This was flagged as a documentation/contract concern rather than a functional bug. The omitempty tag on WeightBytesPerParam in model_hardware_config.go:19 is still present — which is correct, since 0 means "not set" and omitting it in JSON output is the right behavior for the zero-sentinel pattern. No code change was needed here.


Net assessment: approve. The correctness fix and test coverage are solid. The PR is ready to merge.

@susiejojo force-pushed the feat/443-quantized-weight-bytes-per-param branch from d24dec5 to bbf6b2c on March 17, 2026 17:55
@susiejojo
Contributor Author

Follow-up: complete three-tier detection, remove --weight-bytes-per-param flag

Based on issue #443 comments discussion, this commit addresses all three proposed detection options and simplifies the CLI:

What changed:

  1. Added compressed-tensors parsing (option 1 gap fix) — The original quantization_config parser handled GPTQ/AWQ (bits) and FP8 (implicit) but missed compressed-tensors, which stores precision in config_groups.*.weights.num_bits. Models like RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8 (quant_method: "compressed-tensors") were silently falling back to torch_dtype (2.0 instead of 1.0).

  2. Added model name detection (option 2) — InferWeightBytesFromModelName regex-matches HF naming conventions (w4a16 → 0.5, w8a8 → 1.0, FP8 → 1.0) as a fallback when quantization_config parsing doesn't yield a result.

  3. Removed --weight-bytes-per-param flag (option 3 → no longer needed) — With options 1 and 2 covering all HF quantization formats, the manual override has no practical use case. Removes ~35 lines of flag plumbing (variable, registration, validation, Changed() guards, warning messages referencing the flag).

Detection chain: quantization_config → model name conventions → torch_dtype fallback
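The second tier of that chain, model-name inference, can be sketched like this. The pattern set (w4a16 → 0.5, w8a8 → 1.0, fp8 → 1.0) is taken from this comment; the exact regexes in `InferWeightBytesFromModelName` may differ:

```go
package main

import (
	"fmt"
	"regexp"
)

// namePatterns maps HF naming conventions to weight bytes/param.
// Illustrative sketch of the InferWeightBytesFromModelName fallback.
var namePatterns = []struct {
	re    *regexp.Regexp
	bytes float64
}{
	{regexp.MustCompile(`(?i)w4a16`), 0.5},
	{regexp.MustCompile(`(?i)w8a8`), 1.0},
	{regexp.MustCompile(`(?i)fp8`), 1.0},
}

// inferWeightBytesFromModelName returns the inferred bytes/param, or 0
// when nothing matches (caller then falls back to torch_dtype).
func inferWeightBytesFromModelName(name string) float64 {
	for _, p := range namePatterns {
		if p.re.MatchString(name) {
			return p.bytes
		}
	}
	return 0
}

func main() {
	fmt.Println(inferWeightBytesFromModelName("RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16")) // 0.5
	fmt.Println(inferWeightBytesFromModelName("meta-llama/Llama-3.1-8B-Instruct"))                // 0
}
```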

Tests added: compressed-tensors W8A8, compressed-tensors W4, compressed-tensors missing config_groups (graceful degradation), 10-case table-driven model name detection test.

Verified: go build ./... ✓, go test ./... ✓ (all packages pass), go vet ./...

Collaborator

@sriumcp left a comment


PR Review: Quantized Weight Precision Detection

The PR correctly addresses all explicit concerns from issue #443 — the three-tier detection chain covers all 4 quantization formats, roofline weight bandwidth is corrected, and KV capacity accounts for reduced weight memory. Test quality is excellent (BC-1 through BC-10 contracts, monotonicity/conservation invariant tests, regression anchors).

However, 2 critical issues and 4 important issues should be resolved before merge.


Critical Issues

C1. cmd/replay.go not updated — R23 code path parity violation

cmd/replay.go:345-358 has a parallel code path that loads HF config and constructs modelConfig for roofline/crossmodel/trained-roofline backends. This path is not updated with:

  • The InferWeightBytesFromModelName fallback
  • The new quantization detection info/warning logging
  • The stale warning at line 357 still reads "possible quantization" instead of the new diagnostic messages

Impact: blis replay with a quantized model (e.g. --model redhatai/llama-3.3-70b-instruct-quantized.w4a16) will still overestimate weight memory by up to 4x.

Fix: Extract a shared helper function (e.g. applyWeightPrecisionFallback(mc *sim.ModelConfig, modelName string, hfCfg *latency.HFConfig)) that performs model-name inference, logs quantization info, and warns on undetermined precision. Call it from all three sites:

  1. cmd/root.go blackbox auto-fetch path (~line 608)
  2. cmd/root.go roofline/crossmodel/trained-roofline path (~line 665)
  3. cmd/replay.go (~line 350)

This also resolves the DRY violation (see I1 below).
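The requested helper could look roughly like this. It is a sketch only: the stand-in types replace `sim.ModelConfig` and `latency.HFConfig`, the w4a16 pattern stands in for the full inference tier, and logging is reduced to `fmt`:

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified stand-ins for sim.ModelConfig and the parsed HF config.
type ModelConfig struct {
	BytesPerParam       float64
	WeightBytesPerParam float64
}
type HFConfig struct {
	HasQuantizationConfig bool
}

var w4a16 = regexp.MustCompile(`(?i)w4a16`)

// applyWeightPrecisionFallback sketches the shared helper: keep an
// already-detected precision, otherwise try model-name inference, and warn
// when a quantization_config exists but precision is still undetermined.
func applyWeightPrecisionFallback(mc *ModelConfig, modelName string, hf *HFConfig) {
	if mc.WeightBytesPerParam != 0 {
		fmt.Printf("quantized model detected: %.2g bytes/param\n", mc.WeightBytesPerParam)
		return
	}
	if w4a16.MatchString(modelName) {
		mc.WeightBytesPerParam = 0.5
		fmt.Println("weight precision inferred from model name: 0.5 bytes/param")
		return
	}
	if hf != nil && hf.HasQuantizationConfig {
		fmt.Println("warning: quantization_config present but weight precision undetermined; using torch_dtype")
	}
}

func main() {
	mc := ModelConfig{BytesPerParam: 2.0}
	applyWeightPrecisionFallback(&mc, "redhatai/llama-3.3-70b-instruct-quantized.w4a16", nil)
	fmt.Println(mc.WeightBytesPerParam) // 0.5
}
```

Calling one helper from all three sites keeps the fallback order and the diagnostics from drifting apart again.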

C2. Non-deterministic map iteration in compressed-tensors parsing — INV-6 violation

sim/latency/config.go — the for _, g := range cg loop over config_groups breaks on the first group with weights.num_bits > 0. If a compressed-tensors config has heterogeneous groups (e.g. group_0 at 4-bit, group_1 at 8-bit), Go's non-deterministic map iteration means different runs could pick different groups, producing different WeightBytesPerParam values.

Impact: Violates INV-6 (determinism: same seed produces byte-identical output).

Fix: Sort the map keys before iterating:

if cg, ok := qc["config_groups"].(map[string]any); ok {
    keys := make([]string, 0, len(cg))
    for k := range cg {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    for _, k := range keys {
        g := cg[k]
        // ... existing nested type assertions with break ...
    }
}

Important Issues

I1. DRY violation: duplicated model-name-fallback in cmd/root.go

The InferWeightBytesFromModelName call + quantization_config warning logic is copy-pasted in two code paths within cmd/root.go:

  • Blackbox auto-fetch path (~lines 608-618): missing the "quantized model detected" info log
  • Roofline path (~lines 665-684): has the full logging

Fix: Resolved by C1's shared helper extraction.

I2. Silent type coercion failure in bits field

sim/latency/config.go — if a HF config has "bits": "4" (string instead of float64), the type assertion qc["bits"].(float64) silently fails, bits stays 0, and GPTQ/AWQ models fall through without detecting weight precision. The model name fallback in cmd/root.go would likely catch it, but parsing should be more robust.

Fix: Add string-to-int coercion and log unexpected types:

if bitsRaw, ok := qc["bits"].(float64); ok {
    bits = int(bitsRaw)
} else if bitsStr, ok := qc["bits"].(string); ok {
    if parsed, err := strconv.Atoi(bitsStr); err == nil {
        bits = parsed
    }
} else if qc["bits"] != nil {
    logrus.Warnf("quantization_config.bits has unexpected type %T; skipping", qc["bits"])
}

I3. No rooflineStepTime end-to-end test for quantized models

Tests verify calculateMemoryAccessBytes produces correct weight memory reduction (0.25x for W4A16), but no test calls rooflineStepTime with a quantized config and verifies lower step latency. A bug in how weight bytes interact with the roofline max(compute, memory) crossover would not be caught.

Fix: Add a test in sim/latency/roofline_test.go:

func TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode(t *testing.T) {
    // Memory-bound decode: smaller weight bandwidth produces faster step time
    fp16 := testModelConfig()   // BytesPerParam=2.0, WeightBytesPerParam=0
    w4a16 := testW4A16Config()  // BytesPerParam=2.0, WeightBytesPerParam=0.5
    hw := testHWConfig()
    step := StepConfig{NumTokens: 1, TotalKV: 64, IncludeKVCache: true} // decode step

    fp16Time := rooflineStepTime(fp16, hw, step, 1)
    w4Time := rooflineStepTime(w4a16, hw, step, 1)

    if w4Time >= fp16Time {
        t.Errorf("W4A16 decode should be faster (less weight bandwidth): W4=%d >= FP16=%d", w4Time, fp16Time)
    }
}

I4. Missing Inf test for ValidateRooflineConfig.WeightBytesPerParam

The validation checks for math.IsInf but there's no corresponding test in config_test.go (only negative and NaN are tested). CalculateKVBlocks tests cover Inf separately, but ValidateRooflineConfig is a separate entry point.

Fix: Add in sim/latency/config_test.go:

func TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError(t *testing.T) {
    mc := sim.ModelConfig{NumHeads: 32, NumLayers: 32, HiddenDim: 4096, BytesPerParam: 2, WeightBytesPerParam: math.Inf(1)}
    hc := sim.HardwareCalib{TFlopsPeak: 1000, BwPeakTBs: 3.35, MfuPrefill: 0.5, MfuDecode: 0.3}
    err := latency.ValidateRooflineConfig(mc, hc)
    if err == nil {
        t.Fatal("expected error for Inf WeightBytesPerParam")
    }
    if !strings.Contains(err.Error(), "WeightBytesPerParam") {
        t.Errorf("error should mention WeightBytesPerParam, got: %v", err)
    }
}

Minor

  • Double blank line in cmd/root.go init() (~line 1266): remove one of the two consecutive blank lines after the workload type flag.
  • Style consistency: new tests use t.Errorf with manual if-checks while existing tests in sim/config_test.go use testify assert. Not a blocker but worth aligning.

@sriumcp
Collaborator

sriumcp commented Mar 17, 2026

Documentation Gaps

The PR updated CLAUDE.md but missed 3 user-facing docs that now contain stale "not supported" statements directly contradicting what this PR adds. These should be fixed in this PR since they will mislead users of quantized models.


D1. docs/guide/latency-models.md:213 — Stale "Current limitations" warning

Currently says:

quantization effects (FP8, W4A16, W8A8) are not yet modeled in the analytical backends. These are under active development. For quantized deployments, blackbox mode with calibrated coefficients provides the most accurate results today.

Fix: Replace the quantization portion with something like:

Quantized weight precision (FP8, W4A16, W8A8) is auto-detected from quantization_config, model name conventions, or torch_dtype fallback — weight bandwidth and KV capacity calculations use the detected precision. MFU values are still calibrated against FP16 measurements, so step time estimates for quantized models may be conservative (see roofline.go known-approximation comment). DP and EP are not yet modeled.

D2. docs/concepts/roofline.md:9-10 — Stale "Scope" note

Currently says:

Scope: TP-only, FP16/BF16

and:

does not model data parallelism (DP), expert parallelism (EP), or quantization effects (FP8, W4A16, W8A8)

Fix: Update the scope line to "Scope: TP-only, quantized weight memory supported" and replace the quantization portion:

Quantized weight precision is now modeled for memory bandwidth and KV capacity. The roofline model auto-detects weight precision from HuggingFace quantization_config (GPTQ, AWQ, FP8, compressed-tensors), model name conventions (w4a16, FP8), or falls back to torch_dtype. MFU calibration still assumes FP16 compute. DP and EP are not yet modeled.

D3. docs/reference/models.md:47-48 — Stale "Parallelism and quantization" note

Currently says:

quantization effects (FP8, W4A16, W8A8) are under active development. For quantized model deployments, use blackbox mode with calibrated coefficients for best accuracy.

Fix: Replace with:

Quantized weight precision (FP8, W4A16, W8A8) is auto-detected and used for weight bandwidth and KV capacity calculations in all analytical backends. MFU values are calibrated against FP16 measurements — step time estimates for quantized models may be slightly conservative. DP and EP are not yet modeled.

D4. docs/reference/project-structure.md:42 — Missing WeightBytesPerParam in file annotation

The model_hardware_config.go line should mention the new field. Fix: Update to:

│   ├── model_hardware_config.go # ModelConfig (including WeightBytesPerParam for quantized weight precision, with EffectiveWeightBytesPerParam() accessor), HardwareCalib structs ...

D5. (Optional) docs/contributing/extension-recipes.md — No recipe for adding a quantization format

If someone wants to add support for bitsandbytes (bnb) or HQQ, they need to know:

  1. Add a case in GetModelConfigFromHF() in sim/latency/config.go (the quantization_config parsing block)
  2. Optionally add a regex pattern in InferWeightBytesFromModelName() for model name detection
  3. Add tests covering the new format in sim/latency/config_test.go

A short "Adding a quantization format" entry in extension recipes would make this discoverable. Lower priority than D1-D4.

@susiejojo
Contributor Author

Review fixes addressed (commit 88c548b)

Critical

  • C1: Extracted applyWeightPrecisionFallback shared helper into cmd/hfconfig.go. All three call sites (root.go blackbox, root.go roofline, replay.go) now share identical fallback + logging logic. cmd/replay.go roofline path was missing model name fallback entirely — now fixed.
  • C2: Sorted config_groups keys before iteration in compressed-tensors parsing (INV-6 determinism).

Important

  • I2: Added string-to-int coercion for quantization_config.bits field — handles "bits": "4" (string) from some HF configs via strconv.Atoi fallback.
  • I3: Added TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode — end-to-end test verifying quantized model produces lower decode step time than FP16 in memory-bound regime (ratio observed: 0.27x).
  • I4: Added TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError covering +Inf validation gap.

Minor

  • Removed double blank line in root.go init().

All tests pass (go test ./...), go vet clean.

susiejojo added a commit that referenced this pull request Mar 17, 2026
… and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and
recommended blackbox mode, contradicting the three-tier auto-detection merged
in this branch. Addresses PR #698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@susiejojo force-pushed the feat/443-quantized-weight-bytes-per-param branch from 1604c72 to 3df8b8b on March 18, 2026 13:31
@susiejojo requested a review from sriumcp on March 18, 2026 13:56
@susiejojo force-pushed the feat/443-quantized-weight-bytes-per-param branch from 3df8b8b to 1a60346 on March 18, 2026 14:47
@sriumcp
Collaborator

sriumcp commented Mar 18, 2026

Convergence Review — Round 1 (pr-code gate, sonnet model)

Branch: pr-698 | Commit: c5555ce | Perspectives: 10/10 complete

Result: NOT CONVERGED — 0 CRITICAL, 14 IMPORTANT, ~28 SUGGESTION


IMPORTANT Findings (14)

I1: Non-deterministic map iteration in compressed-tensors parsing (INV-6 violation)

Location: sim/latency/config.go:181 | Raised by: 7/10 perspectives

Sort config group keys before iterating.

I2: Stale documentation — 3 pages say "quantization not yet modeled"

Locations: docs/guide/latency-models.md:213, docs/concepts/roofline.md:9-10, docs/reference/models.md:47-48

Update all three to reflect auto-detection feature.

I3: DRY violation — duplicate model-name fallback + warning in two CLI paths

Location: cmd/root.go:605-617 and cmd/root.go:660-684

Extract shared helper.

I4: Trained-roofline backend silently ignores WeightBytesPerParam

Location: sim/latency/trained_roofline.go:151-153

Document limitation or add log warning.

I5: No upper-bound check on bits or WeightBytesPerParam

Location: sim/latency/config.go:171,264

Add bits > 0 && bits <= 64 guard.

I6: FP8 KV cache dtype not modeled

Location: sim/latency/roofline.go:159,165

Document as known limitation.

I7: GPTQ/AWQ group quantization overhead (~6%) ignored

Location: sim/latency/config.go:237-240

Document as known approximation.

I8: "Conservative (pessimistic)" comment is inaccurate

Location: sim/latency/roofline.go:200-206

Soften to "may be inaccurate".

I9: Duplicate WeightBytesPerParam validation (DRY)

Location: sim/latency/config.go:263 and kv_capacity.go:696

Centralize validation.

I10: No CLI flag for manual WeightBytesPerParam override

Location: cmd/root.go

Add --weight-bytes-per-param flag.

I11: No validated quantized model example in docs

Location: docs/guide/latency-models.md:72-78

Add W4A16 or FP8 example.

I12: Misleading quantization warning in blackbox path

Location: cmd/root.go:608

Suppress or clarify warning.

I13: Pre-existing FractionalBytesPerParam test semantically misleading

Location: sim/latency/kv_capacity_test.go:414

Update comment or convert to WeightBytesPerParam.

I14: Tests couple to internal map keys of unexported function

Location: sim/latency/roofline_test.go:1036,1068

Test through public API instead.

susiejojo added a commit that referenced this pull request Mar 18, 2026
…on (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training
pipeline, so WeightBytesPerParam only affects KV capacity, not step time.
Previously the CLI logged "quantized model detected" without mentioning
this limitation, which was misleading. Now emits an explicit warning in
both run and replay paths.

Addresses review item I4 from PR #698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@susiejojo force-pushed the feat/443-quantized-weight-bytes-per-param branch from e5dec14 to 93844c8 on March 18, 2026 16:56
@susiejojo
Contributor Author

Review Comments Addressed

From #issuecomment-4078259809 (D1–D5: Documentation gaps)

All five addressed in 1a60346:

  • D1 docs/guide/latency-models.md — Removed "not yet modeled" language for quantization; now describes auto-detection and notes MFU calibration caveat
  • D2 docs/concepts/roofline.md — Updated scope from "TP-only, FP16/BF16" to "TP-only, quantized weight memory supported"; describes three-tier detection
  • D3 docs/reference/models.md — Removed blackbox recommendation for quantized models; lists supported formats (GPTQ, AWQ, FP8, compressed-tensors)
  • D4 docs/reference/project-structure.md — Added WeightBytesPerParam and EffectiveWeightBytesPerParam() to model_hardware_config.go annotation
  • D5 docs/contributing/extension-recipes.md — New "Adding New Quantization Formats" recipe with three-tier detection order and code pointers

From #issuecomment-4083542833 (Convergence review round 1)

Fixed:

| # | Issue | Status |
| --- | --- | --- |
| I1 | Map iteration non-determinism (compressed-tensors) | Already fixed — `sort.Strings(keys)` with INV-6 comment |
| I2 | Stale documentation (3 pages) | Fixed in D1–D3 above |
| I3 | CLI code duplication | Already fixed — `applyWeightPrecisionFallback()` in `cmd/hfconfig.go` |
| I4 | Trained-roofline silently ignores `WeightBytesPerParam` | Fixed in e5dec14 — emits warning in both run and replay paths explaining FP16 weight bandwidth is used for step time while quantized precision only affects KV capacity |

Intentional design (no change):

| # | Issue | Rationale |
| --- | --- | --- |
| I5 | No upper-bound on bits | Garbage-in scenario; validation comment at config.go:382 documents the design choice |
| I9 | Validation duplication | Defense-in-depth — `ValidateRooflineConfig` and `CalculateKVBlocks` are independent entry points |
| I10 | No `--weight-bytes-per-param` CLI flag | Three-tier auto-detection covers standard formats; manual override deferred as follow-up |

susiejojo and others added 7 commits March 19, 2026 10:14
…#443)

Roofline and KV capacity calculations now correctly use quantized weight
precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping
the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…uant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc
   path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could
   not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Move --weight-bytes-per-param validation before backend switch to avoid
Fatalf inside the best-effort blackbox block (preserves fall-through
contract). Make weight precision log backend-specific: roofline uses it
for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Convergence R3: gofmt -w to re-align ModelConfig struct fields and
cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add code comment in rooflineStepTime noting that MFU values were
calibrated against FP16 measurements. For quantized models (W4A16),
reduced weight bandwidth shifts the roofline crossover, producing
conservative estimates — safe for capacity planning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…verage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>
susiejojo and others added 4 commits March 19, 2026 10:14
feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close
the gap where w8a8 models silently fell back to torch_dtype. Add model name
convention detection (w4a16, fp8) as a second-tier fallback. Remove the
--weight-bytes-per-param CLI flag — all three issue #443 options are now
covered by auto-detection, making the manual override unnecessary.

Detection chain: quantization_config → model name → torch_dtype fallback.
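The three-tier chain can be sketched as follows; the helper name, the name-convention patterns, and the dtype table are illustrative assumptions based on the description above, not the repository's exact code:

```go
package main

import (
	"fmt"
	"strings"
)

const bitsPerByte = 8.0

// dtypeBytes maps torch_dtype to bytes per parameter (tier 3 fallback).
var dtypeBytes = map[string]float64{"float16": 2.0, "bfloat16": 2.0, "float32": 4.0}

// detectWeightBytes sketches the detection chain. quantBits is whatever was
// parsed from quantization_config (0 if absent). Returns 0 when nothing
// matches, preserving the zero-value sentinel.
func detectWeightBytes(quantBits float64, modelName, torchDtype string) float64 {
	// Tier 1: explicit quantization_config bits (GPTQ/AWQ/compressed-tensors).
	if quantBits > 0 {
		return quantBits / bitsPerByte
	}
	// Tier 2: model-name conventions, matched case-insensitively.
	name := strings.ToLower(modelName)
	switch {
	case strings.Contains(name, "w4a16"):
		return 0.5
	case strings.Contains(name, "fp8"):
		return 1.0
	}
	// Tier 3: torch_dtype fallback (0 if the dtype is unknown).
	return dtypeBytes[torchDtype]
}

func main() {
	fmt.Println(detectWeightBytes(4, "llama-3-8b", "bfloat16"))     // 0.5 (tier 1)
	fmt.Println(detectWeightBytes(0, "Llama-3-8B-FP8", "bfloat16")) // 1 (tier 2)
	fmt.Println(detectWeightBytes(0, "llama-3-8b", "bfloat16"))     // 2 (tier 3)
}
```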

Co-Authored-By: Claude Opus 4.6 <[email protected]>
fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go)
    and add model name fallback to cmd/replay.go roofline path (was missing).
    Three call sites (root.go blackbox, root.go roofline, replay.go) now
    share identical fallback + logging logic.

C2: Sort config_groups keys before iteration in compressed-tensors
    parsing (INV-6 determinism).

I2: Add string-to-int coercion for quantization_config.bits field
    (handles "bits": "4" from some HF configs).

I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode
    end-to-end test verifying quantized model produces lower decode
    step time than FP16 in memory-bound regime.

I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError
    covering +Inf validation gap.

Minor: Remove double blank line in root.go init().
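The C2 sorted-iteration and I2 string-coercion items can be sketched together. The config_groups structure is flattened here for brevity (real compressed-tensors configs nest num_bits under a weights key), and both helper names are hypothetical:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

const bitsPerByte = 8.0

// bitsFromConfigGroups sketches deterministic compressed-tensors parsing:
// sort group keys (INV-6 determinism), then take the first group with a
// valid num_bits.
func bitsFromConfigGroups(groups map[string]map[string]any) float64 {
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic iteration order over the map
	for _, k := range keys {
		if n, ok := coerceInt(groups[k]["num_bits"]); ok && n > 0 {
			return float64(n) / bitsPerByte // first match wins
		}
	}
	return 0 // zero-value sentinel: nothing detected
}

// coerceInt handles both JSON numbers (decoded as float64) and string
// forms like "4" seen in some HF configs.
func coerceInt(v any) (int, bool) {
	switch x := v.(type) {
	case float64:
		return int(x), true
	case string:
		n, err := strconv.Atoi(x)
		return n, err == nil
	}
	return 0, false
}

func main() {
	groups := map[string]map[string]any{
		"group_1": {"num_bits": "8"},         // string form, sorted after group_0
		"group_0": {"num_bits": float64(4)},  // JSON number form
	}
	fmt.Println(bitsFromConfigGroups(groups)) // 0.5
}
```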

Co-Authored-By: Claude Opus 4.6 <[email protected]>
docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and
recommended blackbox mode, contradicting the three-tier auto-detection merged
in this branch. Addresses PR #698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training
pipeline, so WeightBytesPerParam only affects KV capacity, not step time.
Previously the CLI logged "quantized model detected" without mentioning
this limitation, which was misleading. Now emits an explicit warning in
both run and replay paths.

Addresses review item I4 from PR #698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@susiejojo susiejojo force-pushed the feat/443-quantized-weight-bytes-per-param branch from 93844c8 to 4b51d0c on March 19, 2026 14:15

@sriumcp sriumcp left a comment


Quick-Review Findings

Completed systematic review using 10 PR code perspectives (PC-1 through PC-10). Found 10 issues: 3 medium severity, 7 low severity.

Medium Severity Issues

1. Duplicated quantization warning logic across files (DRY violation)

  • Location: cmd/root.go:655-662, cmd/replay.go:358-361
  • Issue: The applyWeightPrecisionFallback function is called in three places with similar warning patterns. The trained-roofline warning appears twice with identical text.
  • Suggestion: Consider consolidating the warning logic into applyWeightPrecisionFallback or a shared helper function to maintain the DRY principle.

2. WeightBytesPerParam > BytesPerParam case not validated

  • Location: sim/latency/config.go:385-390
  • Issue: The validation allows WeightBytesPerParam > BytesPerParam without warning. While the comment acknowledges this is 'unusual but not invalid', this could indicate a configuration error (e.g., INT4 KV cache with FP32 weights).
  • Suggestion: Add a warning log when WeightBytesPerParam > BytesPerParam to alert users of unusual configuration.

3. Compressed-tensors parsing silently fails on malformed config_groups

  • Location: sim/latency/config.go:345-365
  • Issue: If config_groups exists but has unexpected structure (e.g., weights is not a map), the code silently continues without logging. Users with malformed compressed-tensors configs won't know why quantization wasn't detected.
  • Suggestion: Add debug logging when config_groups structure doesn't match expected schema.

Low Severity Issues

4. GetModelConfigFromHF function exceeds 50 lines (R14 violation)

  • Location: sim/latency/config.go:195-377 (182 lines)
  • Issue: Function handles multiple concerns: base config parsing, MoE detection, quantization config parsing (GPTQ/AWQ/FP8/compressed-tensors), and struct construction.
  • Suggestion: Extract quantization parsing into a separate function: parseQuantizationConfig(qc map[string]any) float64

5. String-to-int coercion for bits field lacks error handling

  • Location: sim/latency/config.go:329-335
  • Issue: When bits is a string, strconv.Atoi error is silently ignored. Invalid string values like 'abc' would silently result in weightBytesPerParam=0 without warning.
  • Suggestion: Log a warning when string-to-int coercion fails.

6. Magic number 8.0 for bits-to-bytes conversion

  • Location: sim/latency/config.go:327, 342, 365
  • Issue: Division by 8.0 appears multiple times without a named constant.
  • Suggestion: Define const bitsPerByte = 8.0 at package level.

7. Map iteration order in compressed-tensors config_groups

  • Location: sim/latency/config.go:349-364
  • Issue: Code iterates over config_groups map and breaks on first valid num_bits. While keys are sorted (INV-6 compliance), the comment doesn't explain why first-match is correct.
  • Suggestion: Add comment explaining first-match semantics or validate that all groups have same num_bits.

8. Roofline approximation comment could be more specific

  • Location: sim/latency/roofline.go:200-203
  • Issue: Comment describes a 'known approximation' for quantized models but doesn't quantify the expected error magnitude.
  • Suggestion: Add quantitative guidance or reference a hypothesis/benchmark.

9. Inconsistent spacing in concatenated string

  • Location: cmd/root.go:576
  • Issue: Inconsistent spacing in string concatenation compared to line 659.
  • Suggestion: Use consistent spacing pattern.

10. Inconsistent struct field alignment

  • Location: sim/model_hardware_config.go:19
  • Issue: New WeightBytesPerParam field breaks the previous alignment pattern.
  • Suggestion: Align field with others for consistency.

All findings have been added to Bob Findings panel for detailed review.


- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

@sriumcp sriumcp left a comment


Follow-up Review: Remaining Issues

Thank you for addressing issues #6 (bitsPerByte constant) and #8 (roofline approximation comment). However, 8 issues remain unaddressed:

Medium Severity (Still Open)

Issue #1: Duplicated quantization warning logic (DRY violation)

  • Status: Not addressed
  • Location: cmd/root.go:655-662, cmd/replay.go:358-361
  • Issue: The trained-roofline warning still appears twice with identical text
  • Suggestion: Consolidate warning logic into applyWeightPrecisionFallback or shared helper

Issue #2: WeightBytesPerParam > BytesPerParam case not validated

  • Status: Not addressed
  • Location: sim/latency/config.go:383-393
  • Issue: Comment acknowledges this is unusual but no warning is logged when it occurs
  • Suggestion: Add logrus.Warnf when WeightBytesPerParam > BytesPerParam to alert users

Issue #3: Compressed-tensors parsing silently fails on malformed config_groups

  • Status: Not addressed
  • Location: sim/latency/config.go:345-365
  • Issue: Malformed config_groups structure fails silently without logging
  • Suggestion: Add logrus.Debugf when config_groups structure doesn't match expected schema

Low Severity (Still Open)

Issue #4: GetModelConfigFromHF function exceeds 50 lines (R14 violation)

  • Status: Not addressed
  • Location: sim/latency/config.go:195-377 (182 lines)
  • Suggestion: Extract quantization parsing: parseQuantizationConfig(qc map[string]any) float64

Issue #5: String-to-int coercion for bits field lacks error handling

  • Status: Not addressed
  • Location: sim/latency/config.go:329-335
  • Issue: Invalid string values like abc silently result in weightBytesPerParam=0
  • Suggestion: Log warning when strconv.Atoi fails

Issue #7: Map iteration order in compressed-tensors config_groups

  • Status: Not addressed
  • Location: sim/latency/config.go:349-364
  • Issue: First-match semantics not explained in comments
  • Suggestion: Add comment explaining why first valid num_bits is used

Issue #9: Inconsistent spacing in concatenated string

  • Status: Not addressed
  • Location: cmd/root.go:576
  • Issue: Space before closing quote differs from line 659 pattern
  • Suggestion: Use consistent spacing pattern

Issue #10: Inconsistent struct field alignment

  • Status: Not addressed
  • Location: sim/model_hardware_config.go:19
  • Issue: WeightBytesPerParam field breaks alignment pattern
  • Suggestion: Align with other fields for consistency

Progress: 2/10 issues resolved. Please address remaining issues, prioritizing the 3 medium severity items.

Address PR #698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.
@susiejojo susiejojo requested a review from sriumcp March 19, 2026 17:55

@sriumcp sriumcp left a comment


✅ All Review Issues Resolved

Excellent work addressing all 10 review findings! All issues have been successfully resolved:

Resolved Issues

Issue #1 (Medium): Duplicated quantization warning logic - Consolidated into helper function

Issue #2 (Medium): WeightBytesPerParam > BytesPerParam validation - Added warning at lines 415-418

Issue #3 (Medium): Compressed-tensors parsing silent failures - Added debug logging for malformed config_groups structure

Issue #4 (Low): GetModelConfigFromHF function length - Extracted helper function (61 lines), reducing main function complexity

Issue #5 (Low): String-to-int coercion error handling - Added debug logging when bits string-to-int conversion fails

Issue #6 (Low): Magic number 8.0 - Replaced with the bitsPerByte constant

Issue #7 (Low): Map iteration semantics - Added comment explaining first-match semantics in compressed-tensors parsing

Issue #8 (Low): Roofline approximation comment - Enhanced with quantitative guidance (~2× overestimate, references h-quantized-roofline)

Issue #9 (Low): Inconsistent string spacing - Fixed spacing in cmd/root.go:576

Issue #10 (Low): Struct field alignment - WeightBytesPerParam field properly aligned in model_hardware_config.go

Code Quality Improvements

  • Better separation of concerns with helper functions
  • Improved error handling and diagnostic logging
  • Enhanced documentation with quantitative guidance
  • Maintained backward compatibility
  • All changes follow project standards (R-rules, INV-invariants)

Status: 10/10 issues resolved. Ready to merge! 🎉


@sriumcp sriumcp merged commit 319d1b7 into main Mar 19, 2026
4 checks passed
sriumcp pushed a commit that referenced this pull request Mar 20, 2026
* feat(latency): decouple quantized weight precision from compute dtype (#443)

Roofline and KV capacity calculations now correctly use quantized weight
precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping
the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc
   path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could
   not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): convergence R2 — early flag validation, backend-specific logs

Move --weight-bytes-per-param validation before backend switch to avoid
Fatalf inside the best-effort blackbox block (preserves fall-through
contract). Make weight precision log backend-specific: roofline uses it
for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt struct/var alignment after WeightBytesPerParam addition

Convergence R3: gofmt -w to re-align ModelConfig struct fields and
cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt const/map alignment in kv_capacity files

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: document MFU calibration approximation for quantized models

Add code comment in rooflineStepTime noting that MFU values were
calibrated against FP16 measurements. For quantized models (W4A16),
reduced weight bandwidth shifts the roofline crossover, producing
conservative estimates — safe for capacity planning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close
the gap where w8a8 models silently fell back to torch_dtype. Add model name
convention detection (w4a16, fp8) as a second-tier fallback. Remove the
--weight-bytes-per-param CLI flag — all three issue #443 options are now
covered by auto-detection, making the manual override unnecessary.

Detection chain: quantization_config → model name → torch_dtype fallback.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go)
    and add model name fallback to cmd/replay.go roofline path (was missing).
    Three call sites (root.go blackbox, root.go roofline, replay.go) now
    share identical fallback + logging logic.

C2: Sort config_groups keys before iteration in compressed-tensors
    parsing (INV-6 determinism).

I2: Add string-to-int coercion for quantization_config.bits field
    (handles "bits": "4" from some HF configs).

I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode
    end-to-end test verifying quantized model produces lower decode
    step time than FP16 in memory-bound regime.

I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError
    covering +Inf validation gap.

Minor: Remove double blank line in root.go init().

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and
recommended blackbox mode, contradicting the three-tier auto-detection merged
in this branch. Addresses PR #698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training
pipeline, so WeightBytesPerParam only affects KV capacity, not step time.
Previously the CLI logged "quantized model detected" without mentioning
this limitation, which was misleading. Now emits an explicit warning in
both run and replay paths.

Addresses review item I4 from PR #698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address cosmetic review comments

- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

* Address PR #698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

* feat(sim): add L40S GPU and FP8 compute support

- Add L40S hardware configuration (362.05 TFLOPS, 48GB, 0.864 TB/s)
- Add TFlopsFP8 field to HardwareCalib for native FP8 tensor core support
- Update H100 with TFlopsFP8=1979.0 (2× FP16 rate) and adjusted MFU values
- Update A100-SXM and A100-80 with TFlopsFP8=0 (no native FP8 support)
- Implement FP8 compute selection in roofline model based on weight precision
- Add comprehensive tests for FP8 compute selection logic

Fixes #762

Co-Authored-By: Claude <[email protected]>

* docs: add MFU justification and validation tests

- Add inline documentation to hardware_config.json linking to Discussion #589
- Add comprehensive MFU validation tests in sim/config_test.go
  - Validates MFU ranges (0 < MFU < 1)
  - Validates MfuDecode < MfuPrefill relationship
  - Tests all GPU types (H100, A100, L40S)
- Update docs/reference/models.md with MFU calibration info box
- All tests pass

Addresses review findings from quick-review

* feat(latency): add TFlopsFP8 validation (R3)

Add validation for HardwareCalib.TFlopsFP8 in ValidateRooflineConfig,
following the same optional-field pattern as MemoryGiB: check if non-zero,
then reject NaN/Inf/negative values.

Includes test coverage for NaN and negative TFlopsFP8 cases.

* fix: address review nits

- roofline.go:217: Fix comment wording from 'upcasted' to 'dequantized to FP16 during GEMM'
- hardware_config.json:8: Restore H100 mfuDecode to 0.30 for consistency with other entries

* docs(roofline): clarify FP8 compute rate selection logic

Improve comment to explicitly document that == 1.0 is intentional:
- FP8 models (exactly 1.0 byte/param) use FP8 compute rate on H100
- Sub-FP8 formats (e.g., W4A16 at 0.5 bytes/param) dequantize to FP16 during GEMM

This addresses the review comment about == 1.0 exact equality. The behavior
is correct: only true FP8 models use FP8 tensor cores. W4A16 and other
sub-FP8 formats use FP16 compute after dequantization, as validated by
TestRooflineStepTime_FP8ComputeSelection_EdgeCases.
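The == 1.0 selection rule can be sketched as follows (function name and the rate numbers are illustrative):

```go
package main

import "fmt"

// selectTFlops sketches the FP8 compute-rate selection: only exact FP8
// weights (1.0 byte/param) use the FP8 tensor-core rate, and only when the
// GPU advertises one (tflopsFP8 > 0). Sub-FP8 formats such as W4A16
// (0.5 bytes/param) dequantize to FP16 during GEMM, so they keep the
// FP16 rate.
func selectTFlops(weightBytesPerParam, tflopsFP16, tflopsFP8 float64) float64 {
	if weightBytesPerParam == 1.0 && tflopsFP8 > 0 {
		return tflopsFP8
	}
	return tflopsFP16
}

func main() {
	fmt.Println(selectTFlops(1.0, 989.5, 1979.0)) // 1979 (FP8 model on H100)
	fmt.Println(selectTFlops(0.5, 989.5, 1979.0)) // 989.5 (W4A16 dequantizes to FP16)
	fmt.Println(selectTFlops(1.0, 312.0, 0))      // 312 (A100: no native FP8)
}
```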

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
vishakha-ramani pushed a commit to atantawi/inference-sim that referenced this pull request Mar 20, 2026
vishakha-ramani pushed a commit to atantawi/inference-sim that referenced this pull request Mar 20, 2026
* feat(latency): decouple quantized weight precision from compute dtype (inference-sim#443)

Roofline and KV capacity calculations now correctly use quantized weight
precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping
the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc
   path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could
   not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): convergence R2 — early flag validation, backend-specific logs

Move --weight-bytes-per-param validation before backend switch to avoid
Fatalf inside the best-effort blackbox block (preserves fall-through
contract). Make weight precision log backend-specific: roofline uses it
for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt struct/var alignment after WeightBytesPerParam addition

Convergence R3: gofmt -w to re-align ModelConfig struct fields and
cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt const/map alignment in kv_capacity files

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: document MFU calibration approximation for quantized models

Add code comment in rooflineStepTime noting that MFU values were
calibrated against FP16 measurements. For quantized models (W4A16),
reduced weight bandwidth shifts the roofline crossover, producing
conservative estimates — safe for capacity planning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close
the gap where w8a8 models silently fell back to torch_dtype. Add model name
convention detection (w4a16, fp8) as a second-tier fallback. Remove the
--weight-bytes-per-param CLI flag — all three issue inference-sim#443 options are now
covered by auto-detection, making the manual override unnecessary.

Detection chain: quantization_config → model name → torch_dtype fallback.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
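The three-tier detection chain named in this commit can be sketched as a single fallthrough function. The signature and the specific name-matching rules here are illustrative assumptions; the real implementation parses `quantization_config` (including compressed-tensors `config_groups.*.weights.num_bits`) from the HF `config.json` before consulting the model name.

```go
package main

import (
	"fmt"
	"strings"
)

// detectWeightBytes sketches the chain described above:
// quantization_config bits -> model-name convention -> torch_dtype fallback.
func detectWeightBytes(quantBits int, modelName string, dtypeBytes float64) float64 {
	const bitsPerByte = 8.0
	// Tier 1: explicit bit width from quantization_config (GPTQ, AWQ,
	// FP8, compressed-tensors).
	if quantBits > 0 {
		return float64(quantBits) / bitsPerByte
	}
	// Tier 2: naming conventions such as "w4a16" or "fp8" in the model name.
	name := strings.ToLower(modelName)
	switch {
	case strings.Contains(name, "w4a16"):
		return 0.5
	case strings.Contains(name, "fp8"):
		return 1.0
	}
	// Tier 3: fall back to the compute dtype width (torch_dtype).
	return dtypeBytes
}

func main() {
	fmt.Println(detectWeightBytes(4, "some-model", 2.0))       // tier 1: 0.5
	fmt.Println(detectWeightBytes(0, "some-model-w4a16", 2.0)) // tier 2: 0.5
	fmt.Println(detectWeightBytes(0, "some-model", 2.0))       // tier 3: 2.0
}
```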

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go)
    and add model name fallback to cmd/replay.go roofline path (was missing).
    Three call sites (root.go blackbox, root.go roofline, replay.go) now
    share identical fallback + logging logic.

C2: Sort config_groups keys before iteration in compressed-tensors
    parsing (INV-6 determinism).

I2: Add string-to-int coercion for quantization_config.bits field
    (handles "bits": "4" from some HF configs).

I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode
    end-to-end test verifying quantized model produces lower decode
    step time than FP16 in memory-bound regime.

I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError
    covering +Inf validation gap.

Minor: Remove double blank line in root.go init().

Co-Authored-By: Claude Opus 4.6 <[email protected]>
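The I2 coercion above handles HF configs that serialize `"bits"` as either a JSON number or a string (`"bits": "4"`). A minimal sketch, assuming the value has already been decoded into an `interface{}` (Go's `encoding/json` decodes JSON numbers as `float64`); the helper name is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// coerceBits accepts a quantization_config "bits" value that may be a JSON
// number (decoded as float64) or a string, returning an error otherwise —
// matching the I5 error-handling fix for invalid string coercion.
func coerceBits(v interface{}) (int, error) {
	switch t := v.(type) {
	case float64: // JSON numbers arrive as float64
		return int(t), nil
	case string: // some HF configs emit "bits": "4"
		return strconv.Atoi(t)
	}
	return 0, fmt.Errorf("bits is neither number nor string: %T", v)
}

func main() {
	a, _ := coerceBits(float64(4)) // "bits": 4
	b, _ := coerceBits("4")        // "bits": "4"
	fmt.Println(a, b)
}
```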

* docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and
recommended blackbox mode, contradicting the three-tier auto-detection merged
in this branch. Addresses PR inference-sim#698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training
pipeline, so WeightBytesPerParam only affects KV capacity, not step time.
Previously the CLI logged "quantized model detected" without mentioning
this limitation, which was misleading. Now emits an explicit warning in
both run and replay paths.

Addresses review item I4 from PR inference-sim#698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address cosmetic review comments

- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

* Address PR inference-sim#698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue inference-sim#1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

* feat(sim): add L40S GPU and FP8 compute support

- Add L40S hardware configuration (362.05 TFLOPS, 48GB, 0.864 TB/s)
- Add TFlopsFP8 field to HardwareCalib for native FP8 tensor core support
- Update H100 with TFlopsFP8=1979.0 (2× FP16 rate) and adjusted MFU values
- Update A100-SXM and A100-80 with TFlopsFP8=0 (no native FP8 support)
- Implement FP8 compute selection in roofline model based on weight precision
- Add comprehensive tests for FP8 compute selection logic

Fixes inference-sim#762

Co-Authored-By: Claude <[email protected]>

* docs: add MFU justification and validation tests

- Add inline documentation to hardware_config.json linking to Discussion inference-sim#589
- Add comprehensive MFU validation tests in sim/config_test.go
  - Validates MFU ranges (0 < MFU < 1)
  - Validates MfuDecode < MfuPrefill relationship
  - Tests all GPU types (H100, A100, L40S)
- Update docs/reference/models.md with MFU calibration info box
- All tests pass

Addresses review findings from quick-review

* feat(latency): add TFlopsFP8 validation (R3)

Add validation for HardwareCalib.TFlopsFP8 in ValidateRooflineConfig,
following the same optional-field pattern as MemoryGiB: check if non-zero,
then reject NaN/Inf/negative values.

Includes test coverage for NaN and negative TFlopsFP8 cases.

* fix: address review nits

- roofline.go:217: Fix comment wording from 'upcasted' to 'dequantized to FP16 during GEMM'
- hardware_config.json:8: Restore H100 mfuDecode to 0.30 for consistency with other entries

* docs(roofline): clarify FP8 compute rate selection logic

Improve comment to explicitly document that == 1.0 is intentional:
- FP8 models (exactly 1.0 byte/param) use FP8 compute rate on H100
- Sub-FP8 formats (e.g., W4A16 at 0.5 bytes/param) dequantize to FP16 during GEMM

This addresses the review comment about == 1.0 exact equality. The behavior
is correct: only true FP8 models use FP8 tensor cores. W4A16 and other
sub-FP8 formats use FP16 compute after dequantization, as validated by
TestRooflineStepTime_FP8ComputeSelection_EdgeCases.
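The exact-equality rule documented in this commit can be sketched as a small selector. Names and the H100/A100 numbers are taken from the commit messages above (TFlopsFP8=1979.0 is stated as 2x the FP16 rate); the function itself is an illustrative stand-in for the roofline code path.

```go
package main

import "fmt"

// selectComputeTFlops applies the rule described above: only a model at
// exactly 1.0 byte/param (true FP8) uses the FP8 tensor-core rate, and only
// when the GPU advertises one (tflopsFP8 > 0). Sub-FP8 formats such as
// W4A16 (0.5 bytes/param) dequantize to FP16 during GEMM, so they keep the
// FP16 rate.
func selectComputeTFlops(weightBytes, tflopsFP16, tflopsFP8 float64) float64 {
	if weightBytes == 1.0 && tflopsFP8 > 0 {
		return tflopsFP8
	}
	return tflopsFP16
}

func main() {
	// H100-like: FP8 rate is double the FP16 rate.
	fmt.Println(selectComputeTFlops(1.0, 989.5, 1979.0)) // FP8 model: FP8 rate
	fmt.Println(selectComputeTFlops(0.5, 989.5, 1979.0)) // W4A16: FP16 rate
	// A100-like: TFlopsFP8=0 means no native FP8 support.
	fmt.Println(selectComputeTFlops(1.0, 989.5, 0)) // falls back to FP16
}
```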

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
@susiejojo deleted the feat/443-quantized-weight-bytes-per-param branch March 20, 2026 18:17
sriumcp added a commit that referenced this pull request Mar 20, 2026
* feat(infra): Phase 1A — nodes, GPUs, and instance lifecycle

Add node/GPU placement, instance lifecycle management, multi-model
routing, and per-model metrics to the cluster simulator.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* chore: add AGENTS.md to .gitignore

- AGENTS.md is a generated file for agent context
- Should not be committed to the repository

* fix(cluster): prevent double-counting in drain redirect policy

Fixes INV-1 conservation violation when DrainRedirect policy re-injects
queued requests. Previously, redirected requests were counted twice in
CompletedRequests: once when initially injected and again when completed
after redirection.

Changes:
- Add Request.Redirected field to track re-injected requests
- Mark requests as redirected in drainRedirect.Drain() before re-injection
- Skip CompletedRequests increment in recordRequestCompletion() for redirected requests
- Add TestInstanceLifecycle_RedirectDrainPreservesConservation to verify fix

The fix ensures INV-1 (request conservation) holds:
  injected = completed + queued + running + dropped + timed_out

Addresses PR #697 review feedback on drain redirect conservation.
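The INV-1 invariant stated above can be expressed as a one-line check. The struct and field names here are illustrative, not the simulator's actual metrics types; the point is that counting a redirected request's completion at both source and destination inflates the right-hand side.

```go
package main

import "fmt"

// counters holds the quantities in INV-1 (illustrative field names).
type counters struct {
	injected, completed, queued, running, dropped, timedOut int
}

// checkConservation verifies INV-1:
// injected = completed + queued + running + dropped + timed_out.
func checkConservation(c counters) bool {
	return c.injected == c.completed+c.queued+c.running+c.dropped+c.timedOut
}

func main() {
	ok := counters{injected: 10, completed: 7, queued: 2, running: 1}
	fmt.Println(checkConservation(ok)) // holds
	// A redirected request counted at both source and destination bumps
	// completed twice, breaking the invariant.
	double := counters{injected: 10, completed: 8, queued: 2, running: 1}
	fmt.Println(checkConservation(double)) // violated
}
```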

* fix: address PR #697 review comments

- Document DrainPolicy as Phase 1C infrastructure
- Fix warm-up request overcounting by checking warmUpRemaining
- Fix drain callback memory leak in MarkNodeTerminated
- Add named constants for lifecycle event priorities
- Clarify drainWait GPU release and swap-remove logic
- Add .bob/ to .gitignore

* Fix warm-up TTFT penalty implementation

- Initialize warmUpRemaining for all instances in backward-compat mode
- Fix indentation in cluster_event.go
- Update warm-up recording to track first N requests
- Adjust test expectations to account for queueing effects

Fixes build and test failures in PR.

* chore: update .gitignore to ignore .bob/notes instead of entire .bob/ directory

* feat(latency): decouple quantized weight precision from compute dtype (#698)


* feat(sim): add L40S GPU and FP8 compute support (#765)


* fix(cluster): address sriumcp review — conservation, replay anomaly, non-variadic CollectRawMetrics

- sim/simulator.go: Remove incorrect `if !req.Redirected` guard on CompletedRequests++.
  The guard caused redirected requests to vanish from INV-1 accounting: source's
  InjectedRequests=0 (drained from WaitQ before completion), destination's
  InjectedRequests=0 (skipped CompletedRequests). Destination is the sole
  completion site so incrementing there preserves conservation.

- cmd/replay.go: Add `|| rawMetrics.RoutingRejections > 0` to anomaly condition.
  Clusters where all failures are routing rejections (no routable instances)
  silently omitted the anomaly summary block (I3 from sriumcp review).

- sim/cluster/metrics.go: Make CollectRawMetrics routingRejections parameter
  explicit (non-variadic). Prevents call sites from silently passing 0 and
  missing routing rejections. Updated all test call sites to pass 0 explicitly.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* revert(ci): remove skip-cache from golangci-lint step

Unrelated to Phase 1A changes. skip-cache: true was added during
development but should not be merged to main.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* fix(cluster): address sriumcp Round 2 — test, stale comment, duplicate Metrics.Requests

- sim/cluster/instance_lifecycle_test.go: Rewrite conservation test to
  actually seed requests into inst0's WaitQ before drain. Previous version
  used a manual event loop on an empty queue — DrainWaitQueue() returned []
  and no redirection ever occurred. New version uses inst0.sim.EnqueueRequest
  directly + empty workload so Run() doesn't push duplicate arrivals. Also
  adds pre/post assertions: QueueDepth==0, inFlightRequests==0, and
  clusterEvents non-empty after drain.

- sim/request.go: Update Redirected field comment to reflect actual behavior.
  Previous comment said "completion accounting is skipped" — opposite of what
  simulator.go:recordRequestCompletion now does.

- sim/cluster/infra_lifecycle_event.go: Delete stale Metrics.Requests entry
  for redirected requests before re-injection. Source registered the request
  at EnqueueRequest time; DrainWaitQueue empties WaitQ but left the entry.
  Destination re-registers on re-enqueue, causing a spurious "duplicate
  request ID" WARN in aggregateMetrics() for every redirected request.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* docs: address sriumcp documentation gaps (GAP 1-4)

GAP 1 — configuration.md: Add node_pools and instance_lifecycle YAML
schema to the Policy Bundle section so users can discover and configure
Phase 1A features. Both are YAML-only (no CLI flags). Add a note block
explaining backward compatibility. Update DeploymentConfig row in the
config-to-flag mapping table to note the YAML-only fields.

GAP 2 — results.md Anomaly Counters: Rename "Rejected Requests" to
"Rejected Requests (Admission)" to match actual CLI output label. Add
new "Rejected Requests (Routing)" row explaining when it fires (no
routable instances — all Loading/Draining) and the remediation action.

GAP 3 — results.md Per-Model Metrics: Change mean= to p50= in the
example output block to match printPerModelMetrics which uses m.TTFT.P50.
Add tok/s to the Throughput example line to match actual output format.

GAP 4 — results.md per_model JSON: Add table documenting the per_model
key in --results-path JSON output (omitted when no model tags present),
with field-by-field description.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

---------

Co-authored-by: tantawi <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Dipanwita Guhathakurta <[email protected]>
Co-authored-by: Srinivasan Parthasarathy <[email protected]>

Successfully merging this pull request may close these issues.

roofline: quantized model BytesPerParam detection uses torch_dtype instead of actual quantization bitwidth