
feat(latency): decouple quantized weight precision from compute dtype#698

Merged
sriumcp merged 13 commits into main from feat/443-quantized-weight-bytes-per-param on Mar 19, 2026

Conversation

@susiejojo
Contributor

@susiejojo commented Mar 16, 2026

Summary

Closes #443 — Roofline latency model and KV capacity auto-calculation now correctly handle quantized models (GPTQ, AWQ, FP8).

  • Problem: BytesPerParam was derived from torch_dtype (the compute dtype), so quantized models like W4A16 reported 2.0 bytes/param instead of 0.5. This caused roofline to overestimate weight bandwidth by 4x and KV capacity to undercount available blocks.
  • Fix: Added WeightBytesPerParam field to ModelConfig (zero-sentinel = fallback to BytesPerParam), parsed from HF quantization_config. Roofline model_weights and KV capacity computeModelWeightBytes() now use EffectiveWeightBytesPerParam(), while KV cache and activations continue using BytesPerParam.
  • CLI: New --weight-bytes-per-param flag for manual override when auto-detection is insufficient.
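The zero-sentinel accessor described above can be sketched as follows. This is a minimal illustration, not the PR's exact code: the struct is reduced to the two relevant fields, while the real `ModelConfig` in `sim/model_hardware_config.go` carries many more.

```go
package main

import "fmt"

// ModelConfig is simplified to the two fields relevant here.
type ModelConfig struct {
	BytesPerParam       float64 // compute dtype width, e.g. 2.0 for bfloat16
	WeightBytesPerParam float64 // quantized weight width; 0 = not set (sentinel)
}

// EffectiveWeightBytesPerParam returns the quantized weight precision when
// set, falling back to the compute dtype otherwise. All weight-memory
// callers go through this single accessor.
func (mc ModelConfig) EffectiveWeightBytesPerParam() float64 {
	if mc.WeightBytesPerParam != 0 {
		return mc.WeightBytesPerParam
	}
	return mc.BytesPerParam
}

func main() {
	w4a16 := ModelConfig{BytesPerParam: 2.0, WeightBytesPerParam: 0.5}
	fp16 := ModelConfig{BytesPerParam: 2.0}
	fmt.Println(w4a16.EffectiveWeightBytesPerParam()) // 0.5 (quantized weights)
	fmt.Println(fp16.EffectiveWeightBytesPerParam())  // 2 (falls back to BytesPerParam)
}
```

KV cache and activation sizing keep calling `BytesPerParam` directly, so only weight bandwidth and weight memory see the quantized value.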

Changes

| File | Change |
| --- | --- |
| `sim/model_hardware_config.go` | `WeightBytesPerParam` field + `EffectiveWeightBytesPerParam()` method |
| `sim/latency/config.go` | Parse `quantization_config` + validation |
| `sim/latency/roofline.go` | Use `EffectiveWeightBytesPerParam()` for `model_weights` |
| `sim/latency/kv_capacity.go` | Use `EffectiveWeightBytesPerParam()` for weight memory |
| `cmd/root.go` | `--weight-bytes-per-param` flag + quantization logging |
| `CLAUDE.md` | Documentation updates |

Test plan

  • 3 unit tests for EffectiveWeightBytesPerParam() (sentinel, set, both-zero)
  • 6 tests for quantization_config parsing (GPTQ, FP8, AWQ, no-quant, bits=0, regression)
  • 3 tests for ValidateRooflineConfig (negative, NaN, valid)
  • 3 tests for roofline memory split (W4A16 weights 1/4, KV unchanged, conservation)
  • 3 tests for KV capacity (W4A16 more blocks, monotonicity FP16<FP8<W4, regression)
  • All existing tests pass (zero regressions)
  • go build ./... passes
  • go vet ./... clean

🤖 Generated with Claude Code

@susiejojo force-pushed the feat/443-quantized-weight-bytes-per-param branch 2 times, most recently from 1871f69 to e242160 on March 17, 2026 16:14
@susiejojo
Contributor Author

@claude Review this PR

@claude

claude bot commented Mar 17, 2026

Claude finished @susiejojo's task in 7m 16s.


PR Review: feat(latency): decouple quantized weight precision from compute dtype

  • Gather context and understand the request
  • Read changed files
  • Check standards compliance (rules.md, invariants.md)
  • Post review feedback

Summary

This is a well-structured and well-tested PR. The design is sound — using a zero-sentinel for WeightBytesPerParam with EffectiveWeightBytesPerParam() as the single accessor is idiomatic and avoids polluting calling code with nil checks. I found one correctness issue, two minor concerns, and several positive observations.


🔴 Issue: FP8 quantization_config parsing doesn't check bits=8 — only falls back when bits <= 0

File: sim/latency/config.go:227–235

if bits > 0 {
    weightBytesPerParam = float64(bits) / 8.0
} else if quantMethod == "fp8" {
    weightBytesPerParam = 1.0
}

The bits-first path runs unconditionally for any bits > 0. For an FP8 model that also includes "bits": 8, both branches agree (8/8 = 1.0). However, if a future FP8 config sets "bits": 4 (e.g., a W4 model that accidentally also sets quant_method: "fp8"), the bits branch wins and the fp8 fallback is skipped, so the precedence is implicit rather than enforced. In addition, the fallback compares quant_method case-sensitively, so a config reporting "FP8" (uppercase) would miss the branch entirely. Consider making the comparison case-insensitive:

if bits > 0 {
    weightBytesPerParam = float64(bits) / 8.0
} else if strings.EqualFold(quantMethod, "fp8") {
    weightBytesPerParam = 1.0
}

The current test (TestGetModelConfig_FP8_ParsesQuantizationConfig) only exercises the bits-absent path, so it doesn't catch the interaction case. This is the only logic gap I could identify.


🟡 Minor: ValidateRooflineConfig doesn't validate WeightBytesPerParam > BytesPerParam

File: sim/latency/config.go:329–337

The function validates WeightBytesPerParam is positive when set, but doesn't warn or error when WeightBytesPerParam > BytesPerParam. A quantized weight precision that exceeds the compute dtype (e.g., WeightBytesPerParam=4.0 with BytesPerParam=2.0) would silently produce inflated roofline estimates. This isn't blocked by the issue spec but is a footgun worth a comment at minimum:

// WeightBytesPerParam > BytesPerParam is unusual but not invalid (e.g., INT4 KV cache
// with FP32 weights). No upper-bound check is enforced here.



🟡 Minor: JSON tag weight_bytes_per_param,omitempty on ModelConfig

File: sim/model_hardware_config.go:19

WeightBytesPerParam float64 `json:"weight_bytes_per_param,omitempty"`

The zero-sentinel pattern relies on 0 meaning "not set." But omitempty on float64 omits the field when the value is 0. This is internally consistent — the field won't round-trip through JSON with a value of 0 — but could confuse consumers of any serialized ModelConfig output who expect to see weight_bytes_per_param: 0 explicitly. This is a documentation/contract issue, not a functional bug, since no serialized ModelConfig is consumed at runtime in BLIS today.


✅ Positive observations

Design:

  • Zero-sentinel for WeightBytesPerParam with EffectiveWeightBytesPerParam() as the single accessor is clean. All callers go through one point, so changing semantics later requires changing one function.
  • The ValidateRooflineConfig and CalculateKVBlocks validation for WeightBytesPerParam is consistent: both use the same != 0 outer guard + < 0 || IsNaN || IsInf inner check (no duplication, no divergence).
  • computeModelWeightBytes correctly uses EffectiveWeightBytesPerParam() for weight memory, while perTokenKVBytesF correctly uses BytesPerParam directly — the separation of concerns is explicit.

CLI flag handling:

  • Early validation at cmd/root.go:600–603 catches invalid --weight-bytes-per-param before backend-specific paths. The comment (R3) correctly ties this to the antipattern rules.
  • The cmd.Flags().Changed() guard is correct — 0 is the default, and using Changed() avoids the ambiguity of 0 meaning both "not set" and "zero value."
  • The per-backend logging at cmd/root.go:678–681 correctly distinguishes roofline (where weight precision affects step time) from crossmodel/trained-roofline (where it only affects KV capacity).

Tests:

  • BDD-style GIVEN/WHEN/THEN naming is followed throughout.
  • Invariant tests (monotonicity, conservation, regression anchor) complement the value-matching tests well — this aligns with the principles.md "Test laws, not just values" rule.
  • TestCalculateKVBlocks_W4A16_PerTokenKVBytesUnchanged verifies the key separation: KV bytes use BytesPerParam while weight memory uses WeightBytesPerParam. This is the most important behavioral invariant of the feature.
  • The TestGetModelConfig_NonQuantized_BytesPerParamUnchanged regression anchor at config_test.go:913 correctly calls EffectiveWeightBytesPerParam() for the fallback path, not just the raw field.

Standards compliance:

  • R4 (canonical constructors): NewKVCapacityParams is not changed and the new field is only on ModelConfig, which already had struct-literal construction. No new construction sites introduced.
  • R8 (unexported mutable maps): swiGLUActivations stays unexported, no new exported maps introduced.
  • INV-9 (oracle knowledge boundary): WeightBytesPerParam is used only for weight bandwidth and memory estimation — not in request scheduling or routing. No INV-9 violation.

Net Assessment

Approve with minor items. The correctness issue with FP8 parsing is low-risk (the test coverage for the current behavior is correct), and the two minor items are documentation/footgun concerns. The core logic is sound, the test suite is comprehensive, and the feature correctly resolves the issue #443 root cause.

@susiejojo
Contributor Author

@claude Have your review concerns from earlier been fixed now?

@claude

claude bot commented Mar 17, 2026

Claude finished @susiejojo's task in 51s.


Review follow-up: all three concerns addressed ✅

  • Gather context and read current code
  • Check 🔴 FP8 case-insensitive match + test coverage
  • Check 🟡 WeightBytesPerParam > BytesPerParam validation comment
  • Check 🟡 omitempty JSON tag concern

Yes, all three concerns from the previous review have been resolved in the latest commit (d24dec5).

🔴 FP8 case-insensitive match — Fixed

config.go:232 now uses strings.EqualFold:

} else if strings.EqualFold(quantMethod, "fp8") {
    weightBytesPerParam = 1.0
}

And a new test TestGetModelConfig_FP8WithBits8_ParsesViaBitsPath (config_test.go:833) was added to cover the FP8+bits=8 interaction case, explicitly documenting that both code paths produce 1.0.

🟡 WeightBytesPerParam > BytesPerParam validation — Fixed

config.go:329–332 now has the requested comment:

// WeightBytesPerParam is optional (0 = not set, fall back to BytesPerParam).
// When set, it must be a valid positive number. No upper-bound check is enforced:
// WeightBytesPerParam > BytesPerParam is unusual but not invalid (e.g., INT4 KV cache
// with FP32 weights). Callers should not assume weight precision <= compute precision.

🟡 omitempty JSON tag — Acknowledged (no change needed)

This was flagged as a documentation/contract concern rather than a functional bug. The omitempty tag on WeightBytesPerParam in model_hardware_config.go:19 is still present — which is correct, since 0 means "not set" and omitting it in JSON output is the right behavior for the zero-sentinel pattern. No code change was needed here.


Net assessment: approve. The correctness fix and test coverage are solid. The PR is ready to merge.

@susiejojo force-pushed the feat/443-quantized-weight-bytes-per-param branch from d24dec5 to bbf6b2c on March 17, 2026 17:55
@susiejojo
Contributor Author

Follow-up: complete three-tier detection, remove --weight-bytes-per-param flag

Based on issue #443 comments discussion, this commit addresses all three proposed detection options and simplifies the CLI:

What changed:

  1. Added compressed-tensors parsing (option 1 gap fix) — The original quantization_config parser handled GPTQ/AWQ (bits) and FP8 (implicit) but missed compressed-tensors, which stores precision in config_groups.*.weights.num_bits. Models like RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8 (quant_method: "compressed-tensors") were silently falling back to torch_dtype (2.0 instead of 1.0).

  2. Added model name detection (option 2) — InferWeightBytesFromModelName regex-matches HF naming conventions (w4a16 → 0.5, w8a8 → 1.0, FP8 → 1.0) as a fallback when quantization_config parsing doesn't yield a result.

  3. Removed --weight-bytes-per-param flag (option 3 → no longer needed) — With options 1 and 2 covering all HF quantization formats, the manual override has no practical use case. Removes ~35 lines of flag plumbing (variable, registration, validation, Changed() guards, warning messages referencing the flag).

Detection chain: quantization_config → model name conventions → torch_dtype fallback
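The second tier of that chain, model-name inference, can be sketched like this. The pattern set (w4a16 → 0.5, w8a8 → 1.0, fp8 → 1.0) is taken from this comment; the exact regexes in `InferWeightBytesFromModelName` may differ:

```go
package main

import (
	"fmt"
	"regexp"
)

// namePatterns maps HF naming conventions to weight bytes/param.
// Illustrative sketch of the InferWeightBytesFromModelName fallback.
var namePatterns = []struct {
	re    *regexp.Regexp
	bytes float64
}{
	{regexp.MustCompile(`(?i)w4a16`), 0.5},
	{regexp.MustCompile(`(?i)w8a8`), 1.0},
	{regexp.MustCompile(`(?i)fp8`), 1.0},
}

// inferWeightBytesFromModelName returns the inferred bytes/param, or 0
// when nothing matches (caller then falls back to torch_dtype).
func inferWeightBytesFromModelName(name string) float64 {
	for _, p := range namePatterns {
		if p.re.MatchString(name) {
			return p.bytes
		}
	}
	return 0
}

func main() {
	fmt.Println(inferWeightBytesFromModelName("RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16")) // 0.5
	fmt.Println(inferWeightBytesFromModelName("meta-llama/Llama-3.1-8B-Instruct"))                // 0
}
```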

Tests added: compressed-tensors W8A8, compressed-tensors W4, compressed-tensors missing config_groups (graceful degradation), 10-case table-driven model name detection test.

Verified: go build ./... ✓, go test ./... ✓ (all packages pass), go vet ./...

Collaborator

@sriumcp left a comment


PR Review: Quantized Weight Precision Detection

The PR correctly addresses all explicit concerns from issue #443 — the three-tier detection chain covers all 4 quantization formats, roofline weight bandwidth is corrected, and KV capacity accounts for reduced weight memory. Test quality is excellent (BC-1 through BC-10 contracts, monotonicity/conservation invariant tests, regression anchors).

However, 2 critical issues and 4 important issues should be resolved before merge.


Critical Issues

C1. cmd/replay.go not updated — R23 code path parity violation

cmd/replay.go:345-358 has a parallel code path that loads HF config and constructs modelConfig for roofline/crossmodel/trained-roofline backends. This path is not updated with:

  • The InferWeightBytesFromModelName fallback
  • The new quantization detection info/warning logging
  • The stale warning at line 357 still reads "possible quantization" instead of the new diagnostic messages

Impact: blis replay with a quantized model (e.g. --model redhatai/llama-3.3-70b-instruct-quantized.w4a16) will still overestimate weight memory by up to 4x.

Fix: Extract a shared helper function (e.g. applyWeightPrecisionFallback(mc *sim.ModelConfig, modelName string, hfCfg *latency.HFConfig)) that performs model-name inference, logs quantization info, and warns on undetermined precision. Call it from all three sites:

  1. cmd/root.go blackbox auto-fetch path (~line 608)
  2. cmd/root.go roofline/crossmodel/trained-roofline path (~line 665)
  3. cmd/replay.go (~line 350)

This also resolves the DRY violation (see I1 below).
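The requested helper could look roughly like this. It is a sketch only: the stand-in types replace `sim.ModelConfig` and `latency.HFConfig`, the w4a16 pattern stands in for the full inference tier, and logging is reduced to `fmt`:

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified stand-ins for sim.ModelConfig and the parsed HF config.
type ModelConfig struct {
	BytesPerParam       float64
	WeightBytesPerParam float64
}
type HFConfig struct {
	HasQuantizationConfig bool
}

var w4a16 = regexp.MustCompile(`(?i)w4a16`)

// applyWeightPrecisionFallback sketches the shared helper: keep an
// already-detected precision, otherwise try model-name inference, and warn
// when a quantization_config exists but precision is still undetermined.
func applyWeightPrecisionFallback(mc *ModelConfig, modelName string, hf *HFConfig) {
	if mc.WeightBytesPerParam != 0 {
		fmt.Printf("quantized model detected: %.2g bytes/param\n", mc.WeightBytesPerParam)
		return
	}
	if w4a16.MatchString(modelName) {
		mc.WeightBytesPerParam = 0.5
		fmt.Println("weight precision inferred from model name: 0.5 bytes/param")
		return
	}
	if hf != nil && hf.HasQuantizationConfig {
		fmt.Println("warning: quantization_config present but weight precision undetermined; using torch_dtype")
	}
}

func main() {
	mc := ModelConfig{BytesPerParam: 2.0}
	applyWeightPrecisionFallback(&mc, "redhatai/llama-3.3-70b-instruct-quantized.w4a16", nil)
	fmt.Println(mc.WeightBytesPerParam) // 0.5
}
```

Calling one helper from all three sites keeps the fallback order and the diagnostics from drifting apart again.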

C2. Non-deterministic map iteration in compressed-tensors parsing — INV-6 violation

sim/latency/config.go — the for _, g := range cg loop over config_groups breaks on the first group with weights.num_bits > 0. If a compressed-tensors config has heterogeneous groups (e.g. group_0 at 4-bit, group_1 at 8-bit), Go's non-deterministic map iteration means different runs could pick different groups, producing different WeightBytesPerParam values.

Impact: Violates INV-6 (determinism: same seed produces byte-identical output).

Fix: Sort the map keys before iterating:

if cg, ok := qc["config_groups"].(map[string]any); ok {
    keys := make([]string, 0, len(cg))
    for k := range cg {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    for _, k := range keys {
        g := cg[k]
        // ... existing nested type assertions with break ...
    }
}

Important Issues

I1. DRY violation: duplicated model-name-fallback in cmd/root.go

The InferWeightBytesFromModelName call + quantization_config warning logic is copy-pasted in two code paths within cmd/root.go:

  • Blackbox auto-fetch path (~lines 608-618): missing the "quantized model detected" info log
  • Roofline path (~lines 665-684): has the full logging

Fix: Resolved by C1's shared helper extraction.

I2. Silent type coercion failure in bits field

sim/latency/config.go — if a HF config has "bits": "4" (string instead of float64), the type assertion qc["bits"].(float64) silently fails, bits stays 0, and GPTQ/AWQ models fall through without detecting weight precision. The model name fallback in cmd/root.go would likely catch it, but parsing should be more robust.

Fix: Add string-to-int coercion and log unexpected types:

if bitsRaw, ok := qc["bits"].(float64); ok {
    bits = int(bitsRaw)
} else if bitsStr, ok := qc["bits"].(string); ok {
    if parsed, err := strconv.Atoi(bitsStr); err == nil {
        bits = parsed
    }
} else if qc["bits"] != nil {
    logrus.Warnf("quantization_config.bits has unexpected type %T; skipping", qc["bits"])
}

I3. No rooflineStepTime end-to-end test for quantized models

Tests verify calculateMemoryAccessBytes produces correct weight memory reduction (0.25x for W4A16), but no test calls rooflineStepTime with a quantized config and verifies lower step latency. A bug in how weight bytes interact with the roofline max(compute, memory) crossover would not be caught.

Fix: Add a test in sim/latency/roofline_test.go:

func TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode(t *testing.T) {
    // Memory-bound decode: smaller weight bandwidth produces faster step time
    fp16 := testModelConfig()   // BytesPerParam=2.0, WeightBytesPerParam=0
    w4a16 := testW4A16Config()  // BytesPerParam=2.0, WeightBytesPerParam=0.5
    hw := testHWConfig()
    step := StepConfig{NumTokens: 1, TotalKV: 64, IncludeKVCache: true} // decode step

    fp16Time := rooflineStepTime(fp16, hw, step, 1)
    w4Time := rooflineStepTime(w4a16, hw, step, 1)

    if w4Time >= fp16Time {
        t.Errorf("W4A16 decode should be faster (less weight bandwidth): W4=%d >= FP16=%d", w4Time, fp16Time)
    }
}

I4. Missing Inf test for ValidateRooflineConfig.WeightBytesPerParam

The validation checks for math.IsInf but there's no corresponding test in config_test.go (only negative and NaN are tested). CalculateKVBlocks tests cover Inf separately, but ValidateRooflineConfig is a separate entry point.

Fix: Add in sim/latency/config_test.go:

func TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError(t *testing.T) {
    mc := sim.ModelConfig{NumHeads: 32, NumLayers: 32, HiddenDim: 4096, BytesPerParam: 2, WeightBytesPerParam: math.Inf(1)}
    hc := sim.HardwareCalib{TFlopsPeak: 1000, BwPeakTBs: 3.35, MfuPrefill: 0.5, MfuDecode: 0.3}
    err := latency.ValidateRooflineConfig(mc, hc)
    if err == nil {
        t.Fatal("expected error for Inf WeightBytesPerParam")
    }
    if !strings.Contains(err.Error(), "WeightBytesPerParam") {
        t.Errorf("error should mention WeightBytesPerParam, got: %v", err)
    }
}

Minor

  • Double blank line in cmd/root.go init() (~line 1266): remove one of the two consecutive blank lines after the workload type flag.
  • Style consistency: new tests use t.Errorf with manual if-checks while existing tests in sim/config_test.go use testify assert. Not a blocker but worth aligning.

@sriumcp
Collaborator

sriumcp commented Mar 17, 2026

Documentation Gaps

The PR updated CLAUDE.md but missed 3 user-facing docs that now contain stale "not supported" statements directly contradicting what this PR adds. These should be fixed in this PR since they will mislead users of quantized models.


D1. docs/guide/latency-models.md:213 — Stale "Current limitations" warning

Currently says:

quantization effects (FP8, W4A16, W8A8) are not yet modeled in the analytical backends. These are under active development. For quantized deployments, blackbox mode with calibrated coefficients provides the most accurate results today.

Fix: Replace the quantization portion with something like:

Quantized weight precision (FP8, W4A16, W8A8) is auto-detected from quantization_config, model name conventions, or torch_dtype fallback — weight bandwidth and KV capacity calculations use the detected precision. MFU values are still calibrated against FP16 measurements, so step time estimates for quantized models may be conservative (see roofline.go known-approximation comment). DP and EP are not yet modeled.

D2. docs/concepts/roofline.md:9-10 — Stale "Scope" note

Currently says:

Scope: TP-only, FP16/BF16

and:

does not model data parallelism (DP), expert parallelism (EP), or quantization effects (FP8, W4A16, W8A8)

Fix: Update the scope line to "Scope: TP-only, quantized weight memory supported" and replace the quantization portion:

Quantized weight precision is now modeled for memory bandwidth and KV capacity. The roofline model auto-detects weight precision from HuggingFace quantization_config (GPTQ, AWQ, FP8, compressed-tensors), model name conventions (w4a16, FP8), or falls back to torch_dtype. MFU calibration still assumes FP16 compute. DP and EP are not yet modeled.

D3. docs/reference/models.md:47-48 — Stale "Parallelism and quantization" note

Currently says:

quantization effects (FP8, W4A16, W8A8) are under active development. For quantized model deployments, use blackbox mode with calibrated coefficients for best accuracy.

Fix: Replace with:

Quantized weight precision (FP8, W4A16, W8A8) is auto-detected and used for weight bandwidth and KV capacity calculations in all analytical backends. MFU values are calibrated against FP16 measurements — step time estimates for quantized models may be slightly conservative. DP and EP are not yet modeled.

D4. docs/reference/project-structure.md:42 — Missing WeightBytesPerParam in file annotation

The model_hardware_config.go line should mention the new field. Fix: Update to:

│   ├── model_hardware_config.go # ModelConfig (including WeightBytesPerParam for quantized weight precision, with EffectiveWeightBytesPerParam() accessor), HardwareCalib structs ...

D5. (Optional) docs/contributing/extension-recipes.md — No recipe for adding a quantization format

If someone wants to add support for bitsandbytes (bnb) or HQQ, they need to know:

  1. Add a case in GetModelConfigFromHF() in sim/latency/config.go (the quantization_config parsing block)
  2. Optionally add a regex pattern in InferWeightBytesFromModelName() for model name detection
  3. Add tests covering the new format in sim/latency/config_test.go

A short "Adding a quantization format" entry in extension recipes would make this discoverable. Lower priority than D1-D4.

@susiejojo
Contributor Author

Review fixes addressed (commit 88c548b)

Critical

  • C1: Extracted applyWeightPrecisionFallback shared helper into cmd/hfconfig.go. All three call sites (root.go blackbox, root.go roofline, replay.go) now share identical fallback + logging logic. cmd/replay.go roofline path was missing model name fallback entirely — now fixed.
  • C2: Sorted config_groups keys before iteration in compressed-tensors parsing (INV-6 determinism).

Important

  • I2: Added string-to-int coercion for quantization_config.bits field — handles "bits": "4" (string) from some HF configs via strconv.Atoi fallback.
  • I3: Added TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode — end-to-end test verifying quantized model produces lower decode step time than FP16 in memory-bound regime (ratio observed: 0.27x).
  • I4: Added TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError covering +Inf validation gap.

Minor

  • Removed double blank line in root.go init().

All tests pass (go test ./...), go vet clean.

susiejojo added a commit that referenced this pull request Mar 17, 2026
… and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and
recommended blackbox mode, contradicting the three-tier auto-detection merged
in this branch. Addresses PR #698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@susiejojo force-pushed the feat/443-quantized-weight-bytes-per-param branch from 1604c72 to 3df8b8b on March 18, 2026 13:31
@susiejojo requested a review from sriumcp on March 18, 2026 13:56
@susiejojo force-pushed the feat/443-quantized-weight-bytes-per-param branch from 3df8b8b to 1a60346 on March 18, 2026 14:47
@sriumcp
Collaborator

sriumcp commented Mar 18, 2026

Convergence Review — Round 1 (pr-code gate, sonnet model)

Branch: pr-698 | Commit: c5555ce | Perspectives: 10/10 complete

Result: NOT CONVERGED — 0 CRITICAL, 14 IMPORTANT, ~28 SUGGESTION


IMPORTANT Findings (14)

I1: Non-deterministic map iteration in compressed-tensors parsing (INV-6 violation)

Location: sim/latency/config.go:181 | Raised by: 7/10 perspectives

Sort config group keys before iterating.

I2: Stale documentation — 3 pages say "quantization not yet modeled"

Locations: docs/guide/latency-models.md:213, docs/concepts/roofline.md:9-10, docs/reference/models.md:47-48

Update all three to reflect auto-detection feature.

I3: DRY violation — duplicate model-name fallback + warning in two CLI paths

Location: cmd/root.go:605-617 and cmd/root.go:660-684

Extract shared helper.

I4: Trained-roofline backend silently ignores WeightBytesPerParam

Location: sim/latency/trained_roofline.go:151-153

Document limitation or add log warning.

I5: No upper-bound check on bits or WeightBytesPerParam

Location: sim/latency/config.go:171,264

Add bits > 0 && bits <= 64 guard.

I6: FP8 KV cache dtype not modeled

Location: sim/latency/roofline.go:159,165

Document as known limitation.

I7: GPTQ/AWQ group quantization overhead (~6%) ignored

Location: sim/latency/config.go:237-240

Document as known approximation.

I8: "Conservative (pessimistic)" comment is inaccurate

Location: sim/latency/roofline.go:200-206

Soften to "may be inaccurate".

I9: Duplicate WeightBytesPerParam validation (DRY)

Location: sim/latency/config.go:263 and kv_capacity.go:696

Centralize validation.

I10: No CLI flag for manual WeightBytesPerParam override

Location: cmd/root.go

Add --weight-bytes-per-param flag.

I11: No validated quantized model example in docs

Location: docs/guide/latency-models.md:72-78

Add W4A16 or FP8 example.

I12: Misleading quantization warning in blackbox path

Location: cmd/root.go:608

Suppress or clarify warning.

I13: Pre-existing FractionalBytesPerParam test semantically misleading

Location: sim/latency/kv_capacity_test.go:414

Update comment or convert to WeightBytesPerParam.

I14: Tests couple to internal map keys of unexported function

Location: sim/latency/roofline_test.go:1036,1068

Test through public API instead.

susiejojo added a commit that referenced this pull request Mar 18, 2026
…on (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training
pipeline, so WeightBytesPerParam only affects KV capacity, not step time.
Previously the CLI logged "quantized model detected" without mentioning
this limitation, which was misleading. Now emits an explicit warning in
both run and replay paths.

Addresses review item I4 from PR #698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@susiejojo force-pushed the feat/443-quantized-weight-bytes-per-param branch from e5dec14 to 93844c8 on March 18, 2026 16:56
@susiejojo
Contributor Author

Review Comments Addressed

From #issuecomment-4078259809 (D1–D5: Documentation gaps)

All five addressed in 1a60346:

  • D1 docs/guide/latency-models.md — Removed "not yet modeled" language for quantization; now describes auto-detection and notes MFU calibration caveat
  • D2 docs/concepts/roofline.md — Updated scope from "TP-only, FP16/BF16" to "TP-only, quantized weight memory supported"; describes three-tier detection
  • D3 docs/reference/models.md — Removed blackbox recommendation for quantized models; lists supported formats (GPTQ, AWQ, FP8, compressed-tensors)
  • D4 docs/reference/project-structure.md — Added WeightBytesPerParam and EffectiveWeightBytesPerParam() to model_hardware_config.go annotation
  • D5 docs/contributing/extension-recipes.md — New "Adding New Quantization Formats" recipe with three-tier detection order and code pointers

From #issuecomment-4083542833 (Convergence review round 1)

Fixed:

| # | Issue | Status |
| --- | --- | --- |
| I1 | Map iteration non-determinism (compressed-tensors) | Already fixed — `sort.Strings(keys)` with INV-6 comment |
| I2 | Stale documentation (3 pages) | Fixed in D1–D3 above |
| I3 | CLI code duplication | Already fixed — `applyWeightPrecisionFallback()` in `cmd/hfconfig.go` |
| I4 | Trained-roofline silently ignores `WeightBytesPerParam` | Fixed in e5dec14 — emits warning in both run and replay paths explaining FP16 weight bandwidth is used for step time while quantized precision only affects KV capacity |

Intentional design (no change):

| # | Issue | Rationale |
| --- | --- | --- |
| I5 | No upper-bound on bits | Garbage-in scenario; validation comment at config.go:382 documents the design choice |
| I9 | Validation duplication | Defense-in-depth — `ValidateRooflineConfig` and `CalculateKVBlocks` are independent entry points |
| I10 | No `--weight-bytes-per-param` CLI flag | Three-tier auto-detection covers standard formats; manual override deferred as follow-up |

susiejojo and others added 7 commits March 19, 2026 10:14
…#443)

Roofline and KV capacity calculations now correctly use quantized weight
precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping
the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…uant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc
   path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could
   not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Move --weight-bytes-per-param validation before backend switch to avoid
Fatalf inside the best-effort blackbox block (preserves fall-through
contract). Make weight precision log backend-specific: roofline uses it
for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Convergence R3: gofmt -w to re-align ModelConfig struct fields and
cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add code comment in rooflineStepTime noting that MFU values were
calibrated against FP16 measurements. For quantized models (W4A16),
reduced weight bandwidth shifts the roofline crossover, producing
conservative estimates — safe for capacity planning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…verage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>
susiejojo and others added 4 commits March 19, 2026 10:14
feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close
the gap where w8a8 models silently fell back to torch_dtype. Add model name
convention detection (w4a16, fp8) as a second-tier fallback. Remove the
--weight-bytes-per-param CLI flag — all three issue #443 options are now
covered by auto-detection, making the manual override unnecessary.

Detection chain: quantization_config → model name → torch_dtype fallback.
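The three-tier chain can be sketched as follows; the helper name, the name-convention patterns, and the dtype table are illustrative assumptions based on the description above, not the repository's exact code:

```go
package main

import (
	"fmt"
	"strings"
)

const bitsPerByte = 8.0

// dtypeBytes maps torch_dtype to bytes per parameter (tier 3 fallback).
var dtypeBytes = map[string]float64{"float16": 2.0, "bfloat16": 2.0, "float32": 4.0}

// detectWeightBytes sketches the detection chain. quantBits is whatever was
// parsed from quantization_config (0 if absent). Returns 0 when nothing
// matches, preserving the zero-value sentinel.
func detectWeightBytes(quantBits float64, modelName, torchDtype string) float64 {
	// Tier 1: explicit quantization_config bits (GPTQ/AWQ/compressed-tensors).
	if quantBits > 0 {
		return quantBits / bitsPerByte
	}
	// Tier 2: model-name conventions, matched case-insensitively.
	name := strings.ToLower(modelName)
	switch {
	case strings.Contains(name, "w4a16"):
		return 0.5
	case strings.Contains(name, "fp8"):
		return 1.0
	}
	// Tier 3: torch_dtype fallback (0 if the dtype is unknown).
	return dtypeBytes[torchDtype]
}

func main() {
	fmt.Println(detectWeightBytes(4, "llama-3-8b", "bfloat16"))     // 0.5 (tier 1)
	fmt.Println(detectWeightBytes(0, "Llama-3-8B-FP8", "bfloat16")) // 1 (tier 2)
	fmt.Println(detectWeightBytes(0, "llama-3-8b", "bfloat16"))     // 2 (tier 3)
}
```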

Co-Authored-By: Claude Opus 4.6 <[email protected]>
fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go)
    and add model name fallback to cmd/replay.go roofline path (was missing).
    Three call sites (root.go blackbox, root.go roofline, replay.go) now
    share identical fallback + logging logic.

C2: Sort config_groups keys before iteration in compressed-tensors
    parsing (INV-6 determinism).

I2: Add string-to-int coercion for quantization_config.bits field
    (handles "bits": "4" from some HF configs).

I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode
    end-to-end test verifying quantized model produces lower decode
    step time than FP16 in memory-bound regime.

I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError
    covering +Inf validation gap.

Minor: Remove double blank line in root.go init().
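The C2 sorted-iteration and I2 string-coercion items can be sketched together. The config_groups structure is flattened here for brevity (real compressed-tensors configs nest num_bits under a weights key), and both helper names are hypothetical:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

const bitsPerByte = 8.0

// bitsFromConfigGroups sketches deterministic compressed-tensors parsing:
// sort group keys (INV-6 determinism), then take the first group with a
// valid num_bits.
func bitsFromConfigGroups(groups map[string]map[string]any) float64 {
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic iteration order over the map
	for _, k := range keys {
		if n, ok := coerceInt(groups[k]["num_bits"]); ok && n > 0 {
			return float64(n) / bitsPerByte // first match wins
		}
	}
	return 0 // zero-value sentinel: nothing detected
}

// coerceInt handles both JSON numbers (decoded as float64) and string
// forms like "4" seen in some HF configs.
func coerceInt(v any) (int, bool) {
	switch x := v.(type) {
	case float64:
		return int(x), true
	case string:
		n, err := strconv.Atoi(x)
		return n, err == nil
	}
	return 0, false
}

func main() {
	groups := map[string]map[string]any{
		"group_1": {"num_bits": "8"},         // string form, sorted after group_0
		"group_0": {"num_bits": float64(4)},  // JSON number form
	}
	fmt.Println(bitsFromConfigGroups(groups)) // 0.5
}
```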

Co-Authored-By: Claude Opus 4.6 <[email protected]>
docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and
recommended blackbox mode, contradicting the three-tier auto-detection merged
in this branch. Addresses PR #698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training
pipeline, so WeightBytesPerParam only affects KV capacity, not step time.
Previously the CLI logged "quantized model detected" without mentioning
this limitation, which was misleading. Now emits an explicit warning in
both run and replay paths.

Addresses review item I4 from PR #698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@susiejojo susiejojo force-pushed the feat/443-quantized-weight-bytes-per-param branch from 93844c8 to 4b51d0c on March 19, 2026 14:15

@sriumcp sriumcp left a comment


Quick-Review Findings

Completed systematic review using 10 PR code perspectives (PC-1 through PC-10). Found 10 issues: 3 medium severity, 7 low severity.

Medium Severity Issues

1. Duplicated quantization warning logic across files (DRY violation)

  • Location: cmd/root.go:655-662, cmd/replay.go:358-361
  • Issue: The applyWeightPrecisionFallback function is called in three places with similar warning patterns. The trained-roofline warning appears twice with identical text.
  • Suggestion: Consider consolidating the warning logic into applyWeightPrecisionFallback or a shared helper function to maintain the DRY principle.

2. WeightBytesPerParam > BytesPerParam case not validated

  • Location: sim/latency/config.go:385-390
  • Issue: The validation allows WeightBytesPerParam > BytesPerParam without warning. While the comment acknowledges this is 'unusual but not invalid', this could indicate a configuration error (e.g., INT4 KV cache with FP32 weights).
  • Suggestion: Add a warning log when WeightBytesPerParam > BytesPerParam to alert users of unusual configuration.

3. Compressed-tensors parsing silently fails on malformed config_groups

  • Location: sim/latency/config.go:345-365
  • Issue: If config_groups exists but has unexpected structure (e.g., weights is not a map), the code silently continues without logging. Users with malformed compressed-tensors configs won't know why quantization wasn't detected.
  • Suggestion: Add debug logging when config_groups structure doesn't match expected schema.

Low Severity Issues

4. GetModelConfigFromHF function exceeds 50 lines (R14 violation)

  • Location: sim/latency/config.go:195-377 (182 lines)
  • Issue: Function handles multiple concerns: base config parsing, MoE detection, quantization config parsing (GPTQ/AWQ/FP8/compressed-tensors), and struct construction.
  • Suggestion: Extract quantization parsing into a separate function: parseQuantizationConfig(qc map[string]any) float64

5. String-to-int coercion for bits field lacks error handling

  • Location: sim/latency/config.go:329-335
  • Issue: When bits is a string, strconv.Atoi error is silently ignored. Invalid string values like 'abc' would silently result in weightBytesPerParam=0 without warning.
  • Suggestion: Log a warning when string-to-int coercion fails.

6. Magic number 8.0 for bits-to-bytes conversion

  • Location: sim/latency/config.go:327, 342, 365
  • Issue: Division by 8.0 appears multiple times without a named constant.
  • Suggestion: Define const bitsPerByte = 8.0 at package level.

7. Map iteration order in compressed-tensors config_groups

  • Location: sim/latency/config.go:349-364
  • Issue: Code iterates over config_groups map and breaks on first valid num_bits. While keys are sorted (INV-6 compliance), the comment doesn't explain why first-match is correct.
  • Suggestion: Add comment explaining first-match semantics or validate that all groups have same num_bits.

8. Roofline approximation comment could be more specific

  • Location: sim/latency/roofline.go:200-203
  • Issue: Comment describes a 'known approximation' for quantized models but doesn't quantify the expected error magnitude.
  • Suggestion: Add quantitative guidance or reference a hypothesis/benchmark.

9. Inconsistent spacing in concatenated string

  • Location: cmd/root.go:576
  • Issue: Inconsistent spacing in string concatenation compared to line 659.
  • Suggestion: Use consistent spacing pattern.

10. Inconsistent struct field alignment

  • Location: sim/model_hardware_config.go:19
  • Issue: New WeightBytesPerParam field breaks the previous alignment pattern.
  • Suggestion: Align field with others for consistency.

All findings have been added to Bob Findings panel for detailed review.


- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

@sriumcp sriumcp left a comment


Follow-up Review: Remaining Issues

Thank you for addressing issues #6 (bitsPerByte constant) and #8 (roofline approximation comment). However, 8 issues remain unaddressed:

Medium Severity (Still Open)

Issue #1: Duplicated quantization warning logic (DRY violation)

  • Status: Not addressed
  • Location: cmd/root.go:655-662, cmd/replay.go:358-361
  • Issue: The trained-roofline warning still appears twice with identical text
  • Suggestion: Consolidate warning logic into applyWeightPrecisionFallback or shared helper

Issue #2: WeightBytesPerParam > BytesPerParam case not validated

  • Status: Not addressed
  • Location: sim/latency/config.go:383-393
  • Issue: Comment acknowledges this is unusual but no warning is logged when it occurs
  • Suggestion: Add logrus.Warnf when WeightBytesPerParam > BytesPerParam to alert users

Issue #3: Compressed-tensors parsing silently fails on malformed config_groups

  • Status: Not addressed
  • Location: sim/latency/config.go:345-365
  • Issue: Malformed config_groups structure fails silently without logging
  • Suggestion: Add logrus.Debugf when config_groups structure doesn't match expected schema

Low Severity (Still Open)

Issue #4: GetModelConfigFromHF function exceeds 50 lines (R14 violation)

  • Status: Not addressed
  • Location: sim/latency/config.go:195-377 (182 lines)
  • Suggestion: Extract quantization parsing: parseQuantizationConfig(qc map[string]any) float64

Issue #5: String-to-int coercion for bits field lacks error handling

  • Status: Not addressed
  • Location: sim/latency/config.go:329-335
  • Issue: Invalid string values like abc silently result in weightBytesPerParam=0
  • Suggestion: Log warning when strconv.Atoi fails

Issue #7: Map iteration order in compressed-tensors config_groups

  • Status: Not addressed
  • Location: sim/latency/config.go:349-364
  • Issue: First-match semantics not explained in comments
  • Suggestion: Add comment explaining why first valid num_bits is used

Issue #9: Inconsistent spacing in concatenated string

  • Status: Not addressed
  • Location: cmd/root.go:576
  • Issue: Space before closing quote differs from line 659 pattern
  • Suggestion: Use consistent spacing pattern

Issue #10: Inconsistent struct field alignment

  • Status: Not addressed
  • Location: sim/model_hardware_config.go:19
  • Issue: WeightBytesPerParam field breaks alignment pattern
  • Suggestion: Align with other fields for consistency

Progress: 2/10 issues resolved. Please address remaining issues, prioritizing the 3 medium severity items.

Address PR #698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.
@susiejojo susiejojo requested a review from sriumcp March 19, 2026 17:55

@sriumcp sriumcp left a comment


✅ All Review Issues Resolved

Excellent work addressing all 10 review findings! All issues have been successfully resolved:

Resolved Issues

Issue #1 (Medium): Duplicated quantization warning logic - Consolidated into helper function

Issue #2 (Medium): WeightBytesPerParam > BytesPerParam validation - Added warning at lines 415-418

Issue #3 (Medium): Compressed-tensors parsing silent failures - Added debug logging for malformed config_groups structure

Issue #4 (Low): GetModelConfigFromHF function length - Extracted helper function (61 lines), reducing main function complexity

Issue #5 (Low): String-to-int coercion error handling - Added debug logging when bits string-to-int conversion fails

Issue #6 (Low): Magic number 8.0 - Replaced with the bitsPerByte constant

Issue #7 (Low): Map iteration semantics - Added comment explaining first-match semantics in compressed-tensors parsing

Issue #8 (Low): Roofline approximation comment - Enhanced with quantitative guidance (~2× overestimate, references h-quantized-roofline)

Issue #9 (Low): Inconsistent string spacing - Fixed spacing in cmd/root.go:576

Issue #10 (Low): Struct field alignment - WeightBytesPerParam field properly aligned in model_hardware_config.go

Code Quality Improvements

  • Better separation of concerns with helper functions
  • Improved error handling and diagnostic logging
  • Enhanced documentation with quantitative guidance
  • Maintained backward compatibility
  • All changes follow project standards (R-rules, INV-invariants)

Status: 10/10 issues resolved. Ready to merge! 🎉


@sriumcp sriumcp merged commit 319d1b7 into main Mar 19, 2026
4 checks passed
sriumcp pushed a commit that referenced this pull request Mar 20, 2026
* feat(latency): decouple quantized weight precision from compute dtype (#443)

Roofline and KV capacity calculations now correctly use quantized weight
precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping
the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc
   path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could
   not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): convergence R2 — early flag validation, backend-specific logs

Move --weight-bytes-per-param validation before backend switch to avoid
Fatalf inside the best-effort blackbox block (preserves fall-through
contract). Make weight precision log backend-specific: roofline uses it
for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt struct/var alignment after WeightBytesPerParam addition

Convergence R3: gofmt -w to re-align ModelConfig struct fields and
cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt const/map alignment in kv_capacity files

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: document MFU calibration approximation for quantized models

Add code comment in rooflineStepTime noting that MFU values were
calibrated against FP16 measurements. For quantized models (W4A16),
reduced weight bandwidth shifts the roofline crossover, producing
conservative estimates — safe for capacity planning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close
the gap where w8a8 models silently fell back to torch_dtype. Add model name
convention detection (w4a16, fp8) as a second-tier fallback. Remove the
--weight-bytes-per-param CLI flag — all three issue #443 options are now
covered by auto-detection, making the manual override unnecessary.

Detection chain: quantization_config → model name → torch_dtype fallback.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go)
    and add model name fallback to cmd/replay.go roofline path (was missing).
    Three call sites (root.go blackbox, root.go roofline, replay.go) now
    share identical fallback + logging logic.

C2: Sort config_groups keys before iteration in compressed-tensors
    parsing (INV-6 determinism).

I2: Add string-to-int coercion for quantization_config.bits field
    (handles "bits": "4" from some HF configs).

I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode
    end-to-end test verifying quantized model produces lower decode
    step time than FP16 in memory-bound regime.

I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError
    covering +Inf validation gap.

Minor: Remove double blank line in root.go init().

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and
recommended blackbox mode, contradicting the three-tier auto-detection merged
in this branch. Addresses PR #698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training
pipeline, so WeightBytesPerParam only affects KV capacity, not step time.
Previously the CLI logged "quantized model detected" without mentioning
this limitation, which was misleading. Now emits an explicit warning in
both run and replay paths.

Addresses review item I4 from PR #698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address cosmetic review comments

- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

* Address PR #698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue #1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

* feat(sim): add L40S GPU and FP8 compute support

- Add L40S hardware configuration (362.05 TFLOPS, 48GB, 0.864 TB/s)
- Add TFlopsFP8 field to HardwareCalib for native FP8 tensor core support
- Update H100 with TFlopsFP8=1979.0 (2× FP16 rate) and adjusted MFU values
- Update A100-SXM and A100-80 with TFlopsFP8=0 (no native FP8 support)
- Implement FP8 compute selection in roofline model based on weight precision
- Add comprehensive tests for FP8 compute selection logic

Fixes #762

Co-Authored-By: Claude <[email protected]>

* docs: add MFU justification and validation tests

- Add inline documentation to hardware_config.json linking to Discussion #589
- Add comprehensive MFU validation tests in sim/config_test.go
  - Validates MFU ranges (0 < MFU < 1)
  - Validates MfuDecode < MfuPrefill relationship
  - Tests all GPU types (H100, A100, L40S)
- Update docs/reference/models.md with MFU calibration info box
- All tests pass

Addresses review findings from quick-review

* feat(latency): add TFlopsFP8 validation (R3)

Add validation for HardwareCalib.TFlopsFP8 in ValidateRooflineConfig,
following the same optional-field pattern as MemoryGiB: check if non-zero,
then reject NaN/Inf/negative values.

Includes test coverage for NaN and negative TFlopsFP8 cases.

* fix: address review nits

- roofline.go:217: Fix comment wording from 'upcasted' to 'dequantized to FP16 during GEMM'
- hardware_config.json:8: Restore H100 mfuDecode to 0.30 for consistency with other entries

* docs(roofline): clarify FP8 compute rate selection logic

Improve comment to explicitly document that == 1.0 is intentional:
- FP8 models (exactly 1.0 byte/param) use FP8 compute rate on H100
- Sub-FP8 formats (e.g., W4A16 at 0.5 bytes/param) dequantize to FP16 during GEMM

This addresses the review comment about == 1.0 exact equality. The behavior
is correct: only true FP8 models use FP8 tensor cores. W4A16 and other
sub-FP8 formats use FP16 compute after dequantization, as validated by
TestRooflineStepTime_FP8ComputeSelection_EdgeCases.
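The == 1.0 selection rule can be sketched as follows (function name and the rate numbers are illustrative):

```go
package main

import "fmt"

// selectTFlops sketches the FP8 compute-rate selection: only exact FP8
// weights (1.0 byte/param) use the FP8 tensor-core rate, and only when the
// GPU advertises one (tflopsFP8 > 0). Sub-FP8 formats such as W4A16
// (0.5 bytes/param) dequantize to FP16 during GEMM, so they keep the
// FP16 rate.
func selectTFlops(weightBytesPerParam, tflopsFP16, tflopsFP8 float64) float64 {
	if weightBytesPerParam == 1.0 && tflopsFP8 > 0 {
		return tflopsFP8
	}
	return tflopsFP16
}

func main() {
	fmt.Println(selectTFlops(1.0, 989.5, 1979.0)) // 1979 (FP8 model on H100)
	fmt.Println(selectTFlops(0.5, 989.5, 1979.0)) // 989.5 (W4A16 dequantizes to FP16)
	fmt.Println(selectTFlops(1.0, 312.0, 0))      // 312 (A100: no native FP8)
}
```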

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
vishakha-ramani pushed a commit to atantawi/inference-sim that referenced this pull request Mar 20, 2026
vishakha-ramani pushed a commit to atantawi/inference-sim that referenced this pull request Mar 20, 2026
* feat(latency): decouple quantized weight precision from compute dtype (inference-sim#443)

Roofline and KV capacity calculations now correctly use quantized weight
precision (e.g. 0.5 bytes/param for W4A16) for weight memory while keeping
the compute dtype (e.g. 2.0 for bfloat16) for KV cache and activations.

- Add WeightBytesPerParam field to ModelConfig with zero-value sentinel
- Parse quantization_config from HuggingFace config.json (GPTQ, AWQ, FP8)
- Add --weight-bytes-per-param CLI flag for manual override
- Update calculateMemoryAccessBytes() and computeModelWeightBytes()
- Add validation in ValidateRooflineConfig
- 18 new tests covering all behavioral contracts

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): convergence R1 — blackbox path parity, KV validation, quant warnings

Address 3 IMPORTANT findings from convergence review Round 1:

1. Apply --weight-bytes-per-param CLI override in blackbox KV auto-calc
   path (R23: code path parity with analytical backend)
2. Add WeightBytesPerParam validation in CalculateKVBlocks public API
3. Warn when quantization_config is present but weight precision could
   not be auto-detected (both analytical and blackbox paths)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): convergence R2 — early flag validation, backend-specific logs

Move --weight-bytes-per-param validation before backend switch to avoid
Fatalf inside the best-effort blackbox block (preserves fall-through
contract). Make weight precision log backend-specific: roofline uses it
for step time; trained-roofline/crossmodel only for KV capacity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt struct/var alignment after WeightBytesPerParam addition

Convergence R3: gofmt -w to re-align ModelConfig struct fields and
cmd/root.go var block after longest field name changed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* style: gofmt const/map alignment in kv_capacity files

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* docs: document MFU calibration approximation for quantized models

Add code comment in rooflineStepTime noting that MFU values were
calibrated against FP16 measurements. For quantized models (W4A16),
reduced weight bandwidth shifts the roofline crossover, producing
conservative estimates — safe for capacity planning.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

- Use strings.EqualFold for FP8 quant_method matching (case-insensitive)
- Add test for FP8 with bits=8 present (verifies bits-first path agrees)
- Document why WeightBytesPerParam > BytesPerParam is accepted in validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat(latency): complete three-tier quantized weight detection, remove CLI flag

Add compressed-tensors parsing (config_groups.*.weights.num_bits) to close
the gap where w8a8 models silently fell back to torch_dtype. Add model name
convention detection (w4a16, fp8) as a second-tier fallback. Remove the
--weight-bytes-per-param CLI flag — all three issue inference-sim#443 options are now
covered by auto-detection, making the manual override unnecessary.

Detection chain: quantization_config → model name → torch_dtype fallback.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
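The three-tier detection chain named in this commit can be sketched as a single fallthrough function. The signature and the specific name-matching rules here are illustrative assumptions; the real implementation parses `quantization_config` (including compressed-tensors `config_groups.*.weights.num_bits`) from the HF `config.json` before consulting the model name.

```go
package main

import (
	"fmt"
	"strings"
)

// detectWeightBytes sketches the chain described above:
// quantization_config bits -> model-name convention -> torch_dtype fallback.
func detectWeightBytes(quantBits int, modelName string, dtypeBytes float64) float64 {
	const bitsPerByte = 8.0
	// Tier 1: explicit bit width from quantization_config (GPTQ, AWQ,
	// FP8, compressed-tensors).
	if quantBits > 0 {
		return float64(quantBits) / bitsPerByte
	}
	// Tier 2: naming conventions such as "w4a16" or "fp8" in the model name.
	name := strings.ToLower(modelName)
	switch {
	case strings.Contains(name, "w4a16"):
		return 0.5
	case strings.Contains(name, "fp8"):
		return 1.0
	}
	// Tier 3: fall back to the compute dtype width (torch_dtype).
	return dtypeBytes
}

func main() {
	fmt.Println(detectWeightBytes(4, "some-model", 2.0))       // tier 1: 0.5
	fmt.Println(detectWeightBytes(0, "some-model-w4a16", 2.0)) // tier 2: 0.5
	fmt.Println(detectWeightBytes(0, "some-model", 2.0))       // tier 3: 2.0
}
```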

* fix(latency): address PR review — FP8 case-insensitive match, test coverage, validation comment

C1: Extract applyWeightPrecisionFallback shared helper (cmd/hfconfig.go)
    and add model name fallback to cmd/replay.go roofline path (was missing).
    Three call sites (root.go blackbox, root.go roofline, replay.go) now
    share identical fallback + logging logic.

C2: Sort config_groups keys before iteration in compressed-tensors
    parsing (INV-6 determinism).

I2: Add string-to-int coercion for quantization_config.bits field
    (handles "bits": "4" from some HF configs).

I3: Add TestRooflineStepTime_W4A16_LowerThanFP16_MemoryBoundDecode
    end-to-end test verifying quantized model produces lower decode
    step time than FP16 in memory-bound regime.

I4: Add TestValidateRooflineConfig_InfWeightBytesPerParam_ReturnsError
    covering +Inf validation gap.

Minor: Remove double blank line in root.go init().

Co-Authored-By: Claude Opus 4.6 <[email protected]>
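The I2 coercion above handles HF configs that serialize `"bits"` as either a JSON number or a string (`"bits": "4"`). A minimal sketch, assuming the value has already been decoded into an `interface{}` (Go's `encoding/json` decodes JSON numbers as `float64`); the helper name is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// coerceBits accepts a quantization_config "bits" value that may be a JSON
// number (decoded as float64) or a string, returning an error otherwise —
// matching the I5 error-handling fix for invalid string coercion.
func coerceBits(v interface{}) (int, error) {
	switch t := v.(type) {
	case float64: // JSON numbers arrive as float64
		return int(t), nil
	case string: // some HF configs emit "bits": "4"
		return strconv.Atoi(t)
	}
	return 0, fmt.Errorf("bits is neither number nor string: %T", v)
}

func main() {
	a, _ := coerceBits(float64(4)) // "bits": 4
	b, _ := coerceBits("4")        // "bits": "4"
	fmt.Println(a, b)
}
```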

* docs: update quantization support language across guides, references, and extension recipes

Five documentation pages still claimed quantization was "not yet modeled" and
recommended blackbox mode, contradicting the three-tier auto-detection merged
in this branch. Addresses PR inference-sim#698 review comments D1–D5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(cmd): warn when trained-roofline ignores quantized weight precision (I4)

Trained-roofline hardcodes FP16 bytesPerElement to match its training
pipeline, so WeightBytesPerParam only affects KV capacity, not step time.
Previously the CLI logged "quantized model detected" without mentioning
this limitation, which was misleading. Now emits an explicit warning in
both run and replay paths.

Addresses review item I4 from PR inference-sim#698 convergence review.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: address cosmetic review comments

- Add bitsPerByte constant to replace magic number 8.0 (issue #6)
- Improve roofline approximation comment with quantitative guidance (issue #8)

* Address PR inference-sim#698 review feedback: refactor quantization parsing and improve code quality

- Extract quantization parsing into parseQuantizationConfig helper (Issue #4)
- Consolidate duplicated trained-roofline warning into warnTrainedRooflineQuantization helper (Issue inference-sim#1)
- Add validation warning when WeightBytesPerParam > BytesPerParam (Issue #2)
- Add debug logging for malformed compressed-tensors config_groups (Issue #3)
- Add error handling for invalid bits string-to-int coercion (Issue #5)
- Add comment explaining first-match semantics in compressed-tensors parsing (Issue #7)
- Fix inconsistent spacing in string concatenation (Issue #9)

All changes maintain backward compatibility and pass existing tests.

* feat(sim): add L40S GPU and FP8 compute support

- Add L40S hardware configuration (362.05 TFLOPS, 48GB, 0.864 TB/s)
- Add TFlopsFP8 field to HardwareCalib for native FP8 tensor core support
- Update H100 with TFlopsFP8=1979.0 (2× FP16 rate) and adjusted MFU values
- Update A100-SXM and A100-80 with TFlopsFP8=0 (no native FP8 support)
- Implement FP8 compute selection in roofline model based on weight precision
- Add comprehensive tests for FP8 compute selection logic

Fixes inference-sim#762

Co-Authored-By: Claude <[email protected]>

* docs: add MFU justification and validation tests

- Add inline documentation to hardware_config.json linking to Discussion inference-sim#589
- Add comprehensive MFU validation tests in sim/config_test.go
  - Validates MFU ranges (0 < MFU < 1)
  - Validates MfuDecode < MfuPrefill relationship
  - Tests all GPU types (H100, A100, L40S)
- Update docs/reference/models.md with MFU calibration info box
- All tests pass

Addresses review findings from quick-review

* feat(latency): add TFlopsFP8 validation (R3)

Add validation for HardwareCalib.TFlopsFP8 in ValidateRooflineConfig,
following the same optional-field pattern as MemoryGiB: check if non-zero,
then reject NaN/Inf/negative values.

Includes test coverage for NaN and negative TFlopsFP8 cases.

* fix: address review nits

- roofline.go:217: Fix comment wording from 'upcasted' to 'dequantized to FP16 during GEMM'
- hardware_config.json:8: Restore H100 mfuDecode to 0.30 for consistency with other entries

* docs(roofline): clarify FP8 compute rate selection logic

Improve comment to explicitly document that == 1.0 is intentional:
- FP8 models (exactly 1.0 byte/param) use FP8 compute rate on H100
- Sub-FP8 formats (e.g., W4A16 at 0.5 bytes/param) dequantize to FP16 during GEMM

This addresses the review comment about == 1.0 exact equality. The behavior
is correct: only true FP8 models use FP8 tensor cores. W4A16 and other
sub-FP8 formats use FP16 compute after dequantization, as validated by
TestRooflineStepTime_FP8ComputeSelection_EdgeCases.
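The exact-equality rule documented in this commit can be sketched as a small selector. Names and the H100/A100 numbers are taken from the commit messages above (TFlopsFP8=1979.0 is stated as 2x the FP16 rate); the function itself is an illustrative stand-in for the roofline code path.

```go
package main

import "fmt"

// selectComputeTFlops applies the rule described above: only a model at
// exactly 1.0 byte/param (true FP8) uses the FP8 tensor-core rate, and only
// when the GPU advertises one (tflopsFP8 > 0). Sub-FP8 formats such as
// W4A16 (0.5 bytes/param) dequantize to FP16 during GEMM, so they keep the
// FP16 rate.
func selectComputeTFlops(weightBytes, tflopsFP16, tflopsFP8 float64) float64 {
	if weightBytes == 1.0 && tflopsFP8 > 0 {
		return tflopsFP8
	}
	return tflopsFP16
}

func main() {
	// H100-like: FP8 rate is double the FP16 rate.
	fmt.Println(selectComputeTFlops(1.0, 989.5, 1979.0)) // FP8 model: FP8 rate
	fmt.Println(selectComputeTFlops(0.5, 989.5, 1979.0)) // W4A16: FP16 rate
	// A100-like: TFlopsFP8=0 means no native FP8 support.
	fmt.Println(selectComputeTFlops(1.0, 989.5, 0)) // falls back to FP16
}
```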

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
@susiejojo deleted the feat/443-quantized-weight-bytes-per-param branch March 20, 2026 18:17
sriumcp added a commit that referenced this pull request Mar 20, 2026
* feat(infra): Phase 1A — nodes, GPUs, and instance lifecycle

Add node/GPU placement, instance lifecycle management, multi-model
routing, and per-model metrics to the cluster simulator.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* chore: add AGENTS.md to .gitignore

- AGENTS.md is a generated file for agent context
- Should not be committed to the repository

* fix(cluster): prevent double-counting in drain redirect policy

Fixes INV-1 conservation violation when DrainRedirect policy re-injects
queued requests. Previously, redirected requests were counted twice in
CompletedRequests: once when initially injected and again when completed
after redirection.

Changes:
- Add Request.Redirected field to track re-injected requests
- Mark requests as redirected in drainRedirect.Drain() before re-injection
- Skip CompletedRequests increment in recordRequestCompletion() for redirected requests
- Add TestInstanceLifecycle_RedirectDrainPreservesConservation to verify fix

The fix ensures INV-1 (request conservation) holds:
  injected = completed + queued + running + dropped + timed_out

Addresses PR #697 review feedback on drain redirect conservation.
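The INV-1 invariant stated above can be expressed as a one-line check. The struct and field names here are illustrative, not the simulator's actual metrics types; the point is that counting a redirected request's completion at both source and destination inflates the right-hand side.

```go
package main

import "fmt"

// counters holds the quantities in INV-1 (illustrative field names).
type counters struct {
	injected, completed, queued, running, dropped, timedOut int
}

// checkConservation verifies INV-1:
// injected = completed + queued + running + dropped + timed_out.
func checkConservation(c counters) bool {
	return c.injected == c.completed+c.queued+c.running+c.dropped+c.timedOut
}

func main() {
	ok := counters{injected: 10, completed: 7, queued: 2, running: 1}
	fmt.Println(checkConservation(ok)) // holds
	// A redirected request counted at both source and destination bumps
	// completed twice, breaking the invariant.
	double := counters{injected: 10, completed: 8, queued: 2, running: 1}
	fmt.Println(checkConservation(double)) // violated
}
```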

* fix: address PR #697 review comments

- Document DrainPolicy as Phase 1C infrastructure
- Fix warm-up request overcounting by checking warmUpRemaining
- Fix drain callback memory leak in MarkNodeTerminated
- Add named constants for lifecycle event priorities
- Clarify drainWait GPU release and swap-remove logic
- Add .bob/ to .gitignore

* Fix warm-up TTFT penalty implementation

- Initialize warmUpRemaining for all instances in backward-compat mode
- Fix indentation in cluster_event.go
- Update warm-up recording to track first N requests
- Adjust test expectations to account for queueing effects

Fixes build and test failures in PR.

* chore: update .gitignore to ignore .bob/notes instead of entire .bob/ directory

* feat(latency): decouple quantized weight precision from compute dtype (#698)


* feat(sim): add L40S GPU and FP8 compute support (#765)


* fix(cluster): address sriumcp review — conservation, replay anomaly, non-variadic CollectRawMetrics

- sim/simulator.go: Remove incorrect `if !req.Redirected` guard on CompletedRequests++.
  The guard caused redirected requests to vanish from INV-1 accounting: source's
  InjectedRequests=0 (drained from WaitQ before completion), destination's
  InjectedRequests=0 (skipped CompletedRequests). Destination is the sole
  completion site so incrementing there preserves conservation.

- cmd/replay.go: Add `|| rawMetrics.RoutingRejections > 0` to anomaly condition.
  Clusters where all failures are routing rejections (no routable instances)
  silently omitted the anomaly summary block (I3 from sriumcp review).

- sim/cluster/metrics.go: Make CollectRawMetrics routingRejections parameter
  explicit (non-variadic). Prevents call sites from silently passing 0 and
  missing routing rejections. Updated all test call sites to pass 0 explicitly.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* revert(ci): remove skip-cache from golangci-lint step

Unrelated to Phase 1A changes. skip-cache: true was added during
development but should not be merged to main.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* fix(cluster): address sriumcp Round 2 — test, stale comment, duplicate Metrics.Requests

- sim/cluster/instance_lifecycle_test.go: Rewrite conservation test to
  actually seed requests into inst0's WaitQ before drain. Previous version
  used a manual event loop on an empty queue — DrainWaitQueue() returned []
  and no redirection ever occurred. New version uses inst0.sim.EnqueueRequest
  directly + empty workload so Run() doesn't push duplicate arrivals. Also
  adds pre/post assertions: QueueDepth==0, inFlightRequests==0, and
  clusterEvents non-empty after drain.

- sim/request.go: Update Redirected field comment to reflect actual behavior.
  Previous comment said "completion accounting is skipped" — opposite of what
  simulator.go:recordRequestCompletion now does.

- sim/cluster/infra_lifecycle_event.go: Delete stale Metrics.Requests entry
  for redirected requests before re-injection. Source registered the request
  at EnqueueRequest time; DrainWaitQueue empties WaitQ but left the entry.
  Destination re-registers on re-enqueue, causing a spurious "duplicate
  request ID" WARN in aggregateMetrics() for every redirected request.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* docs: address sriumcp documentation gaps (GAP 1-4)

GAP 1 — configuration.md: Add node_pools and instance_lifecycle YAML
schema to the Policy Bundle section so users can discover and configure
Phase 1A features. Both are YAML-only (no CLI flags). Add a note block
explaining backward compatibility. Update DeploymentConfig row in the
config-to-flag mapping table to note the YAML-only fields.

GAP 2 — results.md Anomaly Counters: Rename "Rejected Requests" to
"Rejected Requests (Admission)" to match actual CLI output label. Add
new "Rejected Requests (Routing)" row explaining when it fires (no
routable instances — all Loading/Draining) and the remediation action.

GAP 3 — results.md Per-Model Metrics: Change mean= to p50= in the
example output block to match printPerModelMetrics which uses m.TTFT.P50.
Add tok/s to the Throughput example line to match actual output format.

GAP 4 — results.md per_model JSON: Add table documenting the per_model
key in --results-path JSON output (omitted when no model tags present),
with field-by-field description.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

---------

Co-authored-by: tantawi <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Dipanwita Guhathakurta <[email protected]>
Co-authored-by: Srinivasan Parthasarathy <[email protected]>

Successfully merging this pull request may close these issues.

roofline: quantized model BytesPerParam detection uses torch_dtype instead of actual quantization bitwidth