fix(latency): model prefill-decode interference in aggregated roofline by namasl · Pull Request #951 · inference-sim/inference-sim

namasl · 2026-04-07T00:25:44Z

Summary

Adds MixedBatchPenalty to HardwareCalib: scales effective MFU down by the minority-phase fraction when prefill and decode tokens share a step, modeling GPU inefficiency from mixed memory-access patterns and attention tile-size mismatches
Adds OverlapPenalty to HardwareCalib: replaces max(compute, memory) with max(compute, memory) + penalty × min(compute, memory), modeling imperfect compute/memory overlap from SM scheduling contention
Both parameters default to 0.0 — zero-change behavior for all existing configs and tests
CLI flags --mixed-batch-penalty and --overlap-penalty with Changed() guards (R18) allow per-run override of hardware_config.json values
Validation in ValidateRooflineConfig rejects values outside [0, 1]

Closes #949

Test plan

BC-1: MixedBatchPenalty > 0 strictly increases step time for compute-bound mixed batches
BC-2: MixedBatchPenalty has no effect on pure-prefill or pure-decode batches
BC-3: larger minority fraction (50/50 vs 90/10) produces a higher penalty ratio
BC-4: OverlapPenalty > 0 strictly increases step time for any non-empty batch
BC-5: higher OverlapPenalty produces greater or equal step time (monotonicity)
BC-6: both penalties at 0.0 reproduce the original max(compute, memory) result exactly
BC-7: ValidateRooflineConfig rejects negative, > 1, NaN, and Inf values for both fields
All pre-existing roofline, latency, and cluster tests continue to pass (go test ./...)

🤖 Generated with Claude Code

namasl · 2026-04-07T00:27:13Z

@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness - including user and developer documentation.

claude · 2026-04-07T00:27:50Z

Claude finished @namasl's task in 5m 23s —— View job

PR Review: fix(latency): model prefill-decode interference in aggregated roofline

Tasks:

Initialize review
Read changed files and parent issue
Review correctness vs parent issue Roofline latency model underestimates prefill-decode interference in aggregated mode #949
Check standards (rules.md, invariants.md, principles.md)
Review behavioral tests and non-structural tests
Check API boundaries, contracts, modularity
Check documentation correctness
Compile and post final review

Summary

The PR correctly addresses both gaps identified in issue #949 with clean, localized changes. Core implementation and test plan are solid. Three documentation gaps need attention before merge.

Correctness vs Issue #949

Gap 1 — Mixed-batch MFU degradation: ✅ Correctly implemented. The formula penalty = 1 - MixedBatchPenalty × minorityFraction divides totalComputeS by a value in [0.5, 1.0], increasing compute time proportionally to batch imbalance. At MixedBatchPenalty=0 the block is skipped entirely — exact backward compatibility.

Gap 2 — Imperfect compute/memory overlap: ✅ Correctly implemented. max(c, m) + OverlapPenalty × min(c, m) is the exact formula proposed in the issue. At OverlapPenalty=0 it reduces to max(c, m) — exact backward compatibility.

R11 (division guard): ✅ penalty ∈ [0.5, 1.0] — mathematically guaranteed since MixedBatchPenalty ∈ [0,1] and minorityFraction ∈ [0, 0.5]. Maximum degradation is ×2 at 50/50 split with MixedBatchPenalty=1. The comment on line 283–284 correctly explains this.

R18 (CLI flag precedence): ✅ Both flags use cmd.Flags().Changed() guards, preventing silent overwrite of hardware_config.json values when flags aren't explicitly provided.

R4 (construction site audit): ✅ HardwareCalib{} construction sites checked — test helpers and cluster tests use zero-valued struct literals, which correctly zero-initialize the new fields (both defaulting to "no penalty"). No missed construction sites.

Issues Found

Important

1. docs/concepts/roofline.md fields table is incomplete

Lines 99–105 list hardware_config.json fields as:

Field	Description
`TFlopsPeak`	...
`BwPeakTBs`	...
`mfuPrefill`	...
`mfuDecode`	...

mixedBatchPenalty and overlapPenalty are not in this table. A user extending the config for a new GPU has no documentation for these optional fields or their semantics. This is the primary user-facing developer reference for the roofline hardware config.

Fix this →

2. docs/guide/latency-models.md "Supported hardware" section doesn't list new fields

The "Supported hardware" section states that the required fields are TFlopsPeak, BwPeakTBs, mfuPrefill, mfuDecode, MemoryGiB. The new optional fields are not listed anywhere here. A user checking this guide to add a new GPU entry has no guidance on the penalty parameters.

Fix this →

3. docs/reference/configuration.md CLI flags table missing new flags

The configuration reference has a table of CLI flags. --mixed-batch-penalty and --overlap-penalty are not listed. They have good --help text in cmd/root.go (lines 981–988) but the reference doc is used for structured discovery.

Fix this →

Suggestions

4. Variable naming: penalty is a divisor, not an additive penalty

In roofline.go:285, the local variable is named penalty but its semantics are inverted — penalty = 0.5 means maximum degradation (2× compute time), while penalty = 1.0 means no degradation. The name reads as if higher = worse, but numerically lower = worse. A name like mfuEfficiencyFactor or computeScaleFactor would be less surprising:

// Before
penalty := 1.0 - hwConfig.MixedBatchPenalty * minorityFraction
totalComputeS /= penalty

// Clearer
mfuEfficiencyFactor := 1.0 - hwConfig.MixedBatchPenalty * minorityFraction
totalComputeS /= mfuEfficiencyFactor

The comment on lines 283–284 partially compensates but the variable name itself will confuse the next reader who isn't expecting a divisor.

5. BC-6 test is structurally coupled to formula internals

TestRooflineStepTime_ZeroPenalties_BackwardCompatible (roofline_test.go) manually reimplements calculateTransformerFlops, calculateMemoryAccessBytes, and the roofline formula to derive expected. Per the BDD/TDD standards: "Would this test still pass if the implementation were completely rewritten but the behavior preserved?" — no, because it hardcodes the two-loop structure and clampToInt64 call.

A simpler behavioral alternative: capture a known-good expected value (golden-style) or assert that zero-penalty results are <= any nonzero-penalty result for the same batch. The backward-compatibility law can be tested without reimplementing the implementation. This is a suggestion, not a blocker — the test does provide strong regression coverage.

Standards Compliance Summary

Rule	Status
R3: numeric parameter validation	✅ Both fields validated in `ValidateRooflineConfig`
R4: construction site audit	✅ No missed HardwareCalib construction sites
R6: no logrus.Fatal in sim/	✅ Error returns only
R9: pointer types for YAML zero ambiguity	⚠️ `float64` used instead of `*float64`; acceptable here since 0.0 = "no penalty" = "not set" semantically — the `Changed()` guard handles the CLI case. Not a meaningful violation given the semantics but worth noting.
R11: division guard	✅ `penalty ∈ [0.5, 1.0]` proven in comment
R18: CLI flag precedence	✅ `Changed()` guards on both flags

Test Plan Coverage

BC-1 through BC-7 tests are all present and cover the stated behavioral contracts:

BC-1 ✅: compute-bound mixed batch increases with penalty
BC-2 ✅: pure-prefill and pure-decode unaffected (two separate tests)
BC-3 ✅: larger minority fraction → higher penalty ratio (verified via ratio comparison, not absolute time, to handle different total token counts — correct approach)
BC-4 ✅: overlap penalty increases step time for any non-empty batch
BC-5 ✅: monotonicity of overlap penalty
BC-6 ✅: zero penalties reproduce original formula (structurally coupled, see suggestion 5)
BC-7 ✅: validation rejects out-of-range and NaN/Inf values for both fields

The testHardwareCalib() helper correctly leaves the new fields zero (Go zero-value), so all 235 pre-existing roofline tests act as BC-6-equivalent coverage implicitly.

Verdict

Ready to merge after addressing documentation gaps 1–3. The implementation is correct, backward-compatible, localized to rooflineStepTime, and well-tested. Issues 1–3 are user-facing documentation that new contributors and GPU calibrators will hit. Issues 4–5 are low-priority suggestions.

namasl · 2026-04-07T00:54:12Z

@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness - including user and developer documentation.

claude · 2026-04-07T00:54:38Z

Claude finished @namasl's task in 1m 17s —— View job

PR Review: fix(latency): model prefill-decode interference in aggregated roofline

Tasks:

Initialize review
Read changed files and parent issue
Review correctness vs parent issue Roofline latency model underestimates prefill-decode interference in aggregated mode #949
Check standards (rules.md, invariants.md, principles.md)
Review behavioral tests and non-structural tests
Check API boundaries, contracts, modularity
Check documentation correctness and completeness
Compile and post final review

Summary

The PR correctly addresses both gaps from issue #949 with clean, localized changes. The implementation is sound and backward-compatible. The previous automated review incorrectly flagged three documentation gaps that are all already present in this PR's diff. Three test quality issues need attention — one of which makes a core test vacuous.

Correctness vs Issue #949

Gap 1 — Mixed-batch MFU degradation: ✅ mfuScaleFactor = 1 - MixedBatchPenalty × minorityFraction divides totalComputeS by a value in [0.5, 1.0], increasing compute time proportionally to batch imbalance. Guard condition MixedBatchPenalty > 0 && len(Prefill) > 0 && len(Decode) > 0 ensures the block is fully skipped at default 0 — exact backward compatibility.

Gap 2 — Imperfect compute/memory overlap: ✅ max(c, m) + OverlapPenalty × min(c, m) is exactly the formula proposed in the issue. At OverlapPenalty=0 it reduces to max(c, m) — exact backward compatibility.

R11 (division guard): ✅ mfuScaleFactor ∈ [0.5, 1.0] — mathematically proven (MixedBatchPenalty ∈ [0,1] × minorityFraction ∈ [0, 0.5] ≤ 0.5). Comment on lines 283–284 documents this. Division by totalTokens at line 281 is guarded by the outer condition requiring both PrefillRequests and DecodeRequests to be non-empty.

R18 (CLI flag precedence): ✅ Both flags use cmd.Flags().Changed() guards in resolveLatencyConfig.

R4 (construction site audit): ✅ Only two HardwareCalib{} production sites exist (config.go:89,98), both error returns that correctly zero-initialize. All test sites either use the blackbox backend (penalty never reached) or zero-initialize correctly.

Issues Found

Important — Test Quality

1. BC-3 is vacuous: the assertion is never exercised

TestRooflineStepTime_MixedBatchPenalty_SymmetricMinority (roofline_test.go:1211) is intended to verify that a 50/50 token split produces a higher penalty ratio than a 90/10 split. However, unlike BC-1 which explicitly sets BwPeakTBs = 1e6 to force compute-bound behavior, BC-3 uses testHardwareCalib() with standard H100 specs (BwPeakTBs = 3.35).

MixedBatchPenalty exclusively scales totalComputeS. For a 32-layer/4096-hidden model, weight bandwidth alone (~15 GB / 3.35 TB/s ≈ 4.5 ms) dwarfs compute time for 50–100 token batches (~256 GFLOPS / (989 TFLOPS × 0.55) ≈ 0.47 ms). Both batches are memory-bound, so max(totalComputeS, totalMemoryS) = totalMemoryS regardless of the penalty, and both ratios equal 1.0. The assertion ratio5050 >= ratio9010 becomes 1.0 >= 1.0 — trivially true. Deleting the entire MixedBatchPenalty block from roofline.go would not cause this test to fail.

Fix: add hc.BwPeakTBs = 1e6 and hcNoPenalty.BwPeakTBs = 1e6 in BC-3, matching the established technique from BC-1.

Fix this →

2. Flag parity tests don't cover new flags

Two tests in cmd/simconfig_shared_test.go maintain hardcoded flag lists:

TestRunCmd_SimConfigFlagsParity (line 76) — missing "mixed-batch-penalty" and "overlap-penalty"
TestBothCommands_SimConfigFlagsHaveIdenticalDefaults (line 163) — also missing both new flags

While the flags ARE correctly registered for both commands via registerSimConfigFlags (replay.go:303), these parity guards exist specifically to catch future regressions. Omitting the new flags creates a blind spot.

Fix this →

Suggestions

3. No test for MixedBatchPenalty on a memory-bound mixed batch (no-op confirmation)

BC-2 tests pure-prefill and pure-decode batches. It doesn't test the case where a mixed batch is memory-bound — the penalty should still be a no-op in this regime (the penalty applies to totalComputeS, which is not the bottleneck). This would be a useful regression guard in case someone accidentally applies the penalty to totalMemoryS instead.

4. OverlapPenalty = 1.0 boundary case has no behavioral test

At OverlapPenalty = 1.0, the formula reduces to c + m (additive rather than max). BC-7 validates that 1.0 is accepted, but no behavioral test checks the resulting step time equals totalComputeS + totalMemoryS. This is the most extreme user-facing case and a natural invariant to verify.

5. BC-6 is structurally coupled (per BDD/TDD standards)

TestRooflineStepTime_ZeroPenalties_BackwardCompatible reimplements calculateTransformerFlops, calculateMemoryAccessBytes, and clampToInt64 inline. Per the project's refactoring survival test ("Would this test still pass if the implementation were completely rewritten?") — no, it would fail. The behavioral contract ("zero penalties produce identical output to pre-#949") could be tested with a physics invariant instead: for a memory-bound step, result × peakBW ≈ totalBytes; compare against the existing TestRooflineStepTime_SingleCrossover_MemoryBoundDecode approach. This is low priority but noted for completeness.

Minor Documentation Gap

6. roofline.md Section 2 formula is outdated

Sections 2–4 of docs/concepts/roofline.md describe the mathematical formula as max(FLOPs/Peak, Bytes/BW). After this PR, the actual formula is max(c, m) + OverlapPenalty × min(c, m) for non-default configs. The new fields are correctly documented in the table (lines 104–105), but the conceptual formula in the overview section doesn't reflect the change.

Fix this →

Correction to Previous Review

The previous automated review identified three "Important" documentation gaps. All three are already present in this PR:

docs/concepts/roofline.md lines 104–105: ✅ both new fields listed in the table
docs/guide/latency-models.md line 70: ✅ "optional interference fields" added
docs/reference/configuration.md: ✅ both CLI flags added to the table

Standards Compliance

Rule	Status	Notes
R1: No silent data loss	✅
R2: Map key sorting	✅	No map iteration in new code
R3: Numeric validation	✅	NaN/Inf/<0/>1 all rejected; NaN correctly passes `!= 0` guard
R4: Construction site audit	✅	Only error-return sites in production code; all zero-initialize correctly
R6: No logrus.Fatal in sim/	✅
R9: *float64 for YAML zero-value	⚠️	`float64` used; acceptable since 0.0 ≡ "not set" for these fields
R11: Division guard	✅	`mfuScaleFactor ∈ [0.5, 1.0]` proven; `totalTokens` guarded by non-empty slice condition
R18: CLI flag precedence	✅	`Changed()` guards on both flags

Verdict

Recommend fixing issues 1–2 before merge — BC-3 is currently vacuous (deleting the implementation wouldn't break it), and the parity test omission leaves a regression gap. Issues 3–6 can go to a follow-up.

namasl · 2026-04-07T01:14:01Z

@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness - including user and developer documentation.

claude · 2026-04-07T01:14:27Z

Claude finished @namasl's task in 3m 47s —— View job

PR Review: fix(latency): model prefill-decode interference in aggregated roofline

Tasks:

Initialize review
Read changed files and gather context
Review correctness vs parent issue Roofline latency model underestimates prefill-decode interference in aggregated mode #949
Check standards (rules.md, invariants.md, principles.md)
Review behavioral tests and non-structural tests
Check API boundaries, contracts, modularity
Check documentation correctness and completeness
Compile and post final review

Summary

This is the third review pass. The first two automated reviews identified issues that the author has since resolved. I re-read all 11 changed files from scratch against the current HEAD. All important issues are resolved. The PR is ready to merge, with two residual low-priority suggestions and one new observation.

Correctness vs Issue #949

Gap 1 — Mixed-batch MFU degradation: ✅ Formula correct. Guard condition MixedBatchPenalty > 0 && len(PrefillRequests) > 0 && len(DecodeRequests) > 0 ensures exact backward compatibility at default 0. The variable is named mfuScaleFactor (fixed from penalty in previous review). Division guard proven in comment: mfuScaleFactor ∈ [0.5, 1.0] since MixedBatchPenalty ∈ [0,1] and minorityFraction ∈ [0, 0.5].

Gap 2 — Imperfect compute/memory overlap: ✅ Formula max(c,m) + OverlapPenalty × min(c,m) exactly matches the issue proposal. At OverlapPenalty=0 reduces to max(c,m) — exact backward compatibility.

Scope note: The penalties apply only to RooflineLatencyModel (--latency-model roofline). The TrainedRooflineLatencyModel (--latency-model trained-roofline) has its own learned corrections in trained_roofline.go and is not affected by these parameters. This is architecturally correct — the trained model's beta coefficients capture these effects through regression — but worth noting for users who configure hardware_config.json penalties and then switch backends.

Previous Review Issues — Resolution Status

All items flagged as "Important" in the second automated review have been resolved in the latest two commits:

Issue	Status
BC-3 vacuous (memory-bound regime)	✅ Fixed: `hc.BwPeakTBs = 1e6` added to both `hc` and `hcNoPenalty` in BC-3
Flag parity tests missing new flags	✅ Fixed: `"mixed-batch-penalty"` and `"overlap-penalty"` added to both lists in `cmd/simconfig_shared_test.go`
Formula docs outdated in `roofline.md`	✅ Fixed: conceptual section now documents both corrections with formulas
`roofline.md` table missing new fields	✅ Present: lines 109–110 in current file
`latency-models.md` missing optional fields	✅ Present: line 70 mentions interference fields
`configuration.md` missing CLI flags	✅ Present: both flags added to the table
Variable name `penalty` (inverted semantics)	✅ Fixed: renamed to `mfuScaleFactor`

Standards Compliance

Rule	Status	Notes
R1: No silent data loss	✅
R3: Numeric validation	✅	NaN, Inf, <0, >1 all rejected. `NaN != 0` is `true` in IEEE 754 so NaN correctly enters the validation block; `IsNaN` then catches it. `-Inf < 0` and `+Inf > 1` also caught by range checks.
R4: Construction site audit	✅	Both production construction sites (`config.go:89,98`) are error-return paths that zero-initialize correctly. All test sites either use blackbox backend or zero-initialize.
R6: No logrus.Fatal in sim/	✅
R9: `*float64` for YAML zero-value ambiguity	N/A	Hardware config uses `encoding/json` (non-strict, unknown fields ignored). 0.0 ≡ "not set" is semantically unambiguous here — no Inf/NaN values masquerade as "unset".
R11: Division guard	✅	`mfuScaleFactor ∈ [0.5, 1.0]` proven. `totalTokens` division guarded by non-empty slice precondition.
R18: CLI flag precedence	✅	`Changed()` guards on both flags in `resolveLatencyConfig`.

Test Coverage Assessment

All BC-1 through BC-7 tests are present and, after the BC-3 fix, non-vacuous:

Test	Verdict
BC-1: penalty strictly increases compute-bound mixed batch	✅ Non-vacuous: `BwPeakTBs=1e6` forces compute-bound
BC-2: pure-prefill and pure-decode unaffected	✅ Uses standard H100 — memory-bound, penalty correctly skipped by guard
BC-3: 50/50 ratio ≥ 90/10 ratio	✅ Fixed: both `hc` and `hcNoPenalty` now use `BwPeakTBs=1e6`
BC-4: overlap penalty increases step time for any non-empty batch	✅ Valid: `min(compute,memory) > 0` always for non-empty batch
BC-5: overlap penalty monotonicity	✅
BC-6: zero penalties reproduce original formula	✅ (structurally coupled — see suggestion below)
BC-7: validation rejects out-of-range and NaN/Inf	✅ 6 tests covering negative, >1, and NaN for each field; Inf caught by range tests

Pre-existing tests serve as implicit BC-6 coverage: all 235+ prior roofline tests use testHardwareCalib() which zero-initializes both new fields, continuously validating backward compatibility.

Remaining Suggestions (non-blocking)

1. BC-6 is structurally coupled to implementation internals

TestRooflineStepTime_ZeroPenalties_BackwardCompatible reimplements calculateTransformerFlops, calculateMemoryAccessBytes, and clampToInt64 inline to derive expected. Per the BDD/TDD principles ("Would this test still pass if the implementation were completely rewritten?"): no, it would fail if the internal FLOP formula changed, even if the observable behavior (physics) was preserved. The 235+ pre-existing tests already provide this backward-compatibility guarantee implicitly. This is a suggestion, not a blocker — the test provides strong regression coverage.

2. No behavioral test for OverlapPenalty = 1.0 boundary

At OverlapPenalty=1.0 the formula reduces to compute + memory (fully additive). BC-7 validates that 1.0 is accepted, but no test asserts the resulting step time equals totalComputeS + totalMemoryS. This is the most extreme user-configurable case and a natural physics invariant to verify. Not required for this PR's stated goals.

Verdict

Ready to merge. All critical and important issues from the two prior reviews have been resolved. The implementation is correct, backward-compatible, localized to rooflineStepTime and ValidateRooflineConfig, and well-tested. The two remaining suggestions above are follow-up candidates, not blockers.

inference-sim#949) Two gaps in rooflineStepTime() caused aggregated (colocated) mode to appear better than real systems, masking the advantages of PD disaggregation: Gap 1 — Mixed-batch MFU degradation: added MixedBatchPenalty to HardwareCalib. When both prefill and decode tokens share a step, compute time is scaled by 1/(1 - penalty * minority_fraction), modeling GPU inefficiency from mixed memory-access patterns and attention kernel tile-size mismatches. Gap 2 — Imperfect compute/memory overlap: added OverlapPenalty to HardwareCalib. Replaces max(compute, memory) with max(compute, memory) + penalty * min(compute, memory), modeling SM scheduling contention between compute-bound prefill and memory-bound decode work. Both parameters default to 0.0 for exact backward compatibility. Validation in ValidateRooflineConfig enforces [0, 1] range. CLI flags --mixed-batch-penalty and --overlap-penalty override hardware_config.json values (R18 Changed() guard). hardware_config.json updated with explicit 0.0 entries for all GPU profiles. Behavioral contracts verified: BC-1 (penalty strictly increases compute-bound mixed batches), BC-2 (no effect on pure-phase batches), BC-3 (larger minority fraction → more penalty), BC-4/BC-5 (overlap penalty monotonicity), BC-6 (zero penalties reproduce original max() formula exactly), BC-7 (validation rejects out-of-range inputs). Closes inference-sim#949 Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

…variable name Add mixedBatchPenalty and overlapPenalty to hardware_config.json fields table in roofline.md, latency-models.md supported hardware section, and configuration.md CLI flags table. Rename local `penalty` variable to `mfuScaleFactor` in rooflineStepTime for clarity since lower values mean greater degradation (it's a divisor, not an additive penalty). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…ula docs - BC-3 (SymmetricMinority): set BwPeakTBs=1e6 on both hc and hcNoPenalty to force compute-bound regime, making the penalty ratio assertion non-trivial (was previously 1.0 >= 1.0 due to memory-bound batches) - Add mixed-batch-penalty and overlap-penalty to both flag parity tests in simconfig_shared_test.go (TestRunCmd_SimConfigFlagsParity and TestBothCommands_SimConfigFlagsHaveIdenticalDefaults) - Update roofline.md Section 2 formula to document the optional mixed-batch and overlap penalty corrections Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

namasl · 2026-04-08T14:00:04Z

@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness.

claude · 2026-04-08T14:00:31Z

Claude finished @namasl's task in 6m 0s —— View job

PR Review: fix(latency): model prefill-decode interference in aggregated roofline

Tasks:

Initialize review
Read changed files from current HEAD
Review correctness vs parent issue Roofline latency model underestimates prefill-decode interference in aggregated mode #949
Check standards compliance (rules.md, invariants.md, principles.md)
Review behavioral tests and non-structural tests
Check API boundaries, contracts, modularity
Check documentation correctness and completeness
Run tests (Go not available in this environment — reviewed by code inspection + CI)
Compile and post final review

Summary

Fourth review pass. All important issues from prior reviews have been resolved. The implementation is correct, backward-compatible, and well-tested. One new minor observation; two low-priority suggestions from review #3 remain open. Ready to merge.

Correctness vs Issue #949

Gap 1 — Mixed-batch MFU degradation (roofline.go:329-341): ✅

Formula: mfuScaleFactor = 1.0 - MixedBatchPenalty × minorityFraction; divides totalComputeS.
Guard: MixedBatchPenalty > 0 && len(PrefillRequests) > 0 && len(DecodeRequests) > 0 — block fully skipped at default 0.
Range proof in comment: mfuScaleFactor ∈ [0.5, 1.0] since MixedBatchPenalty ∈ [0,1] and minorityFraction ∈ [0, 0.5]. No division-by-zero possible.
Variable name mfuScaleFactor (not the confusing penalty from earlier drafts) — correct.

Gap 2 — Imperfect compute/memory overlap (roofline.go:361): ✅

Formula: max(c,m) + OverlapPenalty × min(c,m) — exact match to issue proposal.
At OverlapPenalty=0: reduces to max(c,m) — exact backward compatibility.

Scope boundary (note for reviewers): Both penalties apply only to --latency-model roofline. The trained-roofline backend learns its own correction coefficients (the beta array) through regression — it correctly does not use these parameters. This is architecturally sound but worth noting for users who configure hardware_config.json penalties and then switch backends.

Resolution Status of Prior Review Issues

Issue	Status
BC-3 vacuous (memory-bound regime)	✅ Fixed: `BwPeakTBs = 1e6` on both `hc` and `hcNoPenalty`
Flag parity tests missing new flags	✅ Fixed: both flags in `simconfig_shared_test.go:80,171`
`roofline.md` formula section outdated	✅ Fixed: Section 2 documents both corrections with formulas
`roofline.md` table missing new fields	✅ Present: lines 111–112
`latency-models.md` missing interference fields	✅ Present: line 75
`configuration.md` missing CLI flags	✅ Present: lines 159–160
Variable name `penalty` (inverted semantics)	✅ Fixed: renamed to `mfuScaleFactor`

Standards Compliance

Rule	Status	Notes
R3: Numeric validation	✅	NaN, Inf, negative, >1 all rejected. `NaN != 0` is `true` in IEEE 754 — NaN correctly enters validation block; `IsNaN` catches it.
R4: Construction site audit	✅	Both production `HardwareCalib{}` sites are error-return paths that zero-initialize correctly. No missed sites.
R6: No `logrus.Fatal` in `sim/`	✅
R9: `*float64` for YAML zero-value	N/A	Uses `encoding/json` (lenient), not strict YAML. 0.0 ≡ "not set" is unambiguous.
R11: Division guard	✅	`mfuScaleFactor ∈ [0.5, 1.0]` proven. `totalTokens` guarded by non-empty slice precondition on the outer `if`.
R18: CLI flag precedence	✅	`Changed()` guards at `root.go:685,688`.

Hardware config backward compatibility: encoding/json zero-initializes missing fields. Existing hardware_config.json files without mixedBatchPenalty/overlapPenalty will parse correctly with both = 0.0 (no-penalty behavior). ✅

Test Coverage

Test	Status
BC-1: Penalty strictly increases compute-bound mixed batch	✅ Non-vacuous: `BwPeakTBs=1e6` forces compute-bound
BC-2: Pure-prefill/decode unaffected	✅ Guard condition skips block entirely
BC-3: 50/50 ratio ≥ 90/10 ratio	✅ Fixed: both `hc` and `hcNoPenalty` use `BwPeakTBs=1e6`
BC-4: OverlapPenalty increases step time for non-empty batch	✅ `min(compute,memory) > 0` guaranteed
BC-5: OverlapPenalty monotonicity	✅
BC-6: Zero penalties reproduce original formula	✅ (structurally coupled — see suggestion below)
BC-7: Validation rejects negative, >1, NaN for both fields	✅ 6 reject tests + 1 valid-boundary table test

Pre-existing 235+ roofline tests (using testHardwareCalib() with zero-initialized new fields) serve as continuous BC-6-equivalent regression coverage.

New Observation

CLI validation pathway for out-of-range penalty flags

If a user passes --mixed-batch-penalty 1.5, the Changed() guard sets hwConfig.MixedBatchPenalty = 1.5 without early CLI-side validation. The value is caught downstream: ValidateRooflineConfig inside NewLatencyModel returns an error → NewInstanceSimulator panics with a descriptive message: "NewInstanceSimulator(…): NewLatencyModel: latency model: invalid roofline config: HardwareCalib.MixedBatchPenalty must be in [0, 1] when set, got 1.5".

This is consistent with how other NewInstanceSimulator failures behave. The panic message is clear. An early logrus.Fatalf in resolveLatencyConfig (matching the alpha/beta validation pattern at root.go:333-339) would be more user-friendly, but this is a minor polish item, not a blocker.

Remaining Low-Priority Suggestions (unchanged from review #3)

1. BC-6 is structurally coupled to implementation internals

TestRooflineStepTime_ZeroPenalties_BackwardCompatible (roofline_test.go:1447) reimplements calculateTransformerFlops, calculateMemoryAccessBytes, and clampToInt64 inline. Per BDD/TDD standards ("Would this test still pass if the implementation were completely rewritten?"): no. The 235+ pre-existing tests already provide the backward-compatibility guarantee implicitly. Consider replacing with a simpler physics invariant instead. Not a blocker.

2. No behavioral test for OverlapPenalty = 1.0 boundary

At OverlapPenalty=1.0, the formula reduces to compute + memory (fully additive). BC-7 validates that 1.0 is accepted, but no test asserts that the resulting step time equals totalComputeS + totalMemoryS. This is the most extreme user-configurable case and a natural physics invariant. Not a blocker.

Verdict

Ready to merge. The implementation is correct, backward-compatible, localized to rooflineStepTime and ValidateRooflineConfig, and all critical/important issues from prior reviews have been resolved. The two remaining suggestions above and the new CLI-validation observation are polish items for a follow-up, not blockers.

namasl marked this pull request as draft April 8, 2026 13:38

namasl and others added 3 commits April 8, 2026 13:59

namasl force-pushed the fix/949-roofline-prefill-decode-interference branch from 3a25c4c to fa09485 Compare April 8, 2026 13:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(latency): model prefill-decode interference in aggregated roofline#951

fix(latency): model prefill-decode interference in aggregated roofline#951
namasl wants to merge 3 commits intoinference-sim:mainfrom
namasl:fix/949-roofline-prefill-decode-interference

namasl commented Apr 7, 2026

Uh oh!

namasl commented Apr 7, 2026

Uh oh!

claude Bot commented Apr 7, 2026 •

edited

Loading

Uh oh!

namasl commented Apr 7, 2026

Uh oh!

claude Bot commented Apr 7, 2026 •

edited

Loading

Uh oh!

namasl commented Apr 7, 2026

Uh oh!

claude Bot commented Apr 7, 2026 •

edited

Loading

Uh oh!

namasl commented Apr 8, 2026

Uh oh!

claude Bot commented Apr 8, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

namasl commented Apr 7, 2026

Summary

Test plan

Uh oh!

namasl commented Apr 7, 2026

Uh oh!

claude Bot commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: fix(latency): model prefill-decode interference in aggregated roofline

Summary

Correctness vs Issue #949

Issues Found

Important

Suggestions

Standards Compliance Summary

Test Plan Coverage

Verdict

Uh oh!

namasl commented Apr 7, 2026

Uh oh!

claude Bot commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: fix(latency): model prefill-decode interference in aggregated roofline

Summary

Correctness vs Issue #949

Issues Found

Important — Test Quality

Suggestions

Minor Documentation Gap

Correction to Previous Review

Standards Compliance

Verdict

Uh oh!

namasl commented Apr 7, 2026

Uh oh!

claude Bot commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: fix(latency): model prefill-decode interference in aggregated roofline

Summary

Correctness vs Issue #949

Previous Review Issues — Resolution Status

Standards Compliance

Test Coverage Assessment

Remaining Suggestions (non-blocking)

Verdict

Uh oh!

namasl commented Apr 8, 2026

Uh oh!

claude Bot commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: fix(latency): model prefill-decode interference in aggregated roofline

Summary

Correctness vs Issue #949

Resolution Status of Prior Review Issues

Standards Compliance

Test Coverage

New Observation

Remaining Low-Priority Suggestions (unchanged from review #3)

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Apr 7, 2026 •

edited

Loading

claude Bot commented Apr 7, 2026 •

edited

Loading

claude Bot commented Apr 7, 2026 •

edited

Loading

claude Bot commented Apr 8, 2026 •

edited

Loading