Skip to content

fix(latency): model prefill-decode interference in aggregated roofline#951

Draft
namasl wants to merge 3 commits intoinference-sim:mainfrom
namasl:fix/949-roofline-prefill-decode-interference
Draft

fix(latency): model prefill-decode interference in aggregated roofline#951
namasl wants to merge 3 commits intoinference-sim:mainfrom
namasl:fix/949-roofline-prefill-decode-interference

Conversation

@namasl
Copy link
Copy Markdown
Contributor

@namasl namasl commented Apr 7, 2026

Summary

  • Adds MixedBatchPenalty to HardwareCalib: scales effective MFU down by the minority-phase fraction when prefill and decode tokens share a step, modeling GPU inefficiency from mixed memory-access patterns and attention tile-size mismatches
  • Adds OverlapPenalty to HardwareCalib: replaces max(compute, memory) with max(compute, memory) + penalty × min(compute, memory), modeling imperfect compute/memory overlap from SM scheduling contention
  • Both parameters default to 0.0 — zero-change behavior for all existing configs and tests
  • CLI flags --mixed-batch-penalty and --overlap-penalty with Changed() guards (R18) allow per-run override of hardware_config.json values
  • Validation in ValidateRooflineConfig rejects values outside [0, 1]

Closes #949

Test plan

  • BC-1: MixedBatchPenalty > 0 strictly increases step time for compute-bound mixed batches
  • BC-2: MixedBatchPenalty has no effect on pure-prefill or pure-decode batches
  • BC-3: larger minority fraction (50/50 vs 90/10) produces a higher penalty ratio
  • BC-4: OverlapPenalty > 0 strictly increases step time for any non-empty batch
  • BC-5: higher OverlapPenalty produces greater or equal step time (monotonicity)
  • BC-6: both penalties at 0.0 reproduce the original max(compute, memory) result exactly
  • BC-7: ValidateRooflineConfig rejects negative, > 1, NaN, and Inf values for both fields
  • All pre-existing roofline, latency, and cluster tests continue to pass (go test ./...)

🤖 Generated with Claude Code

@namasl
Copy link
Copy Markdown
Contributor Author

namasl commented Apr 7, 2026

@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness - including user and developer documentation.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 7, 2026

Claude finished @namasl's task in 5m 23s —— View job


PR Review: fix(latency): model prefill-decode interference in aggregated roofline

Tasks:


Summary

The PR correctly addresses both gaps identified in issue #949 with clean, localized changes. Core implementation and test plan are solid. Three documentation gaps need attention before merge.


Correctness vs Issue #949

Gap 1 — Mixed-batch MFU degradation: ✅ Correctly implemented. The formula penalty = 1 - MixedBatchPenalty × minorityFraction divides totalComputeS by a value in [0.5, 1.0], increasing compute time proportionally to batch imbalance. At MixedBatchPenalty=0 the block is skipped entirely — exact backward compatibility.

Gap 2 — Imperfect compute/memory overlap: ✅ Correctly implemented. max(c, m) + OverlapPenalty × min(c, m) is the exact formula proposed in the issue. At OverlapPenalty=0 it reduces to max(c, m) — exact backward compatibility.

R11 (division guard):penalty ∈ [0.5, 1.0] — mathematically guaranteed since MixedBatchPenalty ∈ [0,1] and minorityFraction ∈ [0, 0.5]. Maximum degradation is ×2 at 50/50 split with MixedBatchPenalty=1. The comment on line 283–284 correctly explains this.

R18 (CLI flag precedence): ✅ Both flags use cmd.Flags().Changed() guards, preventing silent overwrite of hardware_config.json values when flags aren't explicitly provided.

R4 (construction site audit):HardwareCalib{} construction sites checked — test helpers and cluster tests use zero-valued struct literals, which correctly zero-initialize the new fields (both defaulting to "no penalty"). No missed construction sites.


Issues Found

Important

1. docs/concepts/roofline.md fields table is incomplete

Lines 99–105 list hardware_config.json fields as:

Field Description
TFlopsPeak ...
BwPeakTBs ...
mfuPrefill ...
mfuDecode ...

mixedBatchPenalty and overlapPenalty are not in this table. A user extending the config for a new GPU has no documentation for these optional fields or their semantics. This is the primary user-facing developer reference for the roofline hardware config.

Fix this →

2. docs/guide/latency-models.md "Supported hardware" section doesn't list new fields

The "Supported hardware" section states that the required fields are TFlopsPeak, BwPeakTBs, mfuPrefill, mfuDecode, MemoryGiB. The new optional fields are not listed anywhere here. A user checking this guide to add a new GPU entry has no guidance on the penalty parameters.

Fix this →

3. docs/reference/configuration.md CLI flags table missing new flags

The configuration reference has a table of CLI flags. --mixed-batch-penalty and --overlap-penalty are not listed. They have good --help text in cmd/root.go (lines 981–988) but the reference doc is used for structured discovery.

Fix this →

Suggestions

4. Variable naming: penalty is a divisor, not an additive penalty

In roofline.go:285, the local variable is named penalty but its semantics are inverted — penalty = 0.5 means maximum degradation (2× compute time), while penalty = 1.0 means no degradation. The name reads as if higher = worse, but numerically lower = worse. A name like mfuEfficiencyFactor or computeScaleFactor would be less surprising:

// Before
penalty := 1.0 - hwConfig.MixedBatchPenalty * minorityFraction
totalComputeS /= penalty

// Clearer
mfuEfficiencyFactor := 1.0 - hwConfig.MixedBatchPenalty * minorityFraction
totalComputeS /= mfuEfficiencyFactor

The comment on lines 283–284 partially compensates but the variable name itself will confuse the next reader who isn't expecting a divisor.

5. BC-6 test is structurally coupled to formula internals

TestRooflineStepTime_ZeroPenalties_BackwardCompatible (roofline_test.go) manually reimplements calculateTransformerFlops, calculateMemoryAccessBytes, and the roofline formula to derive expected. Per the BDD/TDD standards: "Would this test still pass if the implementation were completely rewritten but the behavior preserved?" — no, because it hardcodes the two-loop structure and clampToInt64 call.

A simpler behavioral alternative: capture a known-good expected value (golden-style) or assert that zero-penalty results are <= any nonzero-penalty result for the same batch. The backward-compatibility law can be tested without reimplementing the implementation. This is a suggestion, not a blocker — the test does provide strong regression coverage.


Standards Compliance Summary

Rule Status
R3: numeric parameter validation ✅ Both fields validated in ValidateRooflineConfig
R4: construction site audit ✅ No missed HardwareCalib construction sites
R6: no logrus.Fatal in sim/ ✅ Error returns only
R9: pointer types for YAML zero ambiguity ⚠️ float64 used instead of *float64; acceptable here since 0.0 = "no penalty" = "not set" semantically — the Changed() guard handles the CLI case. Not a meaningful violation given the semantics but worth noting.
R11: division guard penalty ∈ [0.5, 1.0] proven in comment
R18: CLI flag precedence Changed() guards on both flags

Test Plan Coverage

BC-1 through BC-7 tests are all present and cover the stated behavioral contracts:

  • BC-1 ✅: compute-bound mixed batch increases with penalty
  • BC-2 ✅: pure-prefill and pure-decode unaffected (two separate tests)
  • BC-3 ✅: larger minority fraction → higher penalty ratio (verified via ratio comparison, not absolute time, to handle different total token counts — correct approach)
  • BC-4 ✅: overlap penalty increases step time for any non-empty batch
  • BC-5 ✅: monotonicity of overlap penalty
  • BC-6 ✅: zero penalties reproduce original formula (structurally coupled, see suggestion 5)
  • BC-7 ✅: validation rejects out-of-range and NaN/Inf values for both fields

The testHardwareCalib() helper correctly leaves the new fields zero (Go zero-value), so all 235 pre-existing roofline tests act as BC-6-equivalent coverage implicitly.


Verdict

Ready to merge after addressing documentation gaps 1–3. The implementation is correct, backward-compatible, localized to rooflineStepTime, and well-tested. Issues 1–3 are user-facing documentation that new contributors and GPU calibrators will hit. Issues 4–5 are low-priority suggestions.

@namasl
Copy link
Copy Markdown
Contributor Author

namasl commented Apr 7, 2026

@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness - including user and developer documentation.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 7, 2026

Claude finished @namasl's task in 1m 17s —— View job


PR Review: fix(latency): model prefill-decode interference in aggregated roofline

Tasks:


Summary

The PR correctly addresses both gaps from issue #949 with clean, localized changes. The implementation is sound and backward-compatible. The previous automated review incorrectly flagged three documentation gaps that are all already present in this PR's diff. Three test quality issues need attention — one of which makes a core test vacuous.


Correctness vs Issue #949

Gap 1 — Mixed-batch MFU degradation:mfuScaleFactor = 1 - MixedBatchPenalty × minorityFraction divides totalComputeS by a value in [0.5, 1.0], increasing compute time proportionally to batch imbalance. Guard condition MixedBatchPenalty > 0 && len(Prefill) > 0 && len(Decode) > 0 ensures the block is fully skipped at default 0 — exact backward compatibility.

Gap 2 — Imperfect compute/memory overlap:max(c, m) + OverlapPenalty × min(c, m) is exactly the formula proposed in the issue. At OverlapPenalty=0 it reduces to max(c, m) — exact backward compatibility.

R11 (division guard):mfuScaleFactor ∈ [0.5, 1.0] — mathematically proven (MixedBatchPenalty ∈ [0,1] × minorityFraction ∈ [0, 0.5] ≤ 0.5). Comment on lines 283–284 documents this. Division by totalTokens at line 281 is guarded by the outer condition requiring both PrefillRequests and DecodeRequests to be non-empty.

R18 (CLI flag precedence): ✅ Both flags use cmd.Flags().Changed() guards in resolveLatencyConfig.

R4 (construction site audit): ✅ Only two HardwareCalib{} production sites exist (config.go:89,98), both error returns that correctly zero-initialize. All test sites either use the blackbox backend (penalty never reached) or zero-initialize correctly.


Issues Found

Important — Test Quality

1. BC-3 is vacuous: the assertion is never exercised

TestRooflineStepTime_MixedBatchPenalty_SymmetricMinority (roofline_test.go:1211) is intended to verify that a 50/50 token split produces a higher penalty ratio than a 90/10 split. However, unlike BC-1 which explicitly sets BwPeakTBs = 1e6 to force compute-bound behavior, BC-3 uses testHardwareCalib() with standard H100 specs (BwPeakTBs = 3.35).

MixedBatchPenalty exclusively scales totalComputeS. For a 32-layer/4096-hidden model, weight bandwidth alone (~15 GB / 3.35 TB/s ≈ 4.5 ms) dwarfs compute time for 50–100 token batches (~256 GFLOPS / (989 TFLOPS × 0.55) ≈ 0.47 ms). Both batches are memory-bound, so max(totalComputeS, totalMemoryS) = totalMemoryS regardless of the penalty, and both ratios equal 1.0. The assertion ratio5050 >= ratio9010 becomes 1.0 >= 1.0 — trivially true. Deleting the entire MixedBatchPenalty block from roofline.go would not cause this test to fail.

Fix: add hc.BwPeakTBs = 1e6 and hcNoPenalty.BwPeakTBs = 1e6 in BC-3, matching the established technique from BC-1.

Fix this →

2. Flag parity tests don't cover new flags

Two tests in cmd/simconfig_shared_test.go maintain hardcoded flag lists:

  • TestRunCmd_SimConfigFlagsParity (line 76) — missing "mixed-batch-penalty" and "overlap-penalty"
  • TestBothCommands_SimConfigFlagsHaveIdenticalDefaults (line 163) — also missing both new flags

While the flags ARE correctly registered for both commands via registerSimConfigFlags (replay.go:303), these parity guards exist specifically to catch future regressions. Omitting the new flags creates a blind spot.

Fix this →

Suggestions

3. No test for MixedBatchPenalty on a memory-bound mixed batch (no-op confirmation)

BC-2 tests pure-prefill and pure-decode batches. It doesn't test the case where a mixed batch is memory-bound — the penalty should still be a no-op in this regime (the penalty applies to totalComputeS, which is not the bottleneck). This would be a useful regression guard in case someone accidentally applies the penalty to totalMemoryS instead.

4. OverlapPenalty = 1.0 boundary case has no behavioral test

At OverlapPenalty = 1.0, the formula reduces to c + m (additive rather than max). BC-7 validates that 1.0 is accepted, but no behavioral test checks the resulting step time equals totalComputeS + totalMemoryS. This is the most extreme user-facing case and a natural invariant to verify.

5. BC-6 is structurally coupled (per BDD/TDD standards)

TestRooflineStepTime_ZeroPenalties_BackwardCompatible reimplements calculateTransformerFlops, calculateMemoryAccessBytes, and clampToInt64 inline. Per the project's refactoring survival test ("Would this test still pass if the implementation were completely rewritten?") — no, it would fail. The behavioral contract ("zero penalties produce identical output to pre-#949") could be tested with a physics invariant instead: for a memory-bound step, result × peakBW ≈ totalBytes; compare against the existing TestRooflineStepTime_SingleCrossover_MemoryBoundDecode approach. This is low priority but noted for completeness.

Minor Documentation Gap

6. roofline.md Section 2 formula is outdated

Sections 2–4 of docs/concepts/roofline.md describe the mathematical formula as max(FLOPs/Peak, Bytes/BW). After this PR, the actual formula is max(c, m) + OverlapPenalty × min(c, m) for non-default configs. The new fields are correctly documented in the table (lines 104–105), but the conceptual formula in the overview section doesn't reflect the change.

Fix this →


Correction to Previous Review

The previous automated review identified three "Important" documentation gaps. All three are already present in this PR:

  • docs/concepts/roofline.md lines 104–105: ✅ both new fields listed in the table
  • docs/guide/latency-models.md line 70: ✅ "optional interference fields" added
  • docs/reference/configuration.md: ✅ both CLI flags added to the table

Standards Compliance

Rule Status Notes
R1: No silent data loss
R2: Map key sorting No map iteration in new code
R3: Numeric validation NaN/Inf/<0/>1 all rejected; NaN correctly passes != 0 guard
R4: Construction site audit Only error-return sites in production code; all zero-initialize correctly
R6: No logrus.Fatal in sim/
R9: *float64 for YAML zero-value ⚠️ float64 used; acceptable since 0.0 ≡ "not set" for these fields
R11: Division guard mfuScaleFactor ∈ [0.5, 1.0] proven; totalTokens guarded by non-empty slice condition
R18: CLI flag precedence Changed() guards on both flags

Verdict

Recommend fixing issues 1–2 before merge — BC-3 is currently vacuous (deleting the implementation wouldn't break it), and the parity test omission leaves a regression gap. Issues 3–6 can go to a follow-up.

@namasl
Copy link
Copy Markdown
Contributor Author

namasl commented Apr 7, 2026

@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness - including user and developer documentation.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 7, 2026

Claude finished @namasl's task in 3m 47s —— View job


PR Review: fix(latency): model prefill-decode interference in aggregated roofline

Tasks:


Summary

This is the third review pass. The first two automated reviews identified issues that the author has since resolved. I re-read all 11 changed files from scratch against the current HEAD. All important issues are resolved. The PR is ready to merge, with two residual low-priority suggestions and one new observation.


Correctness vs Issue #949

Gap 1 — Mixed-batch MFU degradation: ✅ Formula correct. Guard condition MixedBatchPenalty > 0 && len(PrefillRequests) > 0 && len(DecodeRequests) > 0 ensures exact backward compatibility at default 0. The variable is named mfuScaleFactor (fixed from penalty in previous review). Division guard proven in comment: mfuScaleFactor ∈ [0.5, 1.0] since MixedBatchPenalty ∈ [0,1] and minorityFraction ∈ [0, 0.5].

Gap 2 — Imperfect compute/memory overlap: ✅ Formula max(c,m) + OverlapPenalty × min(c,m) exactly matches the issue proposal. At OverlapPenalty=0 reduces to max(c,m) — exact backward compatibility.

Scope note: The penalties apply only to RooflineLatencyModel (--latency-model roofline). The TrainedRooflineLatencyModel (--latency-model trained-roofline) has its own learned corrections in trained_roofline.go and is not affected by these parameters. This is architecturally correct — the trained model's beta coefficients capture these effects through regression — but worth noting for users who configure hardware_config.json penalties and then switch backends.


Previous Review Issues — Resolution Status

All items flagged as "Important" in the second automated review have been resolved in the latest two commits:

Issue Status
BC-3 vacuous (memory-bound regime) ✅ Fixed: hc.BwPeakTBs = 1e6 added to both hc and hcNoPenalty in BC-3
Flag parity tests missing new flags ✅ Fixed: "mixed-batch-penalty" and "overlap-penalty" added to both lists in cmd/simconfig_shared_test.go
Formula docs outdated in roofline.md ✅ Fixed: conceptual section now documents both corrections with formulas
roofline.md table missing new fields ✅ Present: lines 109–110 in current file
latency-models.md missing optional fields ✅ Present: line 70 mentions interference fields
configuration.md missing CLI flags ✅ Present: both flags added to the table
Variable name penalty (inverted semantics) ✅ Fixed: renamed to mfuScaleFactor

Standards Compliance

Rule Status Notes
R1: No silent data loss
R3: Numeric validation NaN, Inf, <0, >1 all rejected. NaN != 0 is true in IEEE 754 so NaN correctly enters the validation block; IsNaN then catches it. -Inf < 0 and +Inf > 1 also caught by range checks.
R4: Construction site audit Both production construction sites (config.go:89,98) are error-return paths that zero-initialize correctly. All test sites either use blackbox backend or zero-initialize.
R6: No logrus.Fatal in sim/
R9: *float64 for YAML zero-value ambiguity N/A Hardware config uses encoding/json (non-strict, unknown fields ignored). 0.0 ≡ "not set" is semantically unambiguous here — no Inf/NaN values masquerade as "unset".
R11: Division guard mfuScaleFactor ∈ [0.5, 1.0] proven. totalTokens division guarded by non-empty slice precondition.
R18: CLI flag precedence Changed() guards on both flags in resolveLatencyConfig.

Test Coverage Assessment

All BC-1 through BC-7 tests are present and, after the BC-3 fix, non-vacuous:

Test Verdict
BC-1: penalty strictly increases compute-bound mixed batch ✅ Non-vacuous: BwPeakTBs=1e6 forces compute-bound
BC-2: pure-prefill and pure-decode unaffected ✅ Uses standard H100 — memory-bound, penalty correctly skipped by guard
BC-3: 50/50 ratio ≥ 90/10 ratio ✅ Fixed: both hc and hcNoPenalty now use BwPeakTBs=1e6
BC-4: overlap penalty increases step time for any non-empty batch ✅ Valid: min(compute,memory) > 0 always for non-empty batch
BC-5: overlap penalty monotonicity
BC-6: zero penalties reproduce original formula ✅ (structurally coupled — see suggestion below)
BC-7: validation rejects out-of-range and NaN/Inf ✅ 6 tests covering negative, >1, and NaN for each field; Inf caught by range tests

Pre-existing tests serve as implicit BC-6 coverage: all 235+ prior roofline tests use testHardwareCalib() which zero-initializes both new fields, continuously validating backward compatibility.


Remaining Suggestions (non-blocking)

1. BC-6 is structurally coupled to implementation internals

TestRooflineStepTime_ZeroPenalties_BackwardCompatible reimplements calculateTransformerFlops, calculateMemoryAccessBytes, and clampToInt64 inline to derive expected. Per the BDD/TDD principles ("Would this test still pass if the implementation were completely rewritten?"): no, it would fail if the internal FLOP formula changed, even if the observable behavior (physics) was preserved. The 235+ pre-existing tests already provide this backward-compatibility guarantee implicitly. This is a suggestion, not a blocker — the test provides strong regression coverage.

2. No behavioral test for OverlapPenalty = 1.0 boundary

At OverlapPenalty=1.0 the formula reduces to compute + memory (fully additive). BC-7 validates that 1.0 is accepted, but no test asserts the resulting step time equals totalComputeS + totalMemoryS. This is the most extreme user-configurable case and a natural physics invariant to verify. Not required for this PR's stated goals.


Verdict

Ready to merge. All critical and important issues from the two prior reviews have been resolved. The implementation is correct, backward-compatible, localized to rooflineStepTime and ValidateRooflineConfig, and well-tested. The two remaining suggestions above are follow-up candidates, not blockers.

@namasl namasl marked this pull request as draft April 8, 2026 13:38
namasl and others added 3 commits April 8, 2026 13:59
inference-sim#949)

Two gaps in rooflineStepTime() caused aggregated (colocated) mode to appear
better than real systems, masking the advantages of PD disaggregation:

Gap 1 — Mixed-batch MFU degradation: added MixedBatchPenalty to HardwareCalib.
When both prefill and decode tokens share a step, compute time is scaled by
1/(1 - penalty * minority_fraction), modeling GPU inefficiency from mixed
memory-access patterns and attention kernel tile-size mismatches.

Gap 2 — Imperfect compute/memory overlap: added OverlapPenalty to HardwareCalib.
Replaces max(compute, memory) with max(compute, memory) + penalty * min(compute,
memory), modeling SM scheduling contention between compute-bound prefill and
memory-bound decode work.

Both parameters default to 0.0 for exact backward compatibility. Validation
in ValidateRooflineConfig enforces [0, 1] range. CLI flags --mixed-batch-penalty
and --overlap-penalty override hardware_config.json values (R18 Changed() guard).
hardware_config.json updated with explicit 0.0 entries for all GPU profiles.

Behavioral contracts verified: BC-1 (penalty strictly increases compute-bound
mixed batches), BC-2 (no effect on pure-phase batches), BC-3 (larger minority
fraction → more penalty), BC-4/BC-5 (overlap penalty monotonicity), BC-6 (zero
penalties reproduce original max() formula exactly), BC-7 (validation rejects
out-of-range inputs).

Closes inference-sim#949

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…variable name

Add mixedBatchPenalty and overlapPenalty to hardware_config.json fields
table in roofline.md, latency-models.md supported hardware section, and
configuration.md CLI flags table. Rename local `penalty` variable to
`mfuScaleFactor` in rooflineStepTime for clarity since lower values mean
greater degradation (it's a divisor, not an additive penalty).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ula docs

- BC-3 (SymmetricMinority): set BwPeakTBs=1e6 on both hc and hcNoPenalty
  to force compute-bound regime, making the penalty ratio assertion
  non-trivial (was previously 1.0 >= 1.0 due to memory-bound batches)
- Add mixed-batch-penalty and overlap-penalty to both flag parity tests
  in simconfig_shared_test.go (TestRunCmd_SimConfigFlagsParity and
  TestBothCommands_SimConfigFlagsHaveIdenticalDefaults)
- Update roofline.md Section 2 formula to document the optional
  mixed-batch and overlap penalty corrections

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@namasl namasl force-pushed the fix/949-roofline-prefill-decode-interference branch from 3a25c4c to fa09485 Compare April 8, 2026 13:59
@namasl
Copy link
Copy Markdown
Contributor Author

namasl commented Apr 8, 2026

@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 8, 2026

Claude finished @namasl's task in 6m 0s —— View job


PR Review: fix(latency): model prefill-decode interference in aggregated roofline

Tasks:

  • Initialize review
  • Read changed files from current HEAD
  • Review correctness vs parent issue Roofline latency model underestimates prefill-decode interference in aggregated mode #949
  • Check standards compliance (rules.md, invariants.md, principles.md)
  • Review behavioral tests and non-structural tests
  • Check API boundaries, contracts, modularity
  • Check documentation correctness and completeness
  • Run tests (Go not available in this environment — reviewed by code inspection + CI)
  • Compile and post final review

Summary

Fourth review pass. All important issues from prior reviews have been resolved. The implementation is correct, backward-compatible, and well-tested. One new minor observation; two low-priority suggestions from review #3 remain open. Ready to merge.


Correctness vs Issue #949

Gap 1 — Mixed-batch MFU degradation (roofline.go:329-341): ✅

  • Formula: mfuScaleFactor = 1.0 - MixedBatchPenalty × minorityFraction; divides totalComputeS.
  • Guard: MixedBatchPenalty > 0 && len(PrefillRequests) > 0 && len(DecodeRequests) > 0 — block fully skipped at default 0.
  • Range proof in comment: mfuScaleFactor ∈ [0.5, 1.0] since MixedBatchPenalty ∈ [0,1] and minorityFraction ∈ [0, 0.5]. No division-by-zero possible.
  • Variable name mfuScaleFactor (not the confusing penalty from earlier drafts) — correct.

Gap 2 — Imperfect compute/memory overlap (roofline.go:361): ✅

  • Formula: max(c,m) + OverlapPenalty × min(c,m) — exact match to issue proposal.
  • At OverlapPenalty=0: reduces to max(c,m) — exact backward compatibility.

Scope boundary (note for reviewers): Both penalties apply only to --latency-model roofline. The trained-roofline backend learns its own correction coefficients (the beta array) through regression — it correctly does not use these parameters. This is architecturally sound but worth noting for users who configure hardware_config.json penalties and then switch backends.


Resolution Status of Prior Review Issues

Issue Status
BC-3 vacuous (memory-bound regime) ✅ Fixed: BwPeakTBs = 1e6 on both hc and hcNoPenalty
Flag parity tests missing new flags ✅ Fixed: both flags in simconfig_shared_test.go:80,171
roofline.md formula section outdated ✅ Fixed: Section 2 documents both corrections with formulas
roofline.md table missing new fields ✅ Present: lines 111–112
latency-models.md missing interference fields ✅ Present: line 75
configuration.md missing CLI flags ✅ Present: lines 159–160
Variable name penalty (inverted semantics) ✅ Fixed: renamed to mfuScaleFactor

Standards Compliance

Rule Status Notes
R3: Numeric validation NaN, Inf, negative, >1 all rejected. NaN != 0 is true in IEEE 754 — NaN correctly enters validation block; IsNaN catches it.
R4: Construction site audit Both production HardwareCalib{} sites are error-return paths that zero-initialize correctly. No missed sites.
R6: No logrus.Fatal in sim/
R9: *float64 for YAML zero-value N/A Uses encoding/json (lenient), not strict YAML. 0.0 ≡ "not set" is unambiguous.
R11: Division guard mfuScaleFactor ∈ [0.5, 1.0] proven. totalTokens guarded by non-empty slice precondition on the outer if.
R18: CLI flag precedence Changed() guards at root.go:685,688.

Hardware config backward compatibility: encoding/json zero-initializes missing fields. Existing hardware_config.json files without mixedBatchPenalty/overlapPenalty will parse correctly with both = 0.0 (no-penalty behavior). ✅


Test Coverage

Test Status
BC-1: Penalty strictly increases compute-bound mixed batch ✅ Non-vacuous: BwPeakTBs=1e6 forces compute-bound
BC-2: Pure-prefill/decode unaffected ✅ Guard condition skips block entirely
BC-3: 50/50 ratio ≥ 90/10 ratio ✅ Fixed: both hc and hcNoPenalty use BwPeakTBs=1e6
BC-4: OverlapPenalty increases step time for non-empty batch min(compute,memory) > 0 guaranteed
BC-5: OverlapPenalty monotonicity
BC-6: Zero penalties reproduce original formula ✅ (structurally coupled — see suggestion below)
BC-7: Validation rejects negative, >1, NaN for both fields ✅ 6 reject tests + 1 valid-boundary table test

Pre-existing 235+ roofline tests (using testHardwareCalib() with zero-initialized new fields) serve as continuous BC-6-equivalent regression coverage.


New Observation

CLI validation pathway for out-of-range penalty flags

If a user passes --mixed-batch-penalty 1.5, the Changed() guard sets hwConfig.MixedBatchPenalty = 1.5 without early CLI-side validation. The value is caught downstream: ValidateRooflineConfig inside NewLatencyModel returns an error → NewInstanceSimulator panics with a descriptive message: "NewInstanceSimulator(…): NewLatencyModel: latency model: invalid roofline config: HardwareCalib.MixedBatchPenalty must be in [0, 1] when set, got 1.5".

This is consistent with how other NewInstanceSimulator failures behave. The panic message is clear. An early logrus.Fatalf in resolveLatencyConfig (matching the alpha/beta validation pattern at root.go:333-339) would be more user-friendly, but this is a minor polish item, not a blocker.


Remaining Low-Priority Suggestions (unchanged from review #3)

1. BC-6 is structurally coupled to implementation internals

TestRooflineStepTime_ZeroPenalties_BackwardCompatible (roofline_test.go:1447) reimplements calculateTransformerFlops, calculateMemoryAccessBytes, and clampToInt64 inline. Per BDD/TDD standards ("Would this test still pass if the implementation were completely rewritten?"): no. The 235+ pre-existing tests already provide the backward-compatibility guarantee implicitly. Consider replacing with a simpler physics invariant instead. Not a blocker.

2. No behavioral test for OverlapPenalty = 1.0 boundary

At OverlapPenalty=1.0, the formula reduces to compute + memory (fully additive). BC-7 validates that 1.0 is accepted, but no test asserts that the resulting step time equals totalComputeS + totalMemoryS. This is the most extreme user-configurable case and a natural physics invariant. Not a blocker.


Verdict

Ready to merge. The implementation is correct, backward-compatible, localized to rooflineStepTime and ValidateRooflineConfig, and all critical/important issues from prior reviews have been resolved. The two remaining suggestions above and the new CLI-validation observation are polish items for a follow-up, not blockers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Roofline latency model underestimates prefill-decode interference in aggregated mode

1 participant