fix(latency): model prefill-decode interference in aggregated roofline#951
fix(latency): model prefill-decode interference in aggregated roofline#951namasl wants to merge 3 commits intoinference-sim:mainfrom
Conversation
|
@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness - including user and developer documentation. |
|
Claude finished @namasl's task in 5m 23s —— View job PR Review: fix(latency): model prefill-decode interference in aggregated rooflineTasks:
SummaryThe PR correctly addresses both gaps identified in issue #949 with clean, localized changes. Core implementation and test plan are solid. Three documentation gaps need attention before merge. Correctness vs Issue #949Gap 1 — Mixed-batch MFU degradation: ✅ Correctly implemented. The formula Gap 2 — Imperfect compute/memory overlap: ✅ Correctly implemented. R11 (division guard): ✅ R18 (CLI flag precedence): ✅ Both flags use R4 (construction site audit): ✅ Issues FoundImportant1. Lines 99–105 list
2. The "Supported hardware" section states that the required fields are 3. The configuration reference has a table of CLI flags. Suggestions4. Variable naming: In // Before
penalty := 1.0 - hwConfig.MixedBatchPenalty * minorityFraction
totalComputeS /= penalty
// Clearer
mfuEfficiencyFactor := 1.0 - hwConfig.MixedBatchPenalty * minorityFraction
totalComputeS /= mfuEfficiencyFactorThe comment on lines 283–284 partially compensates but the variable name itself will confuse the next reader who isn't expecting a divisor. 5. BC-6 test is structurally coupled to formula internals
A simpler behavioral alternative: capture a known-good expected value (golden-style) or assert that zero-penalty results are Standards Compliance Summary
Test Plan CoverageBC-1 through BC-7 tests are all present and cover the stated behavioral contracts:
The VerdictReady to merge after addressing documentation gaps 1–3. The implementation is correct, backward-compatible, localized to |
|
@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness - including user and developer documentation. |
|
Claude finished @namasl's task in 1m 17s —— View job PR Review: fix(latency): model prefill-decode interference in aggregated rooflineTasks:
SummaryThe PR correctly addresses both gaps from issue #949 with clean, localized changes. The implementation is sound and backward-compatible. The previous automated review incorrectly flagged three documentation gaps that are all already present in this PR's diff. Three test quality issues need attention — one of which makes a core test vacuous. Correctness vs Issue #949Gap 1 — Mixed-batch MFU degradation: ✅ Gap 2 — Imperfect compute/memory overlap: ✅ R11 (division guard): ✅ R18 (CLI flag precedence): ✅ Both flags use R4 (construction site audit): ✅ Only two Issues FoundImportant — Test Quality1. BC-3 is vacuous: the assertion is never exercised
Fix: add 2. Flag parity tests don't cover new flags Two tests in
While the flags ARE correctly registered for both commands via Suggestions3. No test for BC-2 tests pure-prefill and pure-decode batches. It doesn't test the case where a mixed batch is memory-bound — the penalty should still be a no-op in this regime (the penalty applies to 4. At 5. BC-6 is structurally coupled (per BDD/TDD standards)
Minor Documentation Gap6. Sections 2–4 of Correction to Previous ReviewThe previous automated review identified three "Important" documentation gaps. All three are already present in this PR:
Standards Compliance
VerdictRecommend fixing issues 1–2 before merge — BC-3 is currently vacuous (deleting the implementation wouldn't break it), and the parity test omission leaves a regression gap. Issues 3–6 can go to a follow-up. |
|
@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness - including user and developer documentation. |
|
Claude finished @namasl's task in 3m 47s —— View job PR Review: fix(latency): model prefill-decode interference in aggregated rooflineTasks:
SummaryThis is the third review pass. The first two automated reviews identified issues that the author has since resolved. I re-read all 11 changed files from scratch against the current HEAD. All important issues are resolved. The PR is ready to merge, with two residual low-priority suggestions and one new observation. Correctness vs Issue #949Gap 1 — Mixed-batch MFU degradation: ✅ Formula correct. Guard condition Gap 2 — Imperfect compute/memory overlap: ✅ Formula Scope note: The penalties apply only to Previous Review Issues — Resolution StatusAll items flagged as "Important" in the second automated review have been resolved in the latest two commits:
Standards Compliance
Test Coverage AssessmentAll BC-1 through BC-7 tests are present and, after the BC-3 fix, non-vacuous:
Pre-existing tests serve as implicit BC-6 coverage: all 235+ prior roofline tests use Remaining Suggestions (non-blocking)1. BC-6 is structurally coupled to implementation internals
2. No behavioral test for At VerdictReady to merge. All critical and important issues from the two prior reviews have been resolved. The implementation is correct, backward-compatible, localized to |
inference-sim#949) Two gaps in rooflineStepTime() caused aggregated (colocated) mode to appear better than real systems, masking the advantages of PD disaggregation: Gap 1 — Mixed-batch MFU degradation: added MixedBatchPenalty to HardwareCalib. When both prefill and decode tokens share a step, compute time is scaled by 1/(1 - penalty * minority_fraction), modeling GPU inefficiency from mixed memory-access patterns and attention kernel tile-size mismatches. Gap 2 — Imperfect compute/memory overlap: added OverlapPenalty to HardwareCalib. Replaces max(compute, memory) with max(compute, memory) + penalty * min(compute, memory), modeling SM scheduling contention between compute-bound prefill and memory-bound decode work. Both parameters default to 0.0 for exact backward compatibility. Validation in ValidateRooflineConfig enforces [0, 1] range. CLI flags --mixed-batch-penalty and --overlap-penalty override hardware_config.json values (R18 Changed() guard). hardware_config.json updated with explicit 0.0 entries for all GPU profiles. Behavioral contracts verified: BC-1 (penalty strictly increases compute-bound mixed batches), BC-2 (no effect on pure-phase batches), BC-3 (larger minority fraction → more penalty), BC-4/BC-5 (overlap penalty monotonicity), BC-6 (zero penalties reproduce original max() formula exactly), BC-7 (validation rejects out-of-range inputs). Closes inference-sim#949 Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…variable name Add mixedBatchPenalty and overlapPenalty to hardware_config.json fields table in roofline.md, latency-models.md supported hardware section, and configuration.md CLI flags table. Rename local `penalty` variable to `mfuScaleFactor` in rooflineStepTime for clarity since lower values mean greater degradation (it's a divisor, not an additive penalty). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ula docs - BC-3 (SymmetricMinority): set BwPeakTBs=1e6 on both hc and hcNoPenalty to force compute-bound regime, making the penalty ratio assertion non-trivial (was previously 1.0 >= 1.0 due to memory-bound batches) - Add mixed-batch-penalty and overlap-penalty to both flag parity tests in simconfig_shared_test.go (TestRunCmd_SimConfigFlagsParity and TestBothCommands_SimConfigFlagsHaveIdenticalDefaults) - Update roofline.md Section 2 formula to document the optional mixed-batch and overlap penalty corrections Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
3a25c4c to
fa09485
Compare
|
@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness. |
|
Claude finished @namasl's task in 6m 0s —— View job PR Review: fix(latency): model prefill-decode interference in aggregated rooflineTasks:
SummaryFourth review pass. All important issues from prior reviews have been resolved. The implementation is correct, backward-compatible, and well-tested. One new minor observation; two low-priority suggestions from review #3 remain open. Ready to merge. Correctness vs Issue #949Gap 1 — Mixed-batch MFU degradation (
Gap 2 — Imperfect compute/memory overlap (
Scope boundary (note for reviewers): Both penalties apply only to Resolution Status of Prior Review Issues
Standards Compliance
Hardware config backward compatibility: Test Coverage
Pre-existing 235+ roofline tests (using New ObservationCLI validation pathway for out-of-range penalty flags If a user passes This is consistent with how other Remaining Low-Priority Suggestions (unchanged from review #3)1. BC-6 is structurally coupled to implementation internals
2. No behavioral test for At VerdictReady to merge. The implementation is correct, backward-compatible, localized to |
Summary
MixedBatchPenaltytoHardwareCalib: scales effective MFU down by the minority-phase fraction when prefill and decode tokens share a step, modeling GPU inefficiency from mixed memory-access patterns and attention tile-size mismatchesOverlapPenaltytoHardwareCalib: replacesmax(compute, memory)withmax(compute, memory) + penalty × min(compute, memory), modeling imperfect compute/memory overlap from SM scheduling contention0.0— zero-change behavior for all existing configs and tests--mixed-batch-penaltyand--overlap-penaltywithChanged()guards (R18) allow per-run override ofhardware_config.jsonvaluesValidateRooflineConfigrejects values outside[0, 1]Closes #949
Test plan
BC-1:MixedBatchPenalty > 0strictly increases step time for compute-bound mixed batchesBC-2:MixedBatchPenaltyhas no effect on pure-prefill or pure-decode batchesBC-3: larger minority fraction (50/50 vs 90/10) produces a higher penalty ratioBC-4:OverlapPenalty > 0strictly increases step time for any non-empty batchBC-5: higherOverlapPenaltyproduces greater or equal step time (monotonicity)BC-6: both penalties at0.0reproduce the originalmax(compute, memory)result exactlyBC-7:ValidateRooflineConfigrejects negative,> 1,NaN, andInfvalues for both fieldsgo test ./...)🤖 Generated with Claude Code