fix(cluster): PD TTFT includes KV transfer and first decode step#931
fix(cluster): PD TTFT includes KV transfer and first decode step#931namasl wants to merge 7 commits intoinference-sim:mainfrom
Conversation
|
@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness. |
|
Claude finished @namasl's task in 13m 12s —— View job PR Review: fix(cluster): PD TTFT includes KV transfer and first decode stepTasks
Verdict: Approve with suggestionsThe fix is correct and addresses the issue accurately. The implementation is clean, the formula is sound, and the core behavioral contracts are tested. A few test-quality improvements are worth considering before merge. Correctness AssessmentDoes the PR tackle the parent issue correctly? Yes. The root cause (issue #930) was that The implementation at TTFTSum adjustment (BC-3): R1 compliance (no silent data loss): Satisfied. Defensive fallback: Appropriate. The Invariants: INV-5 (causality: TTFT ≤ E2E) is structurally preserved — TTFT covers Issues FoundImportant1. BC-3 (TTFTSum consistency) is not directly tested [ The new test verifies BC-1/BC-2/BC-4 but not BC-3. After Suggested addition to // BC-3: TTFTSum must be consistent with RequestTTFTs.
var manualSum float64
for _, v := range m.RequestTTFTs {
manualSum += v
}
if math.Abs(float64(m.TTFTSum)-manualSum) > 1.0 { // 1 tick tolerance for rounding
t.Errorf("BC-3: TTFTSum (%d) != sum(RequestTTFTs) (%.1f)", m.TTFTSum, manualSum)
}2. Defensive fallback path is untested [ The Suggestions3. Use // Current (structural — accesses unexported field):
for _, parent := range cs.parentRequests {
// Better (behavioral — uses exported API, follows newer test pattern in the same file):
for _, parent := range cs.ParentRequests() {The test file is in 4. BC-4 check should assert E2E presence unconditionally [ // Current: silently skips causality check if E2E absent
e2e, hasE2E := m.RequestE2Es[pid]
if hasE2E && ttft > e2e {
// Better: missing E2E is itself a violation for a completed parent
e2e, hasE2E := m.RequestE2Es[pid]
if !hasE2E {
t.Errorf("BC-4: parent %s: E2E missing from RequestE2Es for completed parent", pid)
} else if ttft > e2e {
t.Errorf("BC-4: parent %s: TTFT (%.1f) > E2E (%.1f), causality violated", pid, ttft, e2e)
}If Separation of Concerns / ArchitectureNo concerns. The change is confined to PerformanceNo concerns. The hotpath is unchanged. DocumentationThe inline comment at |
|
@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness. |
|
Claude finished @namasl's task in 4m 10s —— View job PR Review: fix(cluster): PD TTFT includes KV transfer and first decode stepTasks
Verdict: Approve — one test naming issue to fixThe fix is correct, well-scoped, and addresses issue #930 accurately. The previous round of review feedback (BC-3 test, BC-4 E2E assertion, CorrectnessDoes this PR tackle the parent issue correctly? Yes. The root cause was that The implementation at BC-3 (TTFTSum consistency): R1 (no silent data loss): Satisfied. R6 (no logrus.Fatal in sim/): Clean. The new warning uses INV-PD-6 (metric map parent granularity): Maintained. Both INV-5 (causality: TTFT ≤ E2E): Structurally preserved. TTFT covers INV-PD-4 (phase causality): int64 truncation: Issues FoundImportant
The test name promises to verify the The test is useful — it verifies R1 (no silent data loss) for the primary path — but it doesn't exercise the fallback code path at all. There are two options: Option A (rename — minimal change): Rename to reflect what it actually tests: // TestDisaggregation_TTFT_NoSilentDrops verifies R1: all completed PD parents
// have a TTFT entry after projection, regardless of which branch fired.
func TestDisaggregation_TTFT_NoSilentDrops(t *testing.T) {Option B (inject fallback directly — stronger): Since the fallback cannot be triggered via a well-formed cluster simulation, test it by calling func TestDisaggregation_TTFT_FallbackWhenDecodeDataMissing(t *testing.T) {
// Build a minimal merged metrics map with a prefill TTFT but no transfer data.
parent := &ParentRequest{
ID: "p1", PrefillSubReqID: "p1_prefill", DecodeSubReqID: "p1_decode",
CompletionTime: 1000, ArrivalTime: 0,
DecodeInstanceID: "inst-0",
TransferStartTime: 0, TransferCompleteTime: 0, // missing transfer data
}
m := &sim.RawMetrics{...}
m.RequestTTFTs["p1_prefill"] = 500.0
// call projectPDMetrics with this parent...
// assert RequestTTFTs["p1"] == 500.0 (fallback value)
}Option A is the minimum fix to close the misleading name. Option B gives actual coverage of the branch. Fix this → Standards Compliance Checklist (R1–R23)
Behavioral Tests AssessmentWhat's covered:
What's not covered:
Both uncovered branches are edge/error paths, not the happy path — the core fix is solid. Separation of Concerns / ArchitectureNo concerns. The change is entirely within PerformanceNo concerns. DocumentationThe inline comment at Summary: Fix is correct and complete. The second commit addressed all prior review feedback. The one remaining action item is fixing the fallback test name (Option A, one-line change) or adding actual fallback coverage (Option B). |
|
@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness. |
|
Claude finished @namasl's task in 5m 28s —— View job PR Review: fix(cluster): PD TTFT includes KV transfer and first decode stepTasks
Verdict: Approve — two minor hygiene itemsThis is the third review pass. The fix is correct and the tests are now solid. Both substantive issues from round 2 have been resolved:
Two small hygiene items remain. CorrectnessDoes this PR tackle issue #930 correctly? Yes. The root cause was that is implemented exactly at TTFTSum adjustment ( R1 compliance: R6 compliance: New warning path uses INV-5 (causality TTFT ≤ E2E): Structurally preserved — TTFT covers INV-6 (determinism): The outer loop TTFT distribution output: Computed from sorted Standards Compliance (R1–R23)
Issues FoundHygiene1. BC-3 test accumulates var manualSum float64
for _, v := range m.RequestTTFTs { // ← unsorted map iteration (R2)
manualSum += v
}Technically violates R2 (unsorted map iteration feeding a float sum). In practice, all values are integer-valued float64s (derived from int64 timestamps), so float addition order doesn't matter and the result is deterministic. The previous round 2 review flagged this as " for _, k := range sortedKeys(m.RequestTTFTs) {
manualSum += m.RequestTTFTs[k]
}Not blocking, but worth fixing for pattern consistency since the project's standards take R2 seriously. Fix this → 2. Decode sub-request key assertion in fallback test is vacuously true [ if _, exists := m.RequestTTFTs[tc.parent.DecodeSubReqID]; exists {
t.Errorf("INV-PD-6: decode sub-request key %s still present", tc.parent.DecodeSubReqID)
}The fallback test pre-populates only Behavioral Tests Coverage
Separation of Concerns / ArchitectureNo concerns. Change is entirely within PerformanceNo concerns. DocumentationInline comment at Summary: Fix is correct and complete. Both issues from round 2 are resolved. The two remaining items (R2 in BC-3 test, vacuous decode-key assertion in fallback test) are hygiene-level and non-blocking. |
|
@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness. |
|
Claude finished @namasl's task in 10s —— View job PR Review: fix(cluster): PD TTFT includes KV transfer and first decode stepTasks
Verdict: Approve — one documentation fix needed before mergeThis is the fourth review pass. The fix is correct and the two round 3 hygiene items are addressed. One documentation accuracy issue discovered in a file outside the diff that this PR leaves stale. CorrectnessDoes this PR tackle issue #930 correctly? Yes. The formula R1: R6: Both warning paths use int64 truncation: INV-5 (causality): INV-6 (determinism): The outer loop iterates BC-3 (TTFTSum consistency): INV-PD-4 (transfer start ≤ complete): Verified in Issues FoundImportant (fix before merge)Stale comment in // pd_metrics.go:113–116 (CURRENT — now incorrect)
// BC-1: parent TTFT from the aggregated TTFT map, keyed by parent ID.
// After projectPDMetrics(), the prefill sub-request TTFT has been rekeyed
// from PrefillSubReqID to the parent ID. The value is unchanged: first_token_time -
// arrival_time, recorded when ProgressIndex == inputLen (prefill completion)."The value is unchanged" is now wrong. After this PR, the projected value is Suggested replacement: // BC-1: parent TTFT from the aggregated TTFT map, keyed by parent ID.
// After projectPDMetrics(), the user-visible TTFT is keyed by the parent ID.
// Value = prefillTTFT + KV transfer duration + first decode step — the full
// time from arrival to the first token reaching the user (issue #930).
// Missing key returns 0.0; exclude 0.0 values (BC-11).HygieneGuard inconsistency between
// cluster.go:1193 — current
if hasPrefillTTFT && parent.TransferCompleteTime > 0 && parent.DecodeSubReq != nil && len(parent.DecodeSubReq.ITL) > 0 {
// Suggested — aligned with pd_metrics.go:124
if hasPrefillTTFT && parent.TransferStartTime > 0 && parent.TransferCompleteTime >= parent.TransferStartTime && parent.DecodeSubReq != nil && len(parent.DecodeSubReq.ITL) > 0 {Standards Compliance (R1–R23)
Behavioral Test Coverage
Tests are behavioral (observable post-conditions via public APIs), meet the refactor survival criterion, and cover all branches including the defensive fallback. The fallback test correctly exercises the unreachable-from-well-formed-sim branch via synthetic struct injection. No critical gaps. Separation of Concerns / ArchitectureNo concerns. The change is confined to PerformanceNo concerns. Documentation (PR-level)Inline comment at Summary: Fix is correct and test coverage is comprehensive. The only action item before merge is updating the stale comment in |
|
@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness. |
|
Claude finished @namasl's task in 8m 21s —— View job PR Review: fix(cluster): PD TTFT includes KV transfer and first decode stepTasks
Verdict: Approve — two advisory itemsThis is round 5. All prior round feedback (R2 sortedKeys, BC-2 non-triviality guards, decode-key pre-population in fallback test, BC-4 hard failure on missing E2E) has been addressed correctly. The implementation is correct and the test suite is solid. Two advisory items remain, neither blocking. CorrectnessDoes this PR tackle issue #930 correctly? Yes. The formula R1/R2/R6/R7/R14: All pass. See prior rounds; no regressions introduced. Advisory Items1. Guard inconsistency:
|
| Rule | Status | Notes |
|---|---|---|
| R1 (no silent data loss) | ✅ | Read before delete; all branches warn or produce valid output |
| R2 (sort map keys before float accumulation) | ✅ | BC-3 test uses sortedKeys(m.RequestTTFTs) |
| R6 (no logrus.Fatal in sim/) | ✅ | All new log calls use Warnf |
| R7 (invariant tests alongside golden tests) | ✅ | Three new invariant tests |
| R14 (no multi-module methods) | ✅ | projectPDMetrics single-concern |
| Others | ✅ | No new config, interfaces, or YAML |
Behavioral Tests Coverage
| Contract | Test | Status |
|---|---|---|
| BC-1 (exact formula) | TestDisaggregation_TTFT_IncludesTransferAndDecode |
✅ |
| BC-2 (TTFT > prefill-only, with non-triviality guards) | TestDisaggregation_TTFT_IncludesTransferAndDecode |
✅ |
| BC-3 (TTFTSum consistency, sortedKeys) | TestDisaggregation_TTFT_IncludesTransferAndDecode |
✅ |
| BC-4 (TTFT ≤ E2E, hard failure on missing E2E) | TestDisaggregation_TTFT_IncludesTransferAndDecode |
✅ |
| BC-5 (non-PD unaffected) | TestDisaggregation_MetricProjection_NoOp (pre-existing) |
✅ |
| R1 (no silent drops) | TestDisaggregation_TTFT_NoSilentDrops |
✅ |
| Fallback: TransferCompleteTime=0 | TestDisaggregation_TTFT_FallbackWhenDecodeDataMissing |
✅ |
| Fallback: empty ITL | TestDisaggregation_TTFT_FallbackWhenDecodeDataMissing |
✅ |
| Fallback: nil DecodeSubReq | TestDisaggregation_TTFT_FallbackWhenDecodeDataMissing |
✅ |
| Fallback: no prefill TTFT key (Branch C) | — |
Performance / Architecture / Documentation
No concerns. projectPDMetrics() is a teardown-only projection step. Module boundaries respected. Inline comment at cluster.go:1181–1187 is accurate. PR behavioral contracts map cleanly to the implementation.
Summary: Fix is correct and complete. All prior round feedback has been addressed. The two remaining advisory items (guard consistency and Branch C coverage) are non-blocking hygiene. The PR is ready to merge.
|
@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness. |
|
Claude finished @namasl's task in 2m 39s —— View job PR Review: fix(cluster): PD TTFT includes KV transfer and first decode stepTasks
Verdict: Approve — one stale comment from round 4 still needs fixingRound 5's two advisory items (guard inconsistency, Branch C coverage) have been addressed correctly. The only remaining item is a stale comment in What was fixed since round 5Advisory 1 (guard inconsistency) — FIXED ✅ [ The guard now matches the stronger predicate established in if hasPrefillTTFT && parent.TransferStartTime > 0 && parent.TransferCompleteTime >= parent.TransferStartTime && parent.DecodeSubReq != nil && len(parent.DecodeSubReq.ITL) > 0 {If Advisory 2 (Branch C coverage) — FIXED ✅ [ The
Outstanding IssueImportant (fix before merge)Stale comment in
// BC-1: parent TTFT from the aggregated TTFT map, keyed by parent ID.
// After projectPDMetrics(), the prefill sub-request TTFT has been rekeyed
// from PrefillSubReqID to the parent ID. The value is unchanged: first_token_time -
// arrival_time, recorded when ProgressIndex == inputLen (prefill completion).
// Missing key returns 0.0 in Go maps; exclude 0.0 values (BC-11)."The value is unchanged" and "prefill completion" are both incorrect after this PR. The projected value now includes KV transfer duration and the first decode step. Any future reader of Suggested replacement: // BC-1: parent TTFT from the aggregated TTFT map, keyed by parent ID.
// After projectPDMetrics(), the user-visible TTFT is stored under the parent ID.
// Value = prefillTTFT + KV transfer duration + first decode step — the full
// time from arrival to the first token reaching the user (issue #930).
// Missing key returns 0.0 in Go maps; exclude 0.0 values (BC-11).Correctness SummaryThe fix is correct and unchanged from prior rounds:
Test Coverage Summary
All branches are exercised. The fallback test correctly pre-populates the decode sub-request key so the INV-PD-6 deletion assertions are non-trivial for all cases. Summary: Fix is correct, tests are comprehensive, and all round 5 advisory items have been addressed. The only remaining action item is updating the stale comment in |
After PR inference-sim#931 the projected TTFT value includes KV transfer duration and first decode step, not just the rekeyed prefill TTFT. Update the comment at the call site to reflect the corrected semantics (issue inference-sim#930). Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
|
@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness. |
|
Claude finished @namasl's task in 3m 16s —— View job PR Review: fix(cluster): PD TTFT includes KV transfer and first decode stepTasks
Verdict: Approve — ready to mergeThis is round 7. All prior round feedback has been addressed. The round 6 "Important (fix before merge)" item — the stale comment in What changed since round 6Stale comment in Old (incorrect after this PR): // After projectPDMetrics(), the prefill sub-request TTFT has been rekeyed
// from PrefillSubReqID to the parent ID. The value is unchanged: first_token_time -
// arrival_time, recorded when ProgressIndex == inputLen (prefill completion).New (correct): // After projectPDMetrics(), the user-visible TTFT is stored under the parent ID.
// Value = prefillTTFT + KV transfer duration + first decode step — the full
// time from arrival to the first token reaching the user (issue #930).This was the final blocker identified in round 4 and re-flagged in round 6. It's resolved. Correctness (unchanged from prior rounds, confirmed still holds)Does this PR tackle issue #930 correctly? Yes. Formula at All three components are sequential and additive in the llm-d PD pipeline. TTFTSum adjustment ( R1 (read before delete): R6 (no INV-5 (causality TTFT ≤ E2E): Structurally preserved — TTFT covers Guard consistency (round 5 advisory #1): Standards Compliance (R1–R23)
Behavioral Test Coverage
All branches are exercised. Tests are behavioral, use exported APIs ( Separation of Concerns / ArchitectureNo concerns. The change is entirely within PerformanceNo concerns. Documentation
Summary: The fix is correct, complete, and well-tested. All seven rounds of review feedback have been addressed. No remaining action items. |
…erence-sim#930) In PD disaggregated mode, user-visible TTFT was understated because it only captured prefill sub-request completion time. In llm-d, the first token reaches the user from the decode pod — after prefill, KV transfer, and one decode step. Update projectPDMetrics() to compute user-visible TTFT as prefillTTFT + transferDuration + firstDecodeStep. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add BC-3 test: TTFTSum consistency with sum(RequestTTFTs) after projection - Add fallback path test: completed parents always have TTFT entries (R1) - Use cs.ParentRequests() instead of cs.parentRequests (exported API) - Assert E2E presence unconditionally in BC-4 causality check Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The fallback test was renamed/split: TestDisaggregation_TTFT_NoSilentDrops verifies R1 (no dropped entries) via a normal simulation, while TestDisaggregation_TTFT_FallbackWhenDecodeDataMissing now directly exercises the defensive fallback in projectPDMetrics by constructing synthetic ParentRequests with TransferCompleteTime=0, empty ITL, and nil DecodeSubReq. Table-driven with three cases. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- BC-3 test: iterate RequestTTFTs with sortedKeys for R2 compliance - Fallback test: pre-populate decode sub-request key so INV-PD-6 deletion assertion is non-trivial Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- BC-2: add explicit non-triviality guards for transferDuration and firstDecodeStep so a config change zeroing either addend causes a clear test failure rather than a vacuously-passing assertion - BC-3: add comment explaining why tolerance is 1.0 µs (TTFTSum is int64-truncated) rather than 1e-9 used elsewhere in the file Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Align happy-path guard with established pd_metrics.go pattern (TransferStartTime > 0 && TransferCompleteTime >= TransferStartTime) to prevent silent TTFTSum corruption on invariant violation. Add Branch C coverage to FallbackWhenDecodeDataMissing: fourth sub-case exercises the else path where a completed parent has no prefill TTFT key, asserting no entry is written and INV-PD-6 holds. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
After PR inference-sim#931 the projected TTFT value includes KV transfer duration and first decode step, not just the rekeyed prefill TTFT. Update the comment at the call site to reflect the corrected semantics (issue inference-sim#930). Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
c47eb95 to
0601db8
Compare
|
@claude /pr-review-toolkit:review-pr does this pr tackle the parent issue correctly? Validate correctness, ease of use, separation of concerns, adherence to our @docs/contributing/standards, invariants, behavioral discipline, behavioral tests and non-structural tests, modularity, clean api boundaries and contracts, performance implications, documentation correctness and completeness. |
|
Claude finished @namasl's task in 3m 43s —— View job PR Review: fix(cluster): PD TTFT includes KV transfer and first decode stepTasks
Verdict: Approve — ready to mergeThis is round 8. The most recent commit ( What changed since round 7Stale
// BC-1: parent TTFT from the aggregated TTFT map, keyed by parent ID.
// After projectPDMetrics(), the user-visible TTFT is stored under the parent ID.
// Value = prefillTTFT + KV transfer duration + first decode step — the full
// time from arrival to the first token reaching the user (issue #930).
// Missing key returns 0.0 in Go maps; exclude 0.0 values (BC-11).The previously incorrect "The value is unchanged: first_token_time - arrival_time, recorded when ProgressIndex == inputLen (prefill completion)" is gone. Correctness (confirmed against current code)Does this PR tackle issue #930 correctly? Yes. Formula at All three components are sequential and additive in the llm-d PD pipeline. This is the first user-visible token path: prefill pod completes → KV transfer → decode pod emits first output token.
Standards Compliance (R1–R23)
Behavioral Test Coverage
Tests use Separation of Concerns / ArchitectureNo concerns. The change is entirely within PerformanceNo concerns. Two float64 additions and one int64 cast per parent request at teardown. ✓ Documentation
Summary: Fix is correct, complete, and well-tested. All eight rounds of review feedback have been addressed across seven commits. No remaining action items. The PR is ready to merge. |
Summary
projectPDMetrics()to compute user-visible TTFT asprefillTTFT + transferDuration + firstDecodeStepTTFTSumadjusted by the delta to maintain consistency (BC-3)Closes #930
Behavioral Contracts
RequestTTFTs[parentID] = prefillTTFT + (TransferCompleteTime - TransferStartTime) + DecodeSubReq.ITL[0]TTFTSumadjusted by delta to stay consistent withRequestTTFTsTTFT <= E2E(causality preserved)projectPDMetricsis a no-op whenparentRequestsis empty)Test plan
TestDisaggregation_TTFT_IncludesTransferAndDecode— verifies BC-1 (exact formula), BC-2 (exceeds prefill-only), BC-4 (causality)go test ./...passesgo vet ./...clean🤖 Generated with Claude Code