feat: direct-to-decode decider for short-prompt bypass (PD Phase 2, PR4) #650
Merged
namasl merged 15 commits into inference-sim:pd on Mar 14, 2026
Routes short prompts (len(InputTokens) < threshold) directly to the decode pool, skipping disaggregation. Long prompts continue through the full prefill→transfer→decode pipeline.
- BC-P2-18: len(InputTokens) < threshold → Disaggregate=false
- BC-P2-19: empty input → Disaggregate=false
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
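The two contracts above (BC-P2-18, BC-P2-19) can be sketched as a small Go decider. This is a minimal illustration, not the code merged in this PR: the Request and Decision types are hypothetical stand-ins, and the panic on a negative threshold is an assumption that mirrors the flag validation described in a later commit.

```go
package main

import "fmt"

// Request is a hypothetical stand-in for the simulator's request type;
// only the field the decider reads is modeled.
type Request struct {
	InputTokens []int
}

// Decision carries the Disaggregate flag named in BC-P2-18 and BC-P2-19.
type Decision struct {
	Disaggregate bool
}

// DirectToDecodeDecider sends prompts shorter than Threshold straight to decode.
type DirectToDecodeDecider struct {
	Threshold int
}

// NewDirectToDecodeDecider panics on a negative threshold (assumed behavior).
func NewDirectToDecodeDecider(threshold int) *DirectToDecodeDecider {
	if threshold < 0 {
		panic("direct-to-decode: threshold must be >= 0")
	}
	return &DirectToDecodeDecider{Threshold: threshold}
}

// Decide returns Disaggregate=false for empty inputs and for inputs shorter
// than the threshold; every other request takes the full PD pipeline.
func (d *DirectToDecodeDecider) Decide(req *Request) Decision {
	if req == nil {
		panic("DirectToDecodeDecider.Decide: nil request")
	}
	n := len(req.InputTokens)
	if n == 0 || n < d.Threshold {
		return Decision{Disaggregate: false}
	}
	return Decision{Disaggregate: true}
}

func main() {
	d := NewDirectToDecodeDecider(256)
	for _, n := range []int{0, 255, 256} {
		req := &Request{InputTokens: make([]int, n)}
		fmt.Printf("len=%d disaggregate=%v\n", n, d.Decide(req).Disaggregate)
	}
}
```

With a threshold of 0 the length comparison never succeeds, so only the explicit empty-input check keeps empty requests on the decode pool in this sketch.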
Adds "direct-to-decode" to validDisaggregationDeciders map and adds factory panic case directing callers to use NewDirectToDecodeDecider. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Adds PDDirectDecodeThreshold field to DeploymentConfig, switch-based decider construction in NewClusterSimulator, and updates interference warning to only fire for fully-disaggregated deployments (decider=always). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ts (INV-P2-4a)
When pools are configured and a request is not disaggregated, route it to the decode pool only via a new poolFilter field on RoutingDecisionEvent. Decode instances handle both phases with interference cost from PR3.
- BC-P2-14: non-disaggregated + pools → decode pool only
- BC-P2-15: non-disaggregated → no ParentRequest records
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
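A rough sketch of the decode-targeted routing rule (BC-P2-14) follows. The PoolRole enum, the Instance type, and both helper functions are hypothetical names chosen for illustration; only the idea of a pool filter restricting the candidate set comes from the commit message. The sketch covers the local (non-disaggregated) path only.

```go
package main

import "fmt"

// PoolRole is a hypothetical enum for pool membership.
type PoolRole int

const (
	RoleNone PoolRole = iota // no filter / instance not assigned to a pool
	RolePrefill
	RoleDecode
)

// Instance is a hypothetical stand-in for a simulated serving instance.
type Instance struct {
	ID   string
	Role PoolRole
}

// poolFilterFor picks the filter for the local routing path: when pools are
// configured and the request is not disaggregated, only decode members qualify.
func poolFilterFor(disaggregate, poolsConfigured bool) PoolRole {
	if poolsConfigured && !disaggregate {
		return RoleDecode
	}
	return RoleNone
}

// candidatesFor returns the instances a routing policy may choose from.
// Only members of the target pool are collected, so no work is spent on
// instances the filter would reject anyway.
func candidatesFor(instances []Instance, filter PoolRole) []Instance {
	if filter == RoleNone {
		return instances
	}
	out := make([]Instance, 0, len(instances))
	for _, inst := range instances {
		if inst.Role == filter {
			out = append(out, inst)
		}
	}
	return out
}

func main() {
	cluster := []Instance{
		{ID: "p0", Role: RolePrefill},
		{ID: "p1", Role: RolePrefill},
		{ID: "d0", Role: RoleDecode},
		{ID: "d1", Role: RoleDecode},
	}
	for _, inst := range candidatesFor(cluster, poolFilterFor(false, true)) {
		fmt.Println(inst.ID)
	}
}
```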
Registers the --pd-direct-decode-threshold flag (default 256) and wires it to DeploymentConfig. Includes validation (>= 0) and stale-flag warning when used with a different decider. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
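The flag wiring described above might look roughly like the following. This sketch uses the standard library flag package as a stand-in; the project's cmd/root.go may use a different flag library. The pdFlags struct, the parsePDFlags helper, the "never" default for --pd-decider, and the warning text are all assumptions made for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// pdFlags is a hypothetical holder for the two PD-related options.
type pdFlags struct {
	Decider   string
	Threshold int
}

// parsePDFlags parses args, validates the threshold (>= 0), and returns a
// stale-flag warning when the threshold is set but another decider is chosen.
func parsePDFlags(args []string) (pdFlags, []string, error) {
	fs := flag.NewFlagSet("sim", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr in this sketch

	var cfg pdFlags
	// The "never" default is an assumption, not taken from the PR.
	fs.StringVar(&cfg.Decider, "pd-decider", "never", "disaggregation decider")
	fs.IntVar(&cfg.Threshold, "pd-direct-decode-threshold", 256,
		"prompts shorter than this many tokens skip disaggregation")

	if err := fs.Parse(args); err != nil {
		return cfg, nil, err
	}
	if cfg.Threshold < 0 {
		return cfg, nil, fmt.Errorf("--pd-direct-decode-threshold must be >= 0, got %d", cfg.Threshold)
	}

	// fs.Visit only walks flags that were explicitly set on the command line.
	thresholdSet := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "pd-direct-decode-threshold" {
			thresholdSet = true
		}
	})

	var warnings []string
	if thresholdSet && cfg.Decider != "direct-to-decode" {
		warnings = append(warnings, fmt.Sprintf(
			"--pd-direct-decode-threshold has no effect with --pd-decider %s", cfg.Decider))
	}
	return cfg, warnings, nil
}

func main() {
	cfg, warnings, err := parsePDFlags([]string{"--pd-decider", "always", "--pd-direct-decode-threshold=64"})
	fmt.Println(cfg, warnings, err)
}
```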
Verifies short prompts route directly to decode pool while long prompts go through full PD pipeline (BC-P2-14, BC-P2-15, BC-P2-16, INV-1). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…decode
- INV-P2-4a: non-disaggregated + pools → decode pool (tested with never decider)
- INV-P2-4b/BC-P2-17: interference applied to mixed-phase decode batches
- INV-6: determinism for mixed short/long workloads
- BC-P2-13: always-disaggregate behavior unchanged by pool filter
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add --pd-direct-decode-threshold to CLI flags listing
- Add DirectToDecodeDecider to disaggregation.go file description
- Add PDDirectDecodeThreshold to DeploymentConfig description
- Add INV-P2-4 (decode-targeted routing) to invariants section
- Update disaggregated data flow with direct-to-decode decider option
- Update local path to show decode pool routing (INV-P2-4)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…delines update
- invariants.md: add INV-P2-3 (interference monotonicity) and INV-P2-4 (decode-targeted routing) with verification strategies
- configuration.md: add direct-to-decode decider to table, semantics section, and --pd-direct-decode-threshold flag
- design-guidelines.md: add DisaggregationDecider to Section 4.2 module map with all four variants
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- TestDirectToDecodeDecider_ClusterConstruction: replace nil check with behavioral assertion (decode pool routing) per BDD/TDD principles
- TestDirectToDecodeDecider_MixedWorkload: tie INV-1 check to request slice lengths instead of hardcoded constant; clarify sub-request vs parent-request counting in comment
- buildPoolFilteredSnapshots: only construct snapshots for target pool members; eliminates O(N/2) wasted Snapshot() calls per routing event
- DisaggregationDecisionEvent.Execute: document no-pools fallback path
- DirectToDecodeDecider.Decide: add empty-input note to method docstring
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
Addresses findings from comprehensive PR review:
Safety (C1, C3, C6, C8):
- C1: Add PDTransferBaseLatencyMs upper-bound guard (<=3.6e9 ms) in NewClusterSimulator to prevent int64 overflow in transfer duration calculation (silent 1 µs clamp bug)
- C3: Add PoolOverrides.Validate(name) method enforcing *TP>0, *MaxModelLen>0, *TotalKVBlocks>0; call from NewClusterSimulator before instance construction (R3)
- C6: Add BlockSizeTokens>0 check in NewClusterSimulator PD block so error appears at construction time, not as a panic inside Run() (R6)
- C8: Add clarifying comments above unreachable panic sites in newInstanceSimulatorCore explaining why they're safe and what to do if the function is ever called outside NewClusterSimulator
Documentation (C4, C5):
- C4: Update NewDisaggregationDecider docstring to include "direct-to-decode" in panic description
- C5: Update PDDecider field comment to include "direct-to-decode" in the valid-values enum
Test coverage (GAP-1, GAP-2, GAP-3):
- GAP-1: Add INV-P2-4 pool-membership assertion to TestPrefixThreshold_HighThresholdNoDisaggregation
- GAP-2: Add TestCollectPDMetrics_DroppedAtDecodeKV table-driven unit test for the TransferCompleteTime>0 && DecodeInstanceID=="" condition
- GAP-3: Add INV-P2-4 pool-membership assertion to TestDisaggregationDecisionEvent_SchedulesRouting
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
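The C3 item can be illustrated with a short validation sketch. The pointer-typed fields are an assumption inferred from the *TP notation in the commit message (nil is taken to mean "no override"); the concrete field types, the error wording, and the nil handling are hypothetical.

```go
package main

import "fmt"

// PoolOverrides models per-pool overrides; a nil pointer is assumed to mean
// "inherit the cluster-wide value". Field types are assumptions.
type PoolOverrides struct {
	TP            *int
	MaxModelLen   *int
	TotalKVBlocks *int
}

// Validate rejects any override that is set but not strictly positive.
// name identifies the pool in the error message.
func (o PoolOverrides) Validate(name string) error {
	fields := []struct {
		label string
		value *int
	}{
		{"TP", o.TP},
		{"MaxModelLen", o.MaxModelLen},
		{"TotalKVBlocks", o.TotalKVBlocks},
	}
	for _, f := range fields {
		if f.value != nil && *f.value <= 0 {
			return fmt.Errorf("%s pool override %s must be > 0, got %d", name, f.label, *f.value)
		}
	}
	return nil
}

func main() {
	zero := 0
	bad := PoolOverrides{TotalKVBlocks: &zero}
	fmt.Println(bad.Validate("decode"))
	fmt.Println(PoolOverrides{}.Validate("prefill"))
}
```

Running the check before any instance is constructed means a bad override surfaces as a configuration error rather than as a failure partway through a simulation run.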
…erage
Performance: detectPrefillCompletions/detectDecodeCompletions O(N×M) → O(K) via pendingPrefillByInstance/pendingDecodeByInstance per-instance indexing. Pending maps cleared at simulation end to release memory for dropped/horizon-truncated requests.
Validation: warn when prefill+decode < total instances (unassigned idle).
Tests: TestDisaggregation_MaxModelLen_DropsOversizedRequests: exact drop count, CompletedRequests verification, and INV-1 conservation check.
Docs: length_capped_requests anomaly counter; README "Four latency modes"; quickstart PD examples; tutorial Step 8 PD break-even walkthrough; extension-recipes tier-composition pattern; pd-disaggregation-demo.yaml.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
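The per-instance indexing idea can be sketched as a map of pending request sets keyed by instance ID. The pendingIndex type and its methods are hypothetical; the commit names the pendingPrefillByInstance and pendingDecodeByInstance maps but does not show their shape, so this is one plausible layout rather than the merged one.

```go
package main

import "fmt"

// pendingIndex tracks which requests are in flight on each instance, so a
// completion check only touches the instance that reported progress instead
// of scanning every request against every instance.
type pendingIndex struct {
	byInstance map[string]map[string]struct{}
}

func newPendingIndex() *pendingIndex {
	return &pendingIndex{byInstance: make(map[string]map[string]struct{})}
}

// add records requestID as pending on instanceID.
func (p *pendingIndex) add(instanceID, requestID string) {
	set, ok := p.byInstance[instanceID]
	if !ok {
		set = make(map[string]struct{})
		p.byInstance[instanceID] = set
	}
	set[requestID] = struct{}{}
}

// complete removes the request and reports whether it was pending.
func (p *pendingIndex) complete(instanceID, requestID string) bool {
	set, ok := p.byInstance[instanceID]
	if !ok {
		return false
	}
	if _, pending := set[requestID]; !pending {
		return false
	}
	delete(set, requestID)
	if len(set) == 0 {
		delete(p.byInstance, instanceID)
	}
	return true
}

// pendingOn returns how many requests are still pending on instanceID.
func (p *pendingIndex) pendingOn(instanceID string) int {
	return len(p.byInstance[instanceID])
}

// clear drops every entry, e.g. at simulation end, so dropped or
// horizon-truncated requests do not keep memory alive.
func (p *pendingIndex) clear() {
	p.byInstance = make(map[string]map[string]struct{})
}

func main() {
	idx := newPendingIndex()
	idx.add("prefill-0", "req-1")
	idx.add("prefill-0", "req-2")
	fmt.Println(idx.complete("prefill-0", "req-1"), idx.pendingOn("prefill-0"))
	idx.clear()
	fmt.Println(idx.pendingOn("prefill-0"))
}
```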
…rage Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
…pshot, contention accuracy, and docs Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
…, and warning accuracy
- sim/disaggregation.go: add nil guard in DirectToDecodeDecider.Decide (panics with named message instead of NPE deep in event loop)
- sim/cluster/cluster_event.go: fix spurious "unhandled pool role" warning; known roles without custom policies fall through silently, and the warning is reserved for unrecognized roles only
- sim/cluster/cluster_event.go: replace dead-code-path comment "When pools are NOT configured" with accurate "Defensive: always true here" note
- cmd/root.go: fix misleading zero-threshold warning; it now qualifies "for non-empty inputs" and notes the empty-input exception
- cmd/root.go: fix topology guard (&&→||) to catch the one-pool-zero case with the friendly "requires both" message
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
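The topology-guard fix comes down to a single boolean operator. The sketch below uses a hypothetical validateTopology helper and assumes it is only called once a PD decider has been selected; the parameter names and error text are illustrative and are not taken from cmd/root.go.

```go
package main

import (
	"errors"
	"fmt"
)

// validateTopology is assumed to run only when PD disaggregation is requested.
// With the earlier form (prefill == 0 && decode == 0) a configuration such as
// prefill=2, decode=0 slipped past this guard; using || catches it here.
func validateTopology(prefillInstances, decodeInstances int) error {
	if prefillInstances == 0 || decodeInstances == 0 {
		return errors.New("PD disaggregation requires both prefill and decode instances")
	}
	return nil
}

func main() {
	fmt.Println(validateTopology(2, 0))
	fmt.Println(validateTopology(2, 2))
}
```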
Summary
- Add DirectToDecodeDecider that routes short prompts (len(InputTokens) < threshold) directly to the decode pool, skipping the prefill→transfer→decode pipeline
- Add poolFilter field to RoutingDecisionEvent so non-disaggregated requests with pools configured route to decode pool only (INV-P2-4)
- Add --pd-direct-decode-threshold CLI flag (default 256) and --pd-decider direct-to-decode option

Closes #636
Behavioral Contracts
- TestDirectToDecodeDecider_BackwardCompat_AlwaysUnchanged
- TestDirectToDecodeDecider_PoolFilterRoutesToDecodePool
- TestDirectToDecodeDecider_MixedWorkload
- TestDirectToDecodeDecider_MixedWorkload
- TestDirectToDecodeDecider_INVP24b_InterferenceApplied
- len < threshold → false; >= threshold → true (sim/disaggregation_test.go)

Test plan
- TestDirectToDecodeDecider_* (6 tests in sim/disaggregation_test.go)
- TestDirectToDecodeDecider_MixedWorkload: 3 short + 3 long requests, verifies split routing
- TestDirectToDecodeDecider_INVP24a_DecodeTargetedRouting: uses never decider to verify invariant applies at event level
- TestDirectToDecodeDecider_INVP24b_InterferenceApplied: interference increases sim time for mixed batches
- TestDirectToDecodeDecider_Determinism: same seed produces identical results
- TestDirectToDecodeDecider_BackwardCompat_AlwaysUnchanged: always-disaggregate unaffected
- go test ./... -count=1: all pass

🤖 Generated with Claude Code