feat: direct-to-decode decider for short-prompt bypass (PD Phase 2, PR4)#650

Merged
namasl merged 15 commits into inference-sim:pd from namasl:pd-pr4-direct-to-decode
Mar 14, 2026


@namasl namasl commented Mar 14, 2026

Summary

  • Add DirectToDecodeDecider that routes short prompts (len(InputTokens) < threshold) directly to the decode pool, skipping the prefill→transfer→decode pipeline
  • Add poolFilter field to RoutingDecisionEvent so non-disaggregated requests with pools configured route to decode pool only (INV-P2-4)
  • Add --pd-direct-decode-threshold CLI flag (default 256) and --pd-decider direct-to-decode option
  • Add INV-P2-4 (decode-targeted routing) invariant with verification tests
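The threshold rule in the first bullet can be sketched as follows. This is a minimal illustration, not the code from this PR: the `Request` and `Decision` types and their fields are hypothetical stand-ins, and only the comparison itself (`len(InputTokens) < threshold` bypasses disaggregation, empty input never disaggregates, nil requests panic with a named message) is taken from the contracts and commit notes on this page.

```go
package main

import "fmt"

// Request is a hypothetical stand-in for the simulator's request type.
type Request struct {
	InputTokens []int
}

// Decision is a hypothetical result type carrying the Disaggregate flag.
type Decision struct {
	Disaggregate bool
}

// DirectToDecodeDecider bypasses disaggregation for prompts shorter than Threshold.
type DirectToDecodeDecider struct {
	Threshold int
}

// NewDirectToDecodeDecider builds a decider with the given token threshold.
func NewDirectToDecodeDecider(threshold int) *DirectToDecodeDecider {
	return &DirectToDecodeDecider{Threshold: threshold}
}

// Decide returns Disaggregate=false for short or empty prompts and
// Disaggregate=true once the prompt length reaches the threshold.
func (d *DirectToDecodeDecider) Decide(req *Request) Decision {
	if req == nil {
		// Fail early with a named message rather than deep in the event loop.
		panic("DirectToDecodeDecider.Decide: nil request")
	}
	n := len(req.InputTokens)
	if n == 0 {
		// Empty input never disaggregates, even when Threshold is 0.
		return Decision{Disaggregate: false}
	}
	return Decision{Disaggregate: n >= d.Threshold}
}

func main() {
	d := NewDirectToDecodeDecider(256)
	for _, n := range []int{0, 255, 256, 1024} {
		req := &Request{InputTokens: make([]int, n)}
		fmt.Printf("len=%d disaggregate=%v\n", n, d.Decide(req).Disaggregate)
	}
}
```

With the default threshold of 256, a 255-token prompt stays on the decode pool while a 256-token prompt goes through the full pipeline.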

Closes #636

Behavioral Contracts

| ID | Contract | Verified By |
| --- | --- | --- |
| BC-P2-13 | Other deciders unaffected | TestDirectToDecodeDecider_BackwardCompat_AlwaysUnchanged |
| BC-P2-14 | Non-disaggregated + pools → decode pool | TestDirectToDecodeDecider_PoolFilterRoutesToDecodePool |
| BC-P2-15 | Non-disaggregated → no ParentRequest | TestDirectToDecodeDecider_MixedWorkload |
| BC-P2-16 | Disaggregated → INV-PD-2 preserved | TestDirectToDecodeDecider_MixedWorkload |
| BC-P2-17 | Mixed-phase batches → interference > 1.0 | TestDirectToDecodeDecider_INVP24b_InterferenceApplied |
| BC-P2-18 | len < threshold → false; >= threshold → true | Unit tests in sim/disaggregation_test.go |
| BC-P2-19 | Empty input → false | Unit test |

Test plan

  • Unit tests: TestDirectToDecodeDecider_* (6 tests in sim/disaggregation_test.go)
  • Integration: TestDirectToDecodeDecider_MixedWorkload — 3 short + 3 long requests, verifies split routing
  • INV-P2-4a: TestDirectToDecodeDecider_INVP24a_DecodeTargetedRouting — uses never decider to verify invariant applies at event level
  • INV-P2-4b: TestDirectToDecodeDecider_INVP24b_InterferenceApplied — interference increases sim time for mixed batches
  • INV-6: TestDirectToDecodeDecider_Determinism — same seed produces identical results
  • BC-P2-13: TestDirectToDecodeDecider_BackwardCompat_AlwaysUnchanged — always-disaggregate unaffected
  • Full suite: go test ./... -count=1 — all pass

🤖 Generated with Claude Code

namasl and others added 15 commits March 14, 2026 05:01
Routes short prompts (len(InputTokens) < threshold) directly to the
decode pool, skipping disaggregation. Long prompts continue through
the full prefill→transfer→decode pipeline.

BC-P2-18: len(InputTokens) < threshold → Disaggregate=false
BC-P2-19: empty input → Disaggregate=false

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Adds "direct-to-decode" to validDisaggregationDeciders map and adds
factory panic case directing callers to use NewDirectToDecodeDecider.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Adds PDDirectDecodeThreshold field to DeploymentConfig, switch-based
decider construction in NewClusterSimulator, and updates interference
warning to only fire for fully-disaggregated deployments (decider=always).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ts (INV-P2-4a)

When pools are configured and a request is not disaggregated, route it
to the decode pool only via a new poolFilter field on RoutingDecisionEvent.
Decode instances handle both phases with interference cost from PR3.

BC-P2-14: non-disaggregated + pools → decode pool only
BC-P2-15: non-disaggregated → no ParentRequest records

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
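The pool-filtered routing described in the commit above might look roughly like this. It is a sketch under assumptions: `PoolRole`, `Instance`, `Snapshot`, `filteredSnapshots`, and the least-loaded pick are all hypothetical names chosen for illustration. The only behavior taken from this PR is that a non-disaggregated request is routed among decode-pool instances only, and that snapshots are built just for members of the target pool.

```go
package main

import "fmt"

// PoolRole is a hypothetical tag for which pool an instance belongs to.
type PoolRole string

const (
	PoolAny     PoolRole = "" // no filter: every instance is a candidate
	PoolPrefill PoolRole = "prefill"
	PoolDecode  PoolRole = "decode"
)

// Instance is a hypothetical simulator instance.
type Instance struct {
	ID         string
	Role       PoolRole
	QueueDepth int
}

// Snapshot is a hypothetical routing view of one instance.
type Snapshot struct {
	InstanceID string
	QueueDepth int
}

// Snapshot captures the routing-relevant state of the instance.
func (i Instance) Snapshot() Snapshot {
	return Snapshot{InstanceID: i.ID, QueueDepth: i.QueueDepth}
}

// filteredSnapshots takes snapshots only for instances matching poolFilter,
// so instances outside the target pool are never snapshotted.
func filteredSnapshots(instances []Instance, poolFilter PoolRole) []Snapshot {
	var out []Snapshot
	for _, inst := range instances {
		if poolFilter != PoolAny && inst.Role != poolFilter {
			continue
		}
		out = append(out, inst.Snapshot())
	}
	return out
}

// pickLeastLoaded is a placeholder routing policy over the filtered candidates.
func pickLeastLoaded(snaps []Snapshot) (string, bool) {
	if len(snaps) == 0 {
		return "", false
	}
	best := snaps[0]
	for _, s := range snaps[1:] {
		if s.QueueDepth < best.QueueDepth {
			best = s
		}
	}
	return best.InstanceID, true
}

func main() {
	instances := []Instance{
		{ID: "p0", Role: PoolPrefill, QueueDepth: 0},
		{ID: "d0", Role: PoolDecode, QueueDepth: 3},
		{ID: "d1", Role: PoolDecode, QueueDepth: 1},
	}
	id, ok := pickLeastLoaded(filteredSnapshots(instances, PoolDecode))
	fmt.Println(id, ok)
}
```

Here the idle prefill instance `p0` is never considered, so the request lands on `d1`, the least-loaded decode instance.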
Registers the --pd-direct-decode-threshold flag (default 256) and wires
it to DeploymentConfig. Includes validation (>= 0) and stale-flag
warning when used with a different decider.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
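The flag handling described in the commit above (default 256, `>= 0` validation, stale-flag warning) can be sketched with the standard library `flag` package. This is an assumption-laden illustration: the PR's own wiring lives in cmd/root.go and may use a different flag library, and `parsePDFlags`, `pdFlags`, the `"never"` default for `--pd-decider`, and the message wording are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// pdFlags is a hypothetical holder for the two PD-related flag values.
type pdFlags struct {
	Decider               string
	DirectDecodeThreshold int
}

// parsePDFlags parses args, validates the threshold, and returns a warning
// when the threshold flag is set but a different decider is selected.
func parsePDFlags(args []string) (pdFlags, []string, error) {
	var cfg pdFlags
	fs := flag.NewFlagSet("sim", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr in this sketch
	// The "never" default is an assumption made for this example.
	fs.StringVar(&cfg.Decider, "pd-decider", "never", "disaggregation decider")
	fs.IntVar(&cfg.DirectDecodeThreshold, "pd-direct-decode-threshold", 256,
		"prompts shorter than this many tokens skip prefill disaggregation")
	if err := fs.Parse(args); err != nil {
		return cfg, nil, err
	}
	if cfg.DirectDecodeThreshold < 0 {
		return cfg, nil, fmt.Errorf("--pd-direct-decode-threshold must be >= 0, got %d",
			cfg.DirectDecodeThreshold)
	}
	// Visit only reports flags that were explicitly set on the command line.
	thresholdSet := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "pd-direct-decode-threshold" {
			thresholdSet = true
		}
	})
	var warnings []string
	if thresholdSet && cfg.Decider != "direct-to-decode" {
		warnings = append(warnings, fmt.Sprintf(
			"--pd-direct-decode-threshold has no effect with --pd-decider=%s", cfg.Decider))
	}
	return cfg, warnings, nil
}

func main() {
	cfg, warnings, err := parsePDFlags([]string{
		"--pd-decider", "always", "--pd-direct-decode-threshold", "64",
	})
	fmt.Println(cfg, warnings, err)
}
```

Checking whether the flag was explicitly set, rather than comparing against the default, avoids a false warning when a user passes `256` on purpose.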
Verifies short prompts route directly to decode pool while long prompts
go through full PD pipeline (BC-P2-14, BC-P2-15, BC-P2-16, INV-1).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…decode

- INV-P2-4a: non-disaggregated + pools → decode pool (tested with never decider)
- INV-P2-4b/BC-P2-17: interference applied to mixed-phase decode batches
- INV-6: determinism for mixed short/long workloads
- BC-P2-13: always-disaggregate behavior unchanged by pool filter

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add --pd-direct-decode-threshold to CLI flags listing
- Add DirectToDecodeDecider to disaggregation.go file description
- Add PDDirectDecodeThreshold to DeploymentConfig description
- Add INV-P2-4 (decode-targeted routing) to invariants section
- Update disaggregated data flow with direct-to-decode decider option
- Update local path to show decode pool routing (INV-P2-4)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…delines update

- invariants.md: add INV-P2-3 (interference monotonicity) and INV-P2-4
  (decode-targeted routing) with verification strategies
- configuration.md: add direct-to-decode decider to table, semantics
  section, and --pd-direct-decode-threshold flag
- design-guidelines.md: add DisaggregationDecider to Section 4.2 module
  map with all four variants

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- TestDirectToDecodeDecider_ClusterConstruction: replace nil check with
  behavioral assertion (decode pool routing) per BDD/TDD principles
- TestDirectToDecodeDecider_MixedWorkload: tie INV-1 check to request
  slice lengths instead of hardcoded constant; clarify sub-request
  vs parent-request counting in comment
- buildPoolFilteredSnapshots: only construct snapshots for target pool
  members; eliminates O(N/2) wasted Snapshot() calls per routing event
- DisaggregationDecisionEvent.Execute: document no-pools fallback path
- DirectToDecodeDecider.Decide: add empty-input note to method docstring

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
Addresses findings from comprehensive PR review:

Safety (C1, C3, C6, C8):
- C1: Add PDTransferBaseLatencyMs upper-bound guard (<=3.6e9 ms) in
  NewClusterSimulator to prevent int64 overflow in transfer duration
  calculation (silent 1 µs clamp bug)
- C3: Add PoolOverrides.Validate(name) method enforcing *TP>0,
  *MaxModelLen>0, *TotalKVBlocks>0; call from NewClusterSimulator
  before instance construction (R3)
- C6: Add BlockSizeTokens>0 check in NewClusterSimulator PD block so
  error appears at construction time, not as a panic inside Run() (R6)
- C8: Add clarifying comments above unreachable panic sites in
  newInstanceSimulatorCore explaining why they're safe and what to do
  if the function is ever called outside NewClusterSimulator
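As an illustration of item C3, a validation method over optional overrides could take the following shape. The field types are assumptions: the `*TP>0` notation in the commit message suggests pointer fields where nil means "not overridden", but the real struct layout and error wording are not shown on this page.

```go
package main

import "fmt"

// PoolOverrides sketches per-pool overrides; nil means "use the global value".
// The pointer-to-int field types are an assumption for this example.
type PoolOverrides struct {
	TP            *int
	MaxModelLen   *int
	TotalKVBlocks *int
}

// Validate rejects any override that is set but not strictly positive.
// name identifies the pool in the error message.
func (o PoolOverrides) Validate(name string) error {
	fields := []struct {
		label string
		value *int
	}{
		{"TP", o.TP},
		{"MaxModelLen", o.MaxModelLen},
		{"TotalKVBlocks", o.TotalKVBlocks},
	}
	for _, f := range fields {
		if f.value != nil && *f.value <= 0 {
			return fmt.Errorf("%s pool override %s must be > 0, got %d",
				name, f.label, *f.value)
		}
	}
	return nil
}

// intPtr is a small helper for building overrides in examples and tests.
func intPtr(v int) *int { return &v }

func main() {
	good := PoolOverrides{TP: intPtr(2)}
	bad := PoolOverrides{TotalKVBlocks: intPtr(0)}
	fmt.Println(good.Validate("prefill"))
	fmt.Println(bad.Validate("decode"))
}
```

Calling such a method before instance construction turns a late panic into a construction-time error, which is the stated goal of C3 and C6.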

Documentation (C4, C5):
- C4: Update NewDisaggregationDecider docstring to include
  "direct-to-decode" in panic description
- C5: Update PDDecider field comment to include "direct-to-decode"
  in the valid-values enum

Test coverage (GAP-1, GAP-2, GAP-3):
- GAP-1: Add INV-P2-4 pool-membership assertion to
  TestPrefixThreshold_HighThresholdNoDisaggregation
- GAP-2: Add TestCollectPDMetrics_DroppedAtDecodeKV table-driven unit
  test for the TransferCompleteTime>0 && DecodeInstanceID=="" condition
- GAP-3: Add INV-P2-4 pool-membership assertion to
  TestDisaggregationDecisionEvent_SchedulesRouting

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
…erage

Performance: detectPrefillCompletions/detectDecodeCompletions O(N×M) → O(K)
via pendingPrefillByInstance/pendingDecodeByInstance per-instance indexing.
Pending maps cleared at simulation end to release memory for dropped/horizon-
truncated requests.
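The per-instance indexing idea can be sketched as a map from instance ID to the set of request IDs pending on that instance, so that completion detection touches only one instance's entries instead of scanning every pending request. The type and method names below are hypothetical, and the real `pendingPrefillByInstance` and `pendingDecodeByInstance` structures may store richer values than a set of IDs.

```go
package main

import "fmt"

// pendingIndex maps instance ID -> set of request IDs pending on that instance.
type pendingIndex map[string]map[string]struct{}

// add records that requestID is pending on instanceID.
func (p pendingIndex) add(instanceID, requestID string) {
	set, ok := p[instanceID]
	if !ok {
		set = make(map[string]struct{})
		p[instanceID] = set
	}
	set[requestID] = struct{}{}
}

// resolve removes and returns the finished IDs that were pending on the
// instance. Work is proportional to len(finished), not to the total number
// of pending requests across the cluster.
func (p pendingIndex) resolve(instanceID string, finished []string) []string {
	set := p[instanceID]
	var done []string
	for _, id := range finished {
		if _, ok := set[id]; ok {
			delete(set, id)
			done = append(done, id)
		}
	}
	if len(set) == 0 {
		delete(p, instanceID) // drop empty buckets so the index stays small
	}
	return done
}

// reset drops every entry, e.g. at simulation end, so requests that were
// dropped or cut off by the horizon do not pin memory.
func (p pendingIndex) reset() {
	for k := range p {
		delete(p, k)
	}
}

func main() {
	idx := pendingIndex{}
	idx.add("prefill-0", "r1")
	idx.add("prefill-0", "r2")
	idx.add("prefill-1", "r3")
	fmt.Println(idx.resolve("prefill-0", []string{"r1", "r9"}))
	idx.reset()
	fmt.Println(len(idx))
}
```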

Validation: warn when prefill+decode < total instances (unassigned idle).

Tests: TestDisaggregation_MaxModelLen_DropsOversizedRequests — exact drop count,
CompletedRequests verification, and INV-1 conservation check.

Docs: length_capped_requests anomaly counter; README "Four latency modes";
quickstart PD examples; tutorial Step 8 PD break-even walkthrough;
extension-recipes tier-composition pattern; pd-disaggregation-demo.yaml.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
…pshot, contention accuracy, and docs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
…, and warning accuracy

- sim/disaggregation.go: add nil guard in DirectToDecodeDecider.Decide (panics with a
  named message instead of a nil-pointer dereference deep in the event loop)
- sim/cluster/cluster_event.go: fix spurious "unhandled pool role" warning — known
  roles without custom policies fall through silently; warning reserved for
  unrecognized roles only
- sim/cluster/cluster_event.go: replace dead-code-path comment "When pools are NOT
  configured" with accurate "Defensive: always true here" note
- cmd/root.go: fix misleading zero-threshold warning — qualifies "for non-empty
  inputs" and notes the empty-input exception
- cmd/root.go: fix topology guard (&&→||) to catch the one-pool-zero case with the
  friendly "requires both" message
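One plausible reading of the topology-guard fix in the last item is sketched below. The function name, parameters, and message text are hypothetical, and the actual condition in cmd/root.go is not shown on this page. The point of the sketch is only why `||` is needed: with `&&`, a configuration where exactly one pool is zero slips past the check.

```go
package main

import "fmt"

// validateTopology is a hypothetical guard: if either pool is configured,
// both must be non-zero.
func validateTopology(prefillInstances, decodeInstances int) error {
	pdRequested := prefillInstances > 0 || decodeInstances > 0
	// Using && here would only fire when both pools are zero, which can
	// never coincide with pdRequested; || catches the one-pool-zero case.
	if pdRequested && (prefillInstances == 0 || decodeInstances == 0) {
		return fmt.Errorf(
			"PD disaggregation requires both prefill and decode instances (got prefill=%d, decode=%d)",
			prefillInstances, decodeInstances)
	}
	return nil
}

func main() {
	fmt.Println(validateTopology(2, 2))
	fmt.Println(validateTopology(2, 0))
	fmt.Println(validateTopology(0, 0))
}
```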

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>