Comparing changes

* feat(sim): add pool topology and disaggregation decision pipeline (#591)

  Establishes PD (Prefill-Decode) disaggregation foundation: pool topology as a first-class cluster concept and the disaggregation decision point in the event pipeline. When pools are not configured (--prefill-instances=0 --decode-instances=0), output is byte-identical to pre-PR1 behavior (BC-PD-1 verified at seeds 42, 123, 999).

  - Add DisaggregationDecider interface with NeverDisaggregate and AlwaysDisaggregate implementations (sim/disaggregation.go)
  - Add PoolRole type, ValidatePoolTopology, BuildPoolMembership (sim/cluster/pool.go)
  - Add DisaggregationDecisionEvent (priority 3) to cluster event pipeline
  - Conditional branch in AdmissionDecisionEvent.Execute: pools configured → DisaggregationDecisionEvent, otherwise → RoutingDecisionEvent
  - Add PrefillInstances, DecodeInstances, PDDecider fields to DeploymentConfig (zero-value = disabled, all existing construction sites backward-compatible per R4 audit)
  - Add --prefill-instances, --decode-instances, --pd-decider CLI flags
  - Update CLAUDE.md with new files and flags

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

* fix(sim): convergence review fixes for pool topology PR

  Address 4 IMPORTANT findings from pr-code convergence review:

  1. PoolMembership() now returns a defensive copy instead of the internal map, complying with R8 (no exported mutable maps)
  2. CLI now calls ValidatePoolTopology at the boundary for clean error messages instead of relying on library panic
  3. Factory test verifies concrete type via fmt.Sprintf("%T"), fixing unused wantType field
  4. Integration tests verify INV-1 (request conservation) and INV-5 (causality) after simulation with disaggregation enabled

  Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

  ---------

  Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
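A minimal Go sketch of the three ideas in this PR: a decider interface with two trivial implementations, a topology check, and a defensive-copy accessor. The `Decide` signature, the lowercase `validatePoolTopology` rules, and the `cluster` type are assumptions for illustration; only the type names and the `DisaggregationDecision{Disaggregate: ...}` literal form appear in the commit messages.

```go
package main

import (
	"errors"
	"fmt"
)

// DisaggregationDecision follows the struct-literal form quoted in the log.
type DisaggregationDecision struct {
	Disaggregate bool
}

// DisaggregationDecider is a hypothetical shape; the real method signature
// in sim/disaggregation.go is not shown in these commit messages.
type DisaggregationDecider interface {
	Decide(inputTokens int) DisaggregationDecision
}

// NeverDisaggregate keeps every request on the standard routing path.
type NeverDisaggregate struct{}

func (NeverDisaggregate) Decide(int) DisaggregationDecision {
	return DisaggregationDecision{Disaggregate: false}
}

// AlwaysDisaggregate sends every request through the prefill/decode split.
type AlwaysDisaggregate struct{}

func (AlwaysDisaggregate) Decide(int) DisaggregationDecision {
	return DisaggregationDecision{Disaggregate: true}
}

// validatePoolTopology encodes an assumed rule set: both counts zero means
// PD is disabled; otherwise both must be positive and fit in the cluster.
func validatePoolTopology(prefill, decode, total int) error {
	if prefill == 0 && decode == 0 {
		return nil
	}
	if prefill <= 0 || decode <= 0 {
		return errors.New("prefill and decode instance counts must both be positive")
	}
	if prefill+decode > total {
		return fmt.Errorf("prefill (%d) + decode (%d) exceeds %d instances", prefill, decode, total)
	}
	return nil
}

// cluster keeps pool membership private (instance ID -> role name).
type cluster struct {
	poolMembership map[string]string
}

// PoolMembership returns a copy so callers cannot mutate internal state.
func (c *cluster) PoolMembership() map[string]string {
	out := make(map[string]string, len(c.poolMembership))
	for id, role := range c.poolMembership {
		out[id] = role
	}
	return out
}

func main() {
	var d DisaggregationDecider = AlwaysDisaggregate{}
	fmt.Println("disaggregate:", d.Decide(128).Disaggregate)
	fmt.Println("topology error:", validatePoolTopology(3, 2, 4))

	c := &cluster{poolMembership: map[string]string{"inst-0": "prefill"}}
	view := c.PoolMembership()
	view["inst-0"] = "decode" // mutates only the copy
	fmt.Println("internal role:", c.poolMembership["inst-0"])
}
```

Returning a copy costs one allocation per call, which is usually acceptable for an accessor that is read after a run rather than inside the event loop.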

* feat(cluster): add ParentRequest type and FilterSnapshotsByPool (BC-PD-5, BC-PD-7, BC-PD-9)

  - Add ParentRequest struct for tracking disaggregated request lifecycle
  - Add FilterSnapshotsByPool for pool-scoped routing
  - Foundation types for PR2 disaggregated request flow

  Co-Authored-By: Claude <[email protected]>

* feat(cluster): add PD transfer config fields and CLI flags (EC-PD-2, BC-PD-14)

  - Add PDTransferBandwidthGBps, PDTransferBaseLatencyMs, PDKVBytesPerToken to DeploymentConfig
  - Add PrefillScorerConfigs, DecodeScorerConfigs for per-pool routing
  - Add CLI flags: --pd-transfer-bandwidth, --pd-transfer-base-latency, --pd-kv-bytes-per-token
  - Add CLI flags: --prefill-routing-scorers, --decode-routing-scorers
  - Validation: bandwidth finite positive, base latency finite non-negative, bytes-per-token > 0 (R3, R11)

  Co-Authored-By: Claude <[email protected]>

* feat(cluster): implement end-to-end disaggregated request flow (BC-PD-5 through BC-PD-14)

  - Bifurcate DisaggregationDecisionEvent: disaggregate=true → PrefillRoutingEvent, false → RoutingDecisionEvent
  - Add 4 new event types: PrefillRoutingEvent (4), KVTransferStartedEvent (5), KVTransferCompletedEvent (6), DecodeRoutingEvent (7)
  - Add pool-filtered snapshot building with buildPoolFilteredSnapshots helper
  - Add prefill completion detection in cluster event loop via pendingPrefillCompletions map
  - Implement KV transfer pipeline: compute duration from block count and bandwidth, schedule events
  - Add AllocateTransferredKV and InjectDecodeOnline to InstanceSimulator
  - Add EnqueueDecodeSubRequest to Simulator (bypasses oversized guard and TotalInputTokens counting)
  - Add decode-only batch formation path in VLLMBatchFormation Phase 2
  - Initialize per-pool routing policies with separate RNG partitions
  - Update PR1 test for disaggregated conservation counting

  Co-Authored-By: Claude <[email protected]>

* test(cluster): comprehensive disaggregation invariant and integration tests (BC-PD-5 through BC-PD-15)

  - Pool exclusivity: prefill/decode sub-requests routed to correct pools (BC-PD-7)
  - Full path completion: requests complete through disaggregated pipeline (BC-PD-5)
  - Phase causality: timestamps form valid causal chain (INV-PD-4)
  - Transfer conservation: initiated == completed (INV-PD-3)
  - Pool stability: membership unchanged after simulation (INV-PD-5)
  - Determinism: same seed produces identical metrics (INV-6)
  - Backward compatibility: non-disaggregated path unchanged (BC-PD-13)
  - Per-pool scorer configs: separate routing policies work end-to-end (BC-PD-15)
  - AllocateTransferredKV: success and insufficient capacity cases

  Co-Authored-By: Claude <[email protected]>

* docs: update CLAUDE.md and add PR2 implementation plan

  - Add disaggregated data flow diagram to Key Data Flow section
  - Add INV-PD-1 through INV-PD-5 to Key Invariants
  - Add new CLI flags to cmd/root.go description
  - Add pd_events.go and parent_request.go to file organization
  - Include PR2 micro plan in docs/plans/

  Co-Authored-By: Claude <[email protected]>

  ---------

  Co-authored-by: Claude <[email protected]>
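The KV transfer step above ("compute duration from block count and bandwidth") might be sketched as follows. The formula (base latency plus bytes over bandwidth), the decimal reading of GB as 1e9 bytes, the float type of every parameter, and the function names are assumptions; the commit messages only name the three config fields, the validation rules, and a duration of at least 1µs.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// validateTransferParams applies the rules listed in the log: bandwidth
// finite and positive, base latency finite and non-negative, bytes-per-token > 0.
func validateTransferParams(bandwidthGBps, baseLatencyMs, kvBytesPerToken float64) error {
	finite := func(x float64) bool { return !math.IsNaN(x) && !math.IsInf(x, 0) }
	if !finite(bandwidthGBps) || bandwidthGBps <= 0 {
		return errors.New("transfer bandwidth must be finite and positive")
	}
	if !finite(baseLatencyMs) || baseLatencyMs < 0 {
		return errors.New("transfer base latency must be finite and non-negative")
	}
	if !finite(kvBytesPerToken) || kvBytesPerToken <= 0 {
		return errors.New("KV bytes per token must be positive")
	}
	return nil
}

// transferDurationMicros is an assumed model:
//   duration = base latency + (blocks * blockSize * bytesPerToken) / bandwidth
// rounded up to whole microseconds, with a 1µs floor.
func transferDurationMicros(numBlocks, blockSizeTokens int, kvBytesPerToken, bandwidthGBps, baseLatencyMs float64) int64 {
	totalBytes := float64(numBlocks*blockSizeTokens) * kvBytesPerToken
	wireMicros := totalBytes / (bandwidthGBps * 1e9) * 1e6 // assumes 1 GB = 1e9 bytes
	d := int64(math.Ceil(baseLatencyMs*1e3 + wireMicros))
	if d < 1 {
		d = 1
	}
	return d
}

func main() {
	// Illustrative values only: 12 blocks of 16 tokens, 128 KiB of KV per token,
	// 25 GB/s link, 0.05 ms base latency.
	if err := validateTransferParams(25, 0.05, 131072); err != nil {
		panic(err)
	}
	fmt.Println("transfer µs:", transferDurationMicros(12, 16, 131072, 25, 0.05))
}
```

Rounding up and flooring at 1µs keeps the completion event strictly after the start event even for an empty or tiny transfer.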

* feat(cluster): add PDMetrics struct and CollectPDMetrics (BC-1,BC-2,BC-4,BC-5,BC-6,BC-7,BC-10,BC-11) - PDMetrics: DisaggregatedCount, ParentTTFT, TransferDuration, PrefillThroughput, DecodeThroughput, LoadImbalanceRatio - CollectPDMetrics: pure function, R2-compliant sort, BC-11 TTFT filter - collectPoolThroughput: R11-guarded division, math.MaxFloat64 sentinel - 10 unit + integration tests covering all BC edge cases Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * feat(cluster): add ParentRequests and PerInstanceMetricsByID accessors (BC-11,BC-12) - ParentRequests(): returns sorted-by-ID slice after Run() (R2, BC-11) - PerInstanceMetricsByID(): returns copy map after Run() (R8, BC-12) - Fix ParentTTFT to use PrefillSubReqID (decode sub-reqs don't record TTFT as ProgressIndex starts at inputLen; prefill sub-req TTFT = first token time) - Add 4 integration+invariant tests: accessor coverage, pool conservation (BC-3), BC-1 causality invariant Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * feat(cluster): add PD *PDMetrics field to RawMetrics (BC-9) - RawMetrics.PD: nil when disaggregation inactive (Go zero value) - Named-field construction at all sites remains safe (R4) - Test BC-9: PD == nil in non-disaggregated CollectRawMetrics path Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * feat(cmd): wire CollectPDMetrics and add printPDMetrics output (PR3) - Wire rawMetrics.PD = CollectPDMetrics(...) 
after CollectRawMetrics - Add printPDMetrics: prints === PD Metrics === section when pd != nil - Outputs DisaggregatedCount, throughput, LoadImbalanceRatio (inf sentinel), ParentTTFT percentiles, KV Transfer Duration percentiles - 2 output tests: nil no-op, section content verification Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * docs(claude): update file organization for PR3 pd_metrics.go - Add pd_metrics.go entry to sim/cluster/ tree - Update cluster.go entry with new accessor methods - Update metrics.go entry to mention PD *PDMetrics field Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(sim): convergence review Round 1 fixes for PR3 pd-metrics Critical fixes: - pd_events.go: increment droppedAtDecodeKV on KV allocation failure (R1/INV-1) - cluster.go: account for droppedAtDecodeKV in aggregated DroppedUnservable after Run() - cluster.go: panic on nil parent in detectPrefillCompletions (R1, was silent continue) - docs/contributing/standards/invariants.md: add INV-PD-1 through INV-PD-5 (cross-doc fix) Important fixes: - pd_metrics.go: nil guard for aggregated param in CollectPDMetrics - pd_metrics.go: nil guard for *sim.Metrics entries in collectPoolThroughput - pd_metrics_test.go: fix BC-1 comment (was DecodeSubReqID, correct is PrefillSubReqID) - disaggregation_test.go: t.Skip → t.Fatal in BC-1 causality test (masks regressions) - disaggregation_test.go: remove internal field access from ParentRequests/PerInstanceMetrics tests - disaggregation_test.go: add TestClusterSimulator_DisaggregatedINV1_Conservation (R7) - docs/guide/results.md: document === PD Metrics === output section - docs/reference/configuration.md: document PD disaggregation flags Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(sim): convergence review round 2 fixes for PR3 PD metrics - Fix R1: pd_metrics.go collectPoolThroughput panics on nil *sim.Metrics instead of silently skipping (was: silent continue, no diagnostic) - Fix R1: CollectPDMetrics panics when 
aggregated==nil with non-empty parents instead of silently returning nil (undocumented data loss path) - Fix INV-1 test: TestClusterSimulator_DisaggregatedINV1_Conservation now asserts the conservation identity (completed+queued+running+dropped == 2*N injected sub-requests) rather than a bare value check - Fix plan doc: update Section A "Key insight" and Section E "THE TRICKY PART" to describe PrefillSubReqID mechanism (not DecodeSubReqID) — the plan shipped with the original incorrect design that was corrected during implementation - Add PD disaggregation section to docs/guide/cluster.md with usage note warning that --pd-decider always is required to activate PD mode Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(sim): convergence review round 3 fixes for PR3 PD metrics Critical bug fix — INV-PD-4 stale instance clock: - sim/simulator.go: add currentTime int64 parameter to EnqueueDecodeSubRequest; idle StepEvent now uses cluster clock (not stale instance sim.Clock), preventing decode completions from firing before decode enqueue time - sim/cluster/instance.go: propagate currentTime through InjectDecodeOnline - sim/cluster/pd_events.go: pass e.time to InjectDecodeOnline in DecodeRoutingEvent Decode completion tracking (required to set parent.CompletionTime): - sim/cluster/cluster.go: add pendingDecodeCompletions map; add detectDecodeCompletions; extend event loop to call detectDecodeCompletions on decode pool instances - sim/cluster/cluster.go: fix detectPrefillCompletions to collect+sort keys before processing (R2/INV-6 determinism: seqIDs must be assigned in stable order) Orphaned parent record fix (IMPORTANT from PC-8/PC-1): - sim/cluster/pd_events.go: set e.parentReq.CompletionTime = e.time when decode KV allocation fails, preventing ParentRequests() from returning records in limbo (TransferCompleteTime set but CompletionTime = 0) Test quality fixes (IMPORTANT from PC-3): - disaggregation_test.go: replace structural nil-check assertions in 
TestDisaggregation_PerPoolScorerConfigs with behavioral comment; rely on TotalOutputTokens > 0 as observable behavioral assertion (BDD/TDD refactor survival) - disaggregation_test.go: fix zero-timestamp check in TestDisaggregation_PhaseCausality to use CompletionTime > 0 (latency model always produces ≥1 step duration) - disaggregation_test.go: replace internal field access with public accessors (cs.parentRequests → cs.ParentRequests(), cs.poolMembership → cs.PoolMembership()) - disaggregation_test.go: add TestDisaggregation_NoCrossPoolRouting (NC-PD-1) Documentation (IMPORTANT from PC-4/PC-7): - docs/guide/cluster.md: add PD Troubleshooting section and Known Simplifications subsection documenting atomic transfer, no cross-instance preemption, no transfer retry, and fixed pool sizes as Phase 1 limitations - docs/guide/results.md: clarify PD metrics are stdout-only, not in JSON file - docs/contributing/extension-recipes.md: expand disaggregation decider recipe step 3 with full CLI parameter wiring code example Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(sim/cluster): convergence review fixes for PD metrics PR (Round 3) - docs/contributing/extension-recipes.md: fix DisaggregationDecider recipe - Wrong enum constants (DisaggregationDecisionDisaggregate/Local) replaced with correct DisaggregationDecision{Disaggregate: true/false} struct literals - Factory wiring now shows correct pattern: extend NewDisaggregationDecider signature to accept extra parameter (avoids import cycle) and update call site in cluster.go - cmd/root.go: use math.IsInf(pd.LoadImbalanceRatio, 1) instead of fragile >= math.MaxFloat64/2 comparison for infinity sentinel detection - sim/cluster/disaggregation_test.go: add TestDisaggregation_INV_PD_1_DecodeEnqueueAfterTransfer standalone R7 companion invariant test for INV-PD-1 (DecodeEnqueueTime >= TransferCompleteTime) - docs/guide/results.md: clarify DisaggregatedCount includes requests subsequently dropped at decode KV allocation; 
add aggregate-vs-realtime caveat to Load Imbalance Ratio note; add 2N JSON rows note explaining PD mode produces two sub-request rows per original request - sim/cluster/parent_request.go: document CompletionTime dual semantics (actual decode completion vs. decode routing event time for dropped parents) - docs/guide/cluster.md: add sizing guidance for high DroppedUnservable in PD mode Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(sim/cluster): convergence review fixes for PD metrics PR (Round 4) - docs/guide/cluster.md: correct misleading "fails silently" text; topology validation calls logrus.Fatalf and exits loudly, not silently - sim/cluster/disaggregation_test.go:190: use cs.ParentRequests() public accessor instead of cs.parentRequests unexported field (BDD consistency) Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(cmd): fix LoadImbalanceRatio sentinel check in printPDMetrics The sentinel for "one pool idle" in pd_metrics.go is math.MaxFloat64 (finite), not math.Inf(1). math.IsInf(math.MaxFloat64, 1) returns false, so the "inf (one pool idle)" display branch was unreachable after the Round 3 fix. Replace math.IsInf with pd.LoadImbalanceRatio == math.MaxFloat64 to precisely match the sentinel and restore correct output formatting. Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * test(cmd): add TestPrintPDMetrics_LoadImbalanceRatio_OnePoolIdle for sentinel check Adds behavioral test verifying that printPDMetrics outputs "inf (one pool idle)" when LoadImbalanceRatio == math.MaxFloat64 (the BC-10 sentinel). Ensures the sentinel check introduced in aa6a7b9 (== math.MaxFloat64) is exercised and will catch regressions. Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * test(sim/cluster): tighten LoadImbalanceRatio sentinel invariant test to use exact == math.MaxFloat64 The ZeroMinGuard test used `< math.MaxFloat64/2` as the failure condition, which accepts any large float >= MaxFloat64/2. 
Changed to `!= math.MaxFloat64` so the test precisely matches the sentinel check in printPDMetrics (BC-10). Co-Authored-By: Claude Sonnet 4.6 <[email protected]> --------- Co-authored-by: Claude Sonnet 4.6 <[email protected]>
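The sentinel bug described in the last three commits is easy to reproduce in isolation. In this sketch the max/min definition of the ratio, the numeric output format, and all function names are assumptions; the `math.MaxFloat64` sentinel, the `math.IsInf` mismatch, and the "inf (one pool idle)" label come from the commit messages.

```go
package main

import (
	"fmt"
	"math"
)

// idlePoolSentinel is the finite value used to mean "one pool idle".
const idlePoolSentinel = math.MaxFloat64

// loadImbalanceRatio is an assumed definition (max throughput over min
// throughput) with a guarded division that returns the sentinel.
func loadImbalanceRatio(prefillTput, decodeTput float64) float64 {
	lo, hi := math.Min(prefillTput, decodeTput), math.Max(prefillTput, decodeTput)
	if lo == 0 {
		return idlePoolSentinel
	}
	return hi / lo
}

// formatWithIsInf reproduces the Round 3 check: MaxFloat64 is finite, so the
// "inf" branch can never be taken for the sentinel.
func formatWithIsInf(ratio float64) string {
	if math.IsInf(ratio, 1) {
		return "inf (one pool idle)"
	}
	return fmt.Sprintf("%.2f", ratio)
}

// formatWithSentinel compares against the sentinel exactly, as in the fix.
func formatWithSentinel(ratio float64) string {
	if ratio == idlePoolSentinel {
		return "inf (one pool idle)"
	}
	return fmt.Sprintf("%.2f", ratio)
}

func main() {
	r := loadImbalanceRatio(120, 0)
	fmt.Println("IsInf on sentinel:", math.IsInf(r, 1)) // false
	fmt.Println("exact match:", formatWithSentinel(r))
	fmt.Println("balanced pools:", formatWithSentinel(loadImbalanceRatio(100, 50)))
}
```

Exact equality on a float is safe here because the sentinel is assigned, never computed; any arithmetic on it would call for a different marker such as a separate boolean field.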

* feat(trace): add PD disaggregation trace record types (BC-PD-17) - Add DisaggregationRecord, PrefillRoutingRecord, DecodeRoutingRecord, KVTransferRecord - Add Disaggregations, PrefillRoutings, DecodeRoutings, KVTransfers slices to SimulationTrace - Add RecordDisaggregation, RecordPrefillRouting, RecordDecodeRouting, RecordKVTransfer methods - Update NewSimulationTrace constructor to initialize all new slices Co-Authored-By: Claude <[email protected]> * feat(cluster): instrument DisaggregationDecisionEvent with trace recording (BC-PD-18) - Record DisaggregationRecord for every disaggregation decision when trace enabled - Non-disaggregated mode: no disaggregation records (BC-PD-18) - Update ClusterEvent.Priority() comment to include full PD priority range (4-7) - Add TestPDTrace_NonDisaggMode_NoDisaggRecords and TestPDTrace_DisaggMode_DisaggDecisionRecorded Co-Authored-By: Claude <[email protected]> * feat(cluster): instrument PD event handlers with trace recording (BC-PD-17, BC-PD-19) - Instrument PrefillRoutingEvent with PrefillRoutingRecord + counterfactual support - Instrument DecodeRoutingEvent with DecodeRoutingRecord + counterfactual + KVTransferRecord - KVTransferRecord recorded in DecodeRoutingEvent so DecodeInstanceID is fully populated - Add TestPDTrace_DisaggMode_AllRecordTypesPresent (BC-PD-17) - Add TestPDTrace_DisaggMode_Counterfactual (BC-PD-19) Co-Authored-By: Claude <[email protected]> * docs: update CLAUDE.md trace/ descriptions for PR4 PD trace records Co-Authored-By: Claude <[email protected]> * fix(cluster,trace): address convergence review findings (Round 1) R1: increment droppedKVAllocations counter when AllocateTransferredKV fails in DecodeRoutingEvent — silent drop violated INV-PD-3. R5: move DecodeInstanceID/DecodeEnqueueTime assignment to after successful KV allocation — eliminates stale state on failure path. 
trace: extend TraceSummary with PD fields (DisaggregationCount, DisaggregatedCount, KVTransferCount, MeanTransferDuration) so --summarize-trace output is meaningful for disaggregated runs. test: add TestPDTrace_DisaggMode_Cardinality invariant test (R7) asserting len(PrefillRoutings)==len(KVTransfers)==len(DecodeRoutings). test: remove fragile hardcoded count from TransferConservation test — assert conservation law only, not absolute count. docs: update extension-recipes.md to mention pd_events.go as second hook site for PD-specific trace records. docs: add PD trace record descriptions to docs/guide/cluster.md Decision Tracing section. Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(cluster,trace): address convergence review findings (Round 2) - Print PD summary fields in --summarize-trace output (cmd/root.go) - Add DroppedKVAllocations() accessor to ClusterSimulator (R1 testability) - Add INV-PD-1..5 to docs/contributing/standards/invariants.md - Fix cardinality invariant test comment: general law uses DisaggregatedCount not len(Disaggregations) - Add R7 invariant test for DroppedKVAllocations counter (zero under ample capacity) - Add R7 invariant tests for new TraceSummary PD fields in summary_test.go Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * docs: fix extension recipe for trace records (Round 3 fixes) - Correct KVTransferRecord hook location: recorded in DecodeRoutingEvent (not KVTransferStartedEvent) because DecodeInstanceID is only known at decode routing time - Add missing step 6: update --summarize-trace output block in cmd/root.go Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * docs: add PR4 micro-plan for PD trace instrumentation Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(traces): address post-submission review findings - Fix INV-PD-1 verification citation: phase-causality tests live in disaggregation_test.go, not pd_traces_test.go - Surface DroppedKVAllocations in Anomaly Counters CLI output (R1: no 
silent data loss; visible even without --summarize-trace) - Add TransferStartTime > 0 assertion in TestPDTrace_DisaggMode_AllRecordTypesPresent (R7: invariant test for timestamp completeness) Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(traces): address convergence review findings (Round 4) - Add DroppedKVAllocations field to RawMetrics and update CollectRawMetrics signature to propagate PD decode-OOM counter through the standard metrics API (was only accessible via cs.DroppedKVAllocations() accessor) - Add TestCollectRawMetrics_DroppedKVAllocations invariant test - Add defensive clamp for TransferDuration (INV-PD-4 guarantees non-negative, but clamp to 0 if ordering invariant is ever violated) - Document Scores map semantics in PrefillRoutingRecord/DecodeRoutingRecord (higher=more preferred, raw weighted-scorer output, not normalized) - Document PD trace activation conditions in extension-recipes.md (both --trace-level and pool flags required) - Document pool-filtered snapshot requirement for counterfactual computation Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(traces): address convergence review findings (Round 5) - Add logrus.Warnf when defensive TransferDuration clamp triggers, making INV-PD-4 ordering violations detectable in logs (R1: no silent data loss) - Add PD trace usage example with expected output to docs/guide/cluster.md, including guidance on interpreting Disaggregation Decisions, KV Transfers, and Mean Transfer Duration metrics Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(traces): address PR review findings (comment accuracy, doc invariants) - Fix INV-PD-4 comment in pd_events.go: transfer_start ≤ transfer_complete is guaranteed by timestamp sequencing (duration >= 1µs), not priority ordering - Fix R15 stale PR reference in cluster_event_test.go test docstring - Add per-request ordering note to DisaggregationDecisionEvent docblock - Add cross-record invariant doc to DisaggregationRecord (paired records 
guarantee) - Add Regret >= 0 invariant annotation to PrefillRoutingRecord and DecodeRoutingRecord - Add TransferDuration enforcement-location note to KVTransferRecord - Clarify TargetDistribution scope in TraceSummary (standard routing only, not PD pool routing) - Update INV-PD-4 verification in invariants.md to correctly describe timestamp sequencing Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(pd): address PR review findings — validation, diagnostics, tests, docs Critical fixes: - Fix stale DisaggregationDecisionLocal reference in extension-recipes.md - Fix incorrect pool topology troubleshooting (== → <=) in cluster.md - Fix misattributed population site comments in cluster.go Important improvements: - Add R3 validation for PD transfer parameters (PDKVBytesPerToken, PDTransferBandwidthGBps, PDTransferBaseLatencyMs) in NewClusterSimulator - Add pool-filtered snapshot guards with specific panic messages (I3) - Add post-simulation INV-PD-3 transfer conservation check - Add post-simulation diagnostics for orphaned pending completions - Add DroppedAtDecodeKV field to PDMetrics for mid-pipeline drop visibility - Surface DroppedAtDecodeKV in CLI output when > 0 - Add test for decode KV allocation failure path with INV-1 conservation - Add test for negative transfer duration clamp (INV-PD-4 defensive path) - Fix RawMetrics and bundle.go var block alignment inconsistencies Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> --------- Co-authored-by: Claude <[email protected]>
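Two of the trace-side safeguards in this PR, the defensive TransferDuration clamp and the record-cardinality invariant, could look roughly like this. The function names and the use of the standard `log` package (the project logs through logrus) are assumptions made to keep the sketch dependency-free.

```go
package main

import (
	"fmt"
	"log"
)

// clampTransferDuration returns complete-start in microseconds. If the
// ordering invariant is ever violated it warns and clamps to zero instead
// of recording a negative duration silently.
func clampTransferDuration(startMicros, completeMicros int64) (duration int64, clamped bool) {
	duration = completeMicros - startMicros
	if duration < 0 {
		log.Printf("warning: transfer complete (%d) precedes start (%d); clamping duration to 0",
			completeMicros, startMicros)
		return 0, true
	}
	return duration, false
}

// checkTraceCardinality asserts the invariant from the cardinality test:
// len(PrefillRoutings) == len(KVTransfers) == len(DecodeRoutings).
func checkTraceCardinality(prefillRoutings, kvTransfers, decodeRoutings int) error {
	if prefillRoutings != kvTransfers || kvTransfers != decodeRoutings {
		return fmt.Errorf("trace cardinality mismatch: prefill=%d transfers=%d decode=%d",
			prefillRoutings, kvTransfers, decodeRoutings)
	}
	return nil
}

func main() {
	d, clamped := clampTransferDuration(1_000, 2_057)
	fmt.Println("duration:", d, "clamped:", clamped)
	fmt.Println("cardinality:", checkTraceCardinality(8, 8, 8))
}
```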

…ion (PR5) (#620) * feat(sim): add PrefixThresholdDecider and DisaggregationObserver Adds PrefixThresholdDecider: disaggregates when non-cached token count exceeds threshold. Maintains a router-side PrefixCacheIndex under a single globalVirtualInstance key to track cluster-wide prefix knowledge. Also adds DisaggregationObserver interface for stateful deciders that learn from routing decisions (ObserveRouting called synchronously by ClusterSimulator after each routing decision). Implements cachedHashes/cachedReqID pattern (mirrors routing_prefix_scorer.go) to avoid double-hashing between Decide() and ObserveRouting(). Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * feat(sim): register prefix-threshold in validDisaggregationDeciders Adds "prefix-threshold" to the valid disaggregation decider names in bundle.go, enabling IsValidDisaggregationDecider() and ValidDisaggregationDeciderNames() to recognize the new decider. Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * feat(sim/cluster,cmd): add PDPrefixThreshold config field and CLI flag Adds PDPrefixThreshold int to DeploymentConfig for the prefix-threshold decider's non-cached token threshold. Adds --pd-prefix-threshold CLI flag (default 512) with >= 0 validation. Updates --pd-decider description to include "prefix-threshold". Wires PDPrefixThreshold into the config construction in cmd/root.go. 
Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * feat(sim/cluster): wire PrefixThresholdDecider and DisaggregationObserver - cluster.go: branch on PDDecider=="prefix-threshold" to construct PrefixThresholdDecider(PDPrefixThreshold, BlockSizeTokens) directly; add notifyDisaggregationObserver helper - cluster_event.go: call notifyDisaggregationObserver after RoutingDecisionEvent injection (standard routing path, BC-PD-28) - pd_events.go: call notifyDisaggregationObserver after PrefillRoutingEvent injection (disaggregated path, BC-PD-28) - disaggregation_test.go: add integration tests verifying wiring, high/low threshold behavior, observer call, and transfer conservation Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * chore: update CLAUDE.md for PrefixThresholdDecider and --pd-prefix-threshold Documents DisaggregationObserver interface, PrefixThresholdDecider, and PDPrefixThreshold in the file organization table. Adds --pd-prefix-threshold to the CLI flags list and the disaggregated data flow CLI flags summary. 
Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(sim): convergence review fixes for prefix-threshold decider PR Round 1 fixes: - Update DisaggregationObserver docstring with R13 note and R17 signal freshness guarantee - Add noopDisaggregationObserver to satisfy R13 (>=2 implementations) - Add explicit comment in DecodeRoutingEvent explaining why observer is intentionally not called on the decode sub-path - Add 'Adding New Disaggregation Deciders' section to extension-recipes.md covering both stateless and stateful patterns - Fix Decide() docstring: clarify hash-reuse fires only on non-disaggregated path (disaggregated path receives prefill sub-request with different ID) Round 2 fixes: - Remove structural TestPrefixThreshold_DeciderWiredCorrectly; replace with compile-time interface assertion (behavioral coverage exists in ZeroThresholdAlwaysDisaggregates and HighThresholdNoDisaggregation) - Add PDPrefixThreshold field to newTestDisaggDeploymentConfig (R4 construction site audit) - Add PD disaggregation row to CLI Flag Summary table in configuration.md including --pd-prefix-threshold and all PD flags introduced in PR1-PR4 Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(sim): convergence review fixes for prefix-threshold decider PR (Round 3) - Add logrus.Warn when --pd-prefix-threshold is explicitly set but --pd-decider is not "prefix-threshold" (silent flag ignored, R3 spirit) - Add PD Disaggregation section to docs/reference/configuration.md explaining decider options, prefix-threshold semantics (non-cached tokens vs total tokens), default value meaning, and all PD flags Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(sim/cluster): strengthen BC-PD-28 test to verify observer cache-warming effect TestPrefixThreshold_ObserverWarmsCache replaces the previous TestPrefixThreshold_ObserverCalledAfterRouting which only checked causality invariants (TransferCompleteTime != 0) and did not verify that the DisaggregationObserver actually 
warmed the prefix cache. The new test uses two requests with a shared 192-token prefix (threshold=150): req1 disaggregates (192 > 150 with empty cache), observer records 12 blocks, req2 arrives later with the same prefix (58 non-cached tokens ≤ 150) and is routed locally — proving the observer was called and the cache was warmed. Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * docs: fix extension-recipes step count and remove broken configuration.md cross-link - extension-recipes.md: add missing step 4 (update configuration.md table) and fix touch-point count from 3 to 4 for stateless disaggregation deciders - configuration.md: replace broken cross-link to non-existent architecture.md#pd-disaggregation with descriptive text documenting the pool topology constraint (prefill + decode <= num-instances) Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix(sim): restore R1 compliance and state mutation order in DecodeRoutingEvent - Restore droppedKVAllocations counter and DroppedKVAllocations() accessor (R1: count dropped requests — never silent) - Move AssignedInstance/DecodeInstanceID/DecodeEnqueueTime assignment to after successful AllocateTransferredKV check to prevent inconsistent state on failure - Restore DroppedKVAllocations() in cmd/root.go anomaly counter output - Add INV-PD-1 through INV-PD-5 back to invariants.md for DRY compliance (these were in CLAUDE.md but not the canonical standards doc) Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix: address review findings — silent config failure, stale docs, structural tests Critical: Add CLI validation that --pd-decider other than "never" requires --prefill-instances and --decode-instances to be set (R1: no silent config failure). 
Important: - Remove superseded disaggregation decider recipe (incorrect field names, wrong flag names, stale purity contract) from extension-recipes.md - Add prefix-threshold to first PD flags table in configuration.md - Replace internal field accesses (cs.parentRequests, cs.transfersInitiated, cs.trace, cs.droppedAtDecodeKV) with public accessors (cs.ParentRequests(), cs.Trace(), cs.DroppedKVAllocations()) in disaggregation_test.go for refactor survival (BDD/TDD principle #5) Also includes minor comment condensations from code-simplifier agent. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs(guide): document Dropped KV Allocations anomaly counter Add the new PD-mode anomaly counter to the results guide so users understand what it means when decode KV allocation fails. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> --------- Co-authored-by: Claude Sonnet 4.6 <[email protected]>
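The cache-warming scenario from TestPrefixThreshold_ObserverWarmsCache (192-token shared prefix, threshold 150, 12 blocks, 58 non-cached tokens on the second request) can be traced with a small stand-in decider. Everything below is an illustrative sketch: the FNV chained block hash, the 16-token block size implied by 192/12, and the method signatures are assumptions, not the project's PrefixCacheIndex.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// blockHashes returns one chained hash per full block, so a block's hash
// depends on every token before it (prefix identity, not content alone).
func blockHashes(tokens []int, blockSize int) []uint64 {
	var hashes []uint64
	var prev uint64
	var buf [8]byte
	for start := 0; start+blockSize <= len(tokens); start += blockSize {
		h := fnv.New64a()
		binary.LittleEndian.PutUint64(buf[:], prev)
		h.Write(buf[:])
		for _, t := range tokens[start : start+blockSize] {
			binary.LittleEndian.PutUint64(buf[:], uint64(t))
			h.Write(buf[:])
		}
		prev = h.Sum64()
		hashes = append(hashes, prev)
	}
	return hashes
}

// prefixThresholdDecider is a hypothetical stand-in for the real decider.
type prefixThresholdDecider struct {
	threshold int
	blockSize int
	seen      map[uint64]struct{} // cluster-wide set of observed block hashes
}

func newPrefixThresholdDecider(threshold, blockSize int) *prefixThresholdDecider {
	return &prefixThresholdDecider{threshold: threshold, blockSize: blockSize, seen: map[uint64]struct{}{}}
}

// nonCachedTokens counts tokens past the longest cached block prefix.
func (d *prefixThresholdDecider) nonCachedTokens(tokens []int) int {
	cachedBlocks := 0
	for _, h := range blockHashes(tokens, d.blockSize) {
		if _, ok := d.seen[h]; !ok {
			break
		}
		cachedBlocks++
	}
	return len(tokens) - cachedBlocks*d.blockSize
}

// Decide disaggregates when the non-cached token count exceeds the threshold.
func (d *prefixThresholdDecider) Decide(tokens []int) bool {
	return d.nonCachedTokens(tokens) > d.threshold
}

// ObserveRouting records the request's blocks after a routing decision.
func (d *prefixThresholdDecider) ObserveRouting(tokens []int) {
	for _, h := range blockHashes(tokens, d.blockSize) {
		d.seen[h] = struct{}{}
	}
}

// seqTokens builds n consecutive token IDs starting at start.
func seqTokens(start, n int) []int {
	out := make([]int, n)
	for i := range out {
		out[i] = start + i
	}
	return out
}

func main() {
	d := newPrefixThresholdDecider(150, 16)
	req1 := seqTokens(0, 192)
	fmt.Println("req1 disaggregate:", d.Decide(req1)) // 192 > 150
	d.ObserveRouting(req1)

	req2 := append(seqTokens(0, 192), seqTokens(10_000, 58)...)
	fmt.Println("req2 non-cached:", d.nonCachedTokens(req2))
	fmt.Println("req2 disaggregate:", d.Decide(req2)) // 58 <= 150
}
```

If the observer call were skipped, the second request would see an empty cache, count all 250 tokens as non-cached, and disaggregate, which is exactly the regression the strengthened test is meant to catch.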

* feat(sim): MaxModelLen enforcement and MaxOutputLen budget (#567) (#579) Add vLLM-equivalent max_model_len enforcement at three layers: 1. Startup validation: ceil(MaxModelLen/BlockSize) <= TotalKVBlocks 2. Enqueue guard: input >= MaxModelLen rejected (matching vLLM serving.py:1542); input + MaxOutputLen > MaxModelLen rejected when client declares budget 3. Runtime stop: force-complete at ProgressIndex >= MaxModelLen (defense-in-depth) Key design decisions: - Oracle Knowledge Boundary (INV-9): control plane never reads OutputTokens. Uses MaxOutputLen (client budget) or input-only check. Runtime stop handles output growth. Verified by behavioral + structural grep tests. - Auto-derive from HF max_position_embeddings for roofline/crossmodel backends, with rope_scaling blacklist (excludes su/longrope/llama3 per vLLM), yarn special-case using original_max_position_embeddings, and KV-feasible capping. - Overflow-safe ceiling division in startup validation (R11). - R3 validation at CLI (logrus.Fatalf) and constructor (panic). New tests (12): BC-1 through BC-5, BC-7 conservation with drops, boundary tests (input==MaxModelLen, exact fit), R3 constructor panic, INV-9 structural enforcement. Partially addresses #529 (reasoning workload livelock) for roofline/crossmodel. Blackbox gap tracked in #578. Closes: #567 Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * feat(latency): MoE-aware roofline latency model (#559) (#561) * feat(sim): add MoEExpertFFNDim and SharedExpertFFNDim to ModelConfig Two new fields for MoE-aware roofline: per-routed-expert FFN dimension and total shared-expert FFN dimension. Both default to 0 (dense model). Zero-value safe for all existing construction sites (R4 audit: all dense model configs use zero-valued MoE fields). 
Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): parse MoE per-expert and shared-expert dims from HF config Extends GetModelConfigFromHF to parse moe_intermediate_size, shared_expert_intermediate_size, and n_shared_experts. Expert count resolution chain extended to include num_routed_experts (DeepSeek-V3). Implements BC-15 through BC-18 from the MoE roofline design. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): add MoE consistency validation to ValidateRooflineConfig Validates: experts>0 requires active>0, active<=total, non-negative MoE dimensions. Catches inconsistent MoE configs at construction time. Implements BC-12, BC-13, BC-14 from MoE roofline design. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(sim): address convergence review findings (I-1, I-2) I-1: Align SharedExpertFFNDim JSON tag to shared_expert_intermediate_size (matches HF config field name convention, consistent with other tags). I-2: Add negative NumLocalExperts validation in ValidateRooflineConfig (R3 compliance — all numeric parameters validated). Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): MoE-aware FLOPs, active weight bandwidth, and smoke tests MoE FLOPs (Task 4): calculateTransformerFlops now computes routed (top_k), shared, and gate MLP FLOPs for MoE models. Dense models use unchanged code path (NumLocalExperts=0 guard). Active weights (Task 5): calculateMemoryAccessBytes uses top_k (active experts) for per-step weight bandwidth, matching vLLM's fused_moe kernel behavior. Includes shared expert and gate weights. Smoke tests (Task 7): Mixtral-8x7B and DeepSeek-V3 step time smoke tests plus dense regression anchor (TP=1=12151µs, TP=2=6820µs). Implements BC-1 through BC-6, BC-10 from MoE roofline design. 
Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): use per-expert FFN dim for MoE KV capacity weight estimation Fixes the critical bug where DeepSeek-V3's general intermediate_size (18432) was used as per-expert dim (should be 2048), overestimating MLP weights by ~9× and returning zero usable KV blocks. Changes: - KVCapacityParams gains MoEExpertFFNDim and SharedExpertFFNDim fields - NewKVCapacityParams gains 2 new positional args (R4 enforced) - computeModelWeightBytes uses per-expert dim when nonzero, falls back to IntermediateDim (Mixtral convention) - ExtractKVCapacityParams propagates new fields, extends expert count chain to include num_routed_experts (parity with GetModelConfigFromHF) Implements BC-7 (per-expert dim fix), BC-9 (param cross-validation), BC-11 (dense unchanged). Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): convergence review round 2 — R23 parity, documentation, R15 I-1: Align expert count resolution threshold between GetModelConfigFromHF and ExtractKVCapacityParams. Both now use >1 threshold (single-expert models are dense-equivalent). Fixes R23 code path parity violation. I-2: Add precondition comments to calculateTransformerFlops and calculateMemoryAccessBytes documenting ValidateRooflineConfig requirement. I-3: Document SharedExpertFFNDim "total dim" semantics — correct due to SwiGLU linearity (N × (3 × d × e) == 3 × d × (N × e)). I-4: Add R15 staleness notes to hardening-validation-cleanup-plan.md and pr2-kv-capacity-auto-calculate-plan.md (NewKVCapacityParams now 6-arg). I-5: Document active vs total weight distinction in calculateMemoryAccessBytes to prevent future R23 regression. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): align MoE threshold to > 1 across all consumption paths (R23) Parsing layer already used > 1 (single-expert models are dense-equivalent). 
Consumption paths (calculateTransformerFlops, calculateMemoryAccessBytes, crossmodel isMoE, ValidateRooflineConfig, computeModelWeightBytes) now use > 1 as well, matching the documented design intent and resolving the R23 code path parity violation. Also fixes stale doc comment in ExtractKVCapacityParams ("> 0" → "> 1"). Round 3 convergence review fixes. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(cmd): update stale MoE warning, gofmt alignment, R15 crossmodel plan - cmd/root.go: Replace misleading "assumes dense transformers" warning with accurate MoE info message (roofline now models per-expert FLOPs) - sim/model_hardware_config.go: Run gofmt to fix struct field alignment - docs/plans/pr472b-crossmodel-backend-plan.md: Add R15 staleness note for threshold change (> 0 → > 1) Round 4 convergence review fixes. Part of #559 Co-Authored-By: Claude Opus 4.6 <[email protected]> * refactor(latency): port llm-optimizer single-crossover roofline physics Replace dual-ceiling model (GEMM + vector ceilings) with single-crossover: step_time = max(total_flops / (peak * MFU), total_bytes / peak_bandwidth) Remove bandwidth haircut (BwEffConstant no longer used in step time). Remove all overhead terms (TOverheadMicros, PerLayerOverhead, AllReduceLatency). Keeps BLIS's superior model-awareness: actual IntermediateDim, SwiGLU 3-matrix MLP, MoE support, FlashAttention-aware memory model. Motivation: BLIS roofline has 215% ITL MAPE vs llm-optimizer's 36.5%. The dual ceiling + bandwidth haircut + overhead stacking caused ~3x systematic over-prediction for memory-bound decode steps. Design: docs/plans/2026-03-09-roofline-llm-optimizer-port-design.md Co-Authored-By: Claude Opus 4.6 <[email protected]> * config: update MFU values to llm-optimizer defaults (0.45/0.30) MfuPrefill: 0.65 → 0.45, MfuDecode: 0.12 → 0.30 for all GPU entries. These values match llm-optimizer's defaults which achieve 36.5% ITL MAPE on the sim-to-real evaluation (discussion #522). 
Other HardwareCalib fields (BwEffConstant, overheads) remain unchanged for backward compatibility — they are no longer used by rooflineStepTime() but may be consumed by other callers. Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: add roofline llm-optimizer port design and implementation plan Design doc: decision record for porting llm-optimizer's single-crossover roofline physics into BLIS. Implementation plan: 3 tasks (physics rewrite, MFU update, verification). Motivation: discussion #522 sim-to-real accuracy validation. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): load weights once per step in roofline (unified forward pass) vLLM chunked prefill processes all tokens (prefill + decode) in a single forward pass — weights are loaded from HBM once per step, not once per phase. The previous implementation loaded weights independently for prefill and decode phases, doubling the memory-bound term for mixed batches (~2x over-prediction). Sources: vLLM V1 blog ("all selected requests are flattened and concatenated into one long super-sequence for that single forward pass"), Sarathi-Serve OSDI'24 ("cost of loading model weights from HBM is amortized across all prompts in a batch"). Adds TestRooflineStepTime_MixedBatch_WeightsLoadedOnce which verifies the overhead of adding prefill to a decode step is much less than a full weight load (7µs vs 4166µs). Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): use 2-matrix MLP in roofline FLOPs and weight calculation Change MLP factor from 3 (SwiGLU gate+up+down) to 2 (up+down) in both calculateTransformerFlops and calculateMemoryAccessBytes, matching llm-optimizer's formulation. For models like Llama-2-70B where IntermediateDim=28672, the 3-matrix formula produced 31% more MLP weight bytes than llm-optimizer's 2-matrix formula, directly inflating memory-bound decode predictions. Applies to both dense and MoE paths (routed + shared expert FLOPs/weights). 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * config: bump MFU values to 0.55/0.35 to reduce roofline over-prediction MfuPrefill: 0.45 → 0.55 (reduces compute-bound prefill/TTFT predictions ~18%) MfuDecode: 0.30 → 0.35 (reduces near-crossover decode predictions ~14%) Motivation: after porting llm-optimizer single-crossover physics, BLIS roofline still over-predicts by ~50% MAPE. Higher MFU reflects observed H100 tensor core utilization for large prefill GEMMs and batched decode. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): restore SwiGLU 3-matrix MLP, revert MFU bump Revert MFU values to llm-optimizer defaults (0.45/0.30) — the bump to 0.55/0.35 went the wrong direction (both models under-predict). Restore 3-matrix MLP (gate + up + down) for SwiGLU, replacing the 2-matrix formula copied from llm-optimizer. SwiGLU actually has 3 weight matrices that all need HBM loading: this is the physically correct formula and increases weight bytes by ~37%, which reduces the under-prediction from ~50% toward the target. Dense and MoE paths both updated consistently (R23). Co-Authored-By: Claude Opus 4.6 <[email protected]> * feat(latency): conditional SwiGLU detection via HiddenAct field Add mlpMatrixCount() helper that returns 3 for SwiGLU (silu/swiglu/geglu) or 2 for standard (gelu/relu) MLP. Parsed from HF config's hidden_act field. Empty defaults to SwiGLU since most modern LLMs use it. Both calculateTransformerFlops and calculateMemoryAccessBytes now use nMat instead of hardcoded 3, correctly handling non-SwiGLU models. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): revert to 2-matrix MLP convention matching llm-optimizer 3-matrix with raw intermediate_size over-predicts for models like Llama2-70B whose intermediate_size (28672) exceeds the standard SwiGLU (2/3 × 4d) convention. Using nMat=2 matches llm-optimizer's approach where 2 × d × intermediate ≈ physical weight count for most models. 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): remove MoE-specific branches from roofline step time Roofline now treats MoE models identically to dense (matching llm-optimizer which has no MoE-specific handling). MoE fields (NumLocalExperts, MoEExpertFFNDim, SharedExpertFFNDim) are still used by KV capacity (kv_capacity.go) for GPU memory budgeting. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): MoE roofline scales weights by E, FLOPs by top_k Mixtral was under-predicted by ~10x because the dense treatment loaded 1 expert's MLP weights instead of all 8. Fix: - Weight bandwidth: E × MLP weights (all experts loaded from HBM per step) - FLOPs: top_k × MLP FLOPs (only active experts compute per token) Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): use MoEExpertFFNDim in roofline when set For DeepSeek-V3 style models where intermediate_size (18432) differs from per-expert dim (2048), use MoEExpertFFNDim for MoE weight and FLOP calculations. Falls back to IntermediateDim when unset (Mixtral). 
Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix(latency): address PR #561 review — revert crossmodel scope, fix docs - Revert crossmodel MoE threshold from > 1 back to > 0 (scope violation: crossmodel behavioral change doesn't belong in a roofline PR) - Fix design doc table and CLI comment claiming roofline models shared experts and gate FLOPs (it doesn't — only KV capacity does) - Fix HiddenAct comments that incorrectly claim it selects 3-matrix vs 2-matrix MLP (mlpMatrixCount always returns 2) - Document intentional 2-matrix (roofline) vs 3-matrix (KV capacity) design choice with cross-references in both files Co-Authored-By: Claude Opus 4.6 <[email protected]> --------- Co-authored-by: Claude Opus 4.6 <[email protected]> Co-authored-by: Srinivasan Parthasarathy <[email protected]> * fix(sim): PR #567 follow-up — validation gaps, LengthCappedRequests counter, INV-9 extension (#580) (#587) - Add negative MaxModelLen validation in NewSimulator (BC-1: defense-in-depth for struct literal bypass) - Add LengthCappedRequests metric counter across 5-file pattern (BC-2, BC-3, BC-4) - Add end-to-end sim.Run() test for BC-5 runtime length cap path - Extend INV-9 structural test to scan sim/cluster/ control-plane files (BC-6) - Add negative MaxOutputLen validation in EnqueueRequest (BC-7: R3 gap) - Add gemma3 model_type exclusion for rope_scaling (BC-9: matches vLLM) - Add rope_scaling parse-failure warnings for malformed HF configs (BC-8) - Fix kvFeasibleMax comment accuracy (blockSizeTokens is configurable, not 16) Fixes #580 Co-authored-by: Claude <[email protected]> * refactor(sim): remove dead HardwareCalib fields — BwEffConstant, TOverheadMicros, PerLayerOverhead, AllReduceLatency (#596) These fields became dead code after the roofline physics port (llm-optimizer single-crossover model). No runtime code path reads them; ValidateRooflineConfig enforced BwEffConstant > 0 on a value nothing consumed. 
Removing them eliminates config-file clutter and prevents future contributors from assuming they're active. Fixes #590 Co-authored-by: Claude Opus 4.6 <[email protected]> * Configure claude on GH Actions (#600) Signed-off-by: Jing Chen <[email protected]> * Enable claude on PRs (#601) Signed-off-by: Jing Chen <[email protected]> * ignore training and actions runner (#607) Signed-off-by: Srinivasan Parthasarathy <[email protected]> * fix(sim): PR #580 deferred items — rope_scaling extraction, MaxModelLen int64, tests, docs (#606) Complete 7 deferred hardening items from issue #580 (PR #587 handoff): 1. Extract applyRopeScaling as a pure function with 26 table-driven test cases covering blacklist (su/longrope/llama3), mrope fall-through, gemma3 substring match (handles text_config pivot), yarn original base, overflow guards, NaN/Inf defense, degenerate inputs. 2. Change MaxModelLen from int to int64 for consistency with ProgressIndex, TotalKVBlocks, BlockSizeTokens. Updates 6 type sites, removes redundant int64() casts, adds int64() widening at EnqueueRequest comparison sites. 3. Add cluster-mode MaxModelLen drop test (BC-6): Guard 1a (input >= limit) and Guard 1b (input + budget > limit), INV-1 conservation, inFlightRequests drain, Metrics.Requests map cleanup. 4. Add chunked prefill + MaxModelLen interaction test (BC-7): verifies no spurious force-completion during multi-chunk prefill (TotalOutputTokens=49, LengthCappedRequests=0, TTFT recorded). 5. Add glossary entries for MaxModelLen and Oracle Knowledge Boundary (INV-9). 6. Refine rope_scaling documentation with explicit blacklist details. 7. Fix pre-existing gemma3 bug: ParseHFConfig's text_config pivot overwrites model_type from "gemma3" to "gemma3_text", making the exact-match check dead code. Changed to strings.Contains to match vLLM's substring semantics. Related to #580. Discovered issues: #602, #603, #604, #605. 
Co-authored-by: Claude <[email protected]> * feat(latency): add trained-roofline backend with roofline basis functions × learned corrections (#616) * feat(latency): register trained-roofline backend name (BC-1) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(latency): add PostDecodeFixedOverhead to interface + implement TrainedRooflineLatencyModel (BC-3,6,7,8,9,11,15) - Add PostDecodeFixedOverhead() int64 to LatencyModel interface - Existing backends (blackbox, roofline, crossmodel) return 0 - Simulator recordRequestCompletion adds PostDecodeFixedOverhead to E2E - TrainedRooflineLatencyModel: 6 roofline basis functions with learned corrections - Zero heap allocations in StepTime (19ns/op, 0 allocs/op) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(latency): wire trained-roofline factory in NewLatencyModel (BC-2,13,14) - Full validation: TP, NumLayers, NumHeads, HiddenDim, IntermediateDim, TFlopsPeak, BwPeakTBs, NumHeads%TP, NumKVHeads%TP, 7 beta coefficients - Derives architecture features at construction: headDim, dKV, dFF, kEff - Table-driven error tests for all validation paths Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * test(latency): add monotonicity behavioral tests for trained-roofline (BC-4,5) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(latency): add trained-roofline defaults + CLI loading (BC-10,12) - Add trained_roofline_defaults section to defaults.yaml with 7 betas + 3 alphas - Add TrainedRooflineDefaults struct to cmd/default_config.go - CLI handling: 4 sites in cmd/root.go (loading block, zero-coefficients guard, HFConfig parsing, help text) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: add trained-roofline to latency model documentation - CLAUDE.md: "Four modes", file tree, Key Data Flow - sim/config.go: Backend field comment - sim/latency/latency.go: package doc - docs/concepts/core-engine.md: "four latency model backends" - 
docs/concepts/glossary.md: "Four modes" + trained-roofline description - Plan committed alongside implementation Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(sim): guard PostDecodeFixedOverhead for zero-output requests + fix ITL contamination - PostDecodeFixedOverhead only applied when len(OutputTokens) > 0 - RequestITLs computed from itlSum directly (not lat-FirstTokenTime) to avoid contaminating per-token average ITL with fixed overhead - Add zero-alpha warning for trained-roofline CLI path Caught by code review Step 4.5. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs(guide): comprehensive trained-roofline section in latency models guide - Add trained-roofline section with formula, alpha model, accuracy caveats - Update comparison table to 4 backends - Update recommendation: trained-roofline is now the default for new models - Update pluggable architecture to show 4 interface methods - Fix cross-model description accuracy Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: convergence Round 1 fixes — extension recipe, slice copy, zero-alloc test, config ref - Extension recipe: 3→4 methods, added bundle.go + CLI wiring touch points, added trained-roofline as 4th example - Factory: defensive copy of beta/alpha slices to enforce "frozen" contract - Test: add TestTrainedRoofline_StepTime_ZeroAllocs using testing.AllocsPerRun - Configuration reference: add trained-roofline to --latency-model flag description Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: add trained-roofline to quickstart + document non-blocking overhead pattern - Quickstart: add trained-roofline example (recommended for new models) - recordRequestCompletion: document that E2E includes non-blocking PostDecodeFixedOverhead and OutputTokenProcessingTime, explaining why RequestCompletionTimes exceeds RequestLeftEvent timestamp Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: self-audit 
— update models.md, roofline.md, tutorial.md for trained-roofline All documentation working copies now mention trained-roofline consistently. Source-of-truth map: 12/12 working copies updated. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> --------- Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> * feat(sim): populate MaxOutputLen on all workload paths + engine auto-fill (#621) * feat(sim): add MaxOutputLen auto-fill in EnqueueRequest (BC-1..BC-4) - Auto-fill MaxOutputLen = maxModelLen - len(InputTokens) when client omits budget (MaxOutputLen==0) and maxModelLen > 0 - Mirrors vLLM input_processor.py:554 safety cap - No auto-fill when client sets budget (BC-2), unlimited mode (BC-3), or input exceeds context (BC-4) Refs: #572 Co-Authored-By: Claude <[email protected]> * feat(workload): set MaxOutputLen on all request construction sites (BC-5..BC-7) - generator.go: MaxOutputLen = len(outputTokens) (synthetic/multimodal) - replay.go: MaxOutputLen = len(outputTokens) (trace v2 replay) - reasoning.go: MaxOutputLen = len(outputTokens) (multi-turn reasoning) - Matches inference-perf pattern: max_tokens = sampled output length Fixes #572 Co-Authored-By: Claude <[email protected]> * docs(sim): update EnqueueRequest doc comment for auto-fill preprocessing Co-Authored-By: Claude <[email protected]> * docs(test): update stale MaxOutputLen=0 comments for auto-fill semantics - Three tests referenced 'input-only check' for MaxOutputLen=0 - After auto-fill, MaxOutputLen is set to maxModelLen - input - Tests still pass numerically; comments now reflect actual behavior Co-Authored-By: Claude <[email protected]> --------- Co-authored-by: Claude <[email protected]> * docs: switch default example model to public Qwen/Qwen3-14B (#608) * docs: switch default example model to public qwen/qwen2.5-7b-instruct Replace gated meta-llama/llama-3.1-8b-instruct with publicly available qwen/qwen2.5-7b-instruct in all user-facing docs (README, quickstart, 
tutorial, guides, reference, CLAUDE.md, CONTRIBUTING.md). Roofline/crossmodel examples now work without HF authentication. Set qwen default TP=1 in defaults.yaml so examples use the default without explicit --tp flags. Update KV block count, coefficient examples, and prose references to match TP=1 values. Fixes #545 Co-Authored-By: Claude Opus 4.6 <[email protected]> * chore(defaults): update vllm version to v0.11.0 for 4 models (H100 TP=1) Update default and trained-coefficient vllm_version for qwen2.5-7b-instruct, qwen3-14b, llama-3.1-8b-instruct, and qwen2.5-3b-instruct to vllm/vllm-openai:v0.11.0. Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: switch default example model from qwen2.5-7b to qwen3-14b Qwen3-14B (Qwen/Qwen3-14B) is a newer, publicly available model with pre-trained coefficients already in defaults.yaml. Update all documentation examples and references accordingly. Co-Authored-By: Claude Opus 4.6 <[email protected]> * fix: address review comments — stale refs, tutorial throughput - Fix "LLaMA 3.1 8B" comment in experimentation.md (issue #3) - Update stale llama-3.1-8b/132,139 refs in configuration.md (issue #4) - Recalibrate tutorial for qwen3-14b throughput: ~2.5 req/s per instance, target 20 req/s (was 57 req/s / 500 req/s for llama) - Scale experimentation.md example to match (20 req/s, not 400) Co-Authored-By: Claude Opus 4.6 <[email protected]> * docs: add HF_TOKEN tip to quickstart and README for gated models Roofline/trained-roofline/crossmodel modes auto-fetch from HuggingFace, which fails for gated models without authentication. Add a lightweight tip after the first roofline example in both files recommending HF_TOKEN for gated model access and rate limit avoidance. 
Co-Authored-By: Claude Opus 4.6 <[email protected]> --------- Co-authored-by: Claude Opus 4.6 <[email protected]> --------- Signed-off-by: Jing Chen <[email protected]> Signed-off-by: Srinivasan Parthasarathy <[email protected]> Co-authored-by: Srinivasan Parthasarathy <[email protected]> Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]> Co-authored-by: Dipanwita Guhathakurta <[email protected]> Co-authored-by: Jing Chen <[email protected]>
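The enqueue-time MaxModelLen guards and the MaxOutputLen auto-fill described in the commits above can be sketched roughly as follows. This is a minimal illustration and not the repository's code: `Request`, `checkEnqueue`, and the error values are hypothetical names, and treating `maxModelLen == 0` as "unlimited" is an assumption drawn from the commit text.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a hypothetical, trimmed-down request type for illustration only.
type Request struct {
	InputTokens  []int
	MaxOutputLen int64 // 0 means the client declared no output budget
}

// Hypothetical error values; the real code may report drops differently.
var (
	ErrNegativeBudget = errors.New("MaxOutputLen must be non-negative")
	ErrInputTooLong   = errors.New("input length >= MaxModelLen")
	ErrBudgetTooLarge = errors.New("input + MaxOutputLen > MaxModelLen")
)

// checkEnqueue applies the guards in the order the commit messages describe.
// It never looks at generated output tokens, only at the input length and the
// client-declared budget (the INV-9 boundary).
func checkEnqueue(r *Request, maxModelLen int64) error {
	if r.MaxOutputLen < 0 {
		return ErrNegativeBudget
	}
	if maxModelLen <= 0 {
		return nil // assumed "unlimited" mode: no guard, no auto-fill
	}
	inputLen := int64(len(r.InputTokens))
	if inputLen >= maxModelLen {
		return ErrInputTooLong // guard 1a: no room for even one output token
	}
	if r.MaxOutputLen == 0 {
		// Auto-fill: cap the budget at the remaining context window.
		r.MaxOutputLen = maxModelLen - inputLen
		return nil
	}
	// Guard 1b, written as a subtraction so the comparison cannot overflow.
	if r.MaxOutputLen > maxModelLen-inputLen {
		return ErrBudgetTooLarge
	}
	return nil
}

func main() {
	r := &Request{InputTokens: make([]int, 10)}
	err := checkEnqueue(r, 16)
	fmt.Println(err, r.MaxOutputLen)
}
```

The runtime stop at `ProgressIndex >= MaxModelLen` is a separate, later layer and is not shown here.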

) The PD decode-only path in VLLMBatchFormation.FormBatch() used the condition `ProgressIndex >= inputLen` to detect decode sub-requests with pre-allocated KV. For zero-input requests (len(InputTokens)==0), this condition is trivially satisfied, since ProgressIndex(0) >= inputLen(0), so non-PD requests incorrectly took the decode-only path. Add a `ProgressIndex > 0` guard so that the path fires only for PD decode sub-requests, where AllocateTransferredKV explicitly set ProgressIndex to inputLen (which is > 0 for real requests). This ensures the pd branch is a fully transparent drop-in replacement for non-PD users. Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
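A minimal sketch of the before/after predicate, assuming a hypothetical `Request` type with only the two fields the message mentions; the function names are illustrative and do not come from the repository.

```go
package main

import "fmt"

// Request is a hypothetical stand-in carrying only the fields the fix touches.
type Request struct {
	InputTokens   []int
	ProgressIndex int64
}

// decodeOnlyOld mirrors the original condition: ProgressIndex >= inputLen.
func decodeOnlyOld(r Request) bool {
	return r.ProgressIndex >= int64(len(r.InputTokens))
}

// decodeOnlyFixed adds the ProgressIndex > 0 guard described in the commit.
func decodeOnlyFixed(r Request) bool {
	return r.ProgressIndex > 0 && r.ProgressIndex >= int64(len(r.InputTokens))
}

func main() {
	zeroInput := Request{}                                               // fresh non-PD request, no input tokens
	pdDecode := Request{InputTokens: make([]int, 8), ProgressIndex: 8}   // KV already transferred
	fresh := Request{InputTokens: make([]int, 8)}                        // ordinary request before prefill

	for _, r := range []Request{zeroInput, pdDecode, fresh} {
		fmt.Println(decodeOnlyOld(r), decodeOnlyFixed(r))
	}
}
```

Only the zero-input case changes between the two predicates, which is consistent with the claim that non-PD behavior is otherwise untouched.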

…637) * feat(sim/cluster): add PoolOverrides type and ResolvePoolConfig function Pure config resolver for per-pool hardware overrides (BC-P2-1, BC-P2-2). Pointer types for TP/MaxModelLen/TotalKVBlocks (R9). Part of #633 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(sim/cluster): add per-pool override fields and resolveConfigForRole DeploymentConfig gains PrefillOverrides/DecodeOverrides (PoolOverrides). resolveConfigForRole dispatches to ResolvePoolConfig per role. Part of #633 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(sim/cluster): add BuildPoolMembershipFromIndices for pre-construction use Index-based variant of BuildPoolMembership that generates instance IDs using the same naming convention without requiring constructed instances. Existing function retained for backward compatibility (BC-P2-5). Part of #633 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(sim/cluster): per-pool config in NewClusterSimulator instance construction Refactor instance construction to resolve per-pool config using resolveConfigForRole. Pool membership computed from indices before instances (INV-P2-1). Reuses prePoolMembership to avoid redundant BuildPoolMembership call. Part of #633 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(cmd): add per-pool CLI flags for hardware overrides 8 new flags: --prefill-tp, --decode-tp, --prefill-hardware, --decode-hardware, --prefill-latency-model, --decode-latency-model, --prefill-max-model-len, --decode-max-model-len. Uses cmd.Flags().Changed() for R18 flag precedence. Part of #633 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(cmd): per-pool KV auto-calculation for analytical backends When per-pool TP/GPU differs from global and analytical backend is active, CalculateKVBlocks is called per-pool with pool-specific parameters (BC-P2-4). Override construction moved before analytical backend block for scope access. 
Part of #633 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * test(sim/cluster): INV-P2-1 invariant test and heterogeneous cluster helper Verifies pool-config consistency with heterogeneous KV capacity. Adds newHeterogeneousDeploymentConfig helper for future PR consumption. Part of #633 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: update CLAUDE.md with per-pool config files and CLI flags Add resolve.go to file organization, per-pool CLI flags to root.go description, INV-P2-1 to invariants section, per-pool hardware flags to disaggregated data flow section. Part of #633 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: address convergence review findings for per-pool hardware config - R3: validate per-pool TP (>0), MaxModelLen (>0), and latency backend names at CLI boundary before passing to cluster construction - Fix comment typo: decodeLatencyModel was labeled "prefill pool" - Move ValidatePoolTopology before BuildPoolMembershipFromIndices to fail fast before instance allocation - Warn when per-pool flags are set but PD mode is not active - Enhance TestINV_P2_1 to verify per-instance KV capacity via observable FreeKVBlocks() before simulation (not just post-sim completion) - Add TestResolvePoolConfig_Idempotent as R7 companion invariant test - Document struct-copy safety and latency backend constraint in resolve.go Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: address convergence review findings for per-pool hardware config - Add INV-P2-1 (Pool-Config Consistency) to canonical invariants.md - Add Per-Pool Hardware Overrides section to configuration.md with flag table, KV auto-calc explanation, known limitation, and CLI example - Update CLI flag summary table to include all 8 per-pool flags - Fix stale "homogeneous instances" note in cluster.md guide Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: address PR review findings — error returns, stale 
comments, R23/R11 guards - C1/C2: Run() now returns error on INV-PD-3 violation instead of warn-and-continue; negative inFlightRequests logged at Error level - C3: DeploymentConfig docstring updated to reflect per-pool overrides - C4: NewLatencyModel docstring includes trained-roofline; stale backend enumerations and priority comments fixed - I1: MoE threshold in trained-roofline factory changed from > 0 to > 1 (R23 parity) - I5: KVUtilization() division guard added (R11) - I12-I14: Fixed misleading DisaggregationDecisionEvent ordering comment, incorrect R9 rationale, and wrong ParentRequest distinction method - S7: Compile-time DisaggregationObserver interface check for PrefixThresholdDecider Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: R23 crossmodel MoE threshold parity (> 0 → > 1) Crossmodel backend used NumLocalExperts > 0 for MoE detection while roofline, trained-roofline, and kv_capacity all used > 1. Single-expert models (NumLocalExperts=1) are dense-equivalent and should not trigger MoE dispatch overhead in step time estimation. Align crossmodel with all other backends and update stale docs reference in docs/guide/latency-models.md. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> --------- Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
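The per-pool override resolution described above can be illustrated with a small sketch. The field names, the `Hardware`/`LatencyModel` string fields, and the lower-case `resolvePoolConfig` helper are assumptions for illustration; only the use of pointer types for TP, MaxModelLen, and TotalKVBlocks is taken from the commit text.

```go
package main

import "fmt"

// InstanceConfig is a hypothetical stand-in for the global per-instance settings.
type InstanceConfig struct {
	TP            int
	MaxModelLen   int64
	TotalKVBlocks int64
	Hardware      string
	LatencyModel  string
}

// PoolOverrides uses pointers for numeric fields so that "unset" can be told
// apart from an explicit zero; empty strings are assumed to mean "not overridden".
type PoolOverrides struct {
	TP            *int
	MaxModelLen   *int64
	TotalKVBlocks *int64
	Hardware      string
	LatencyModel  string
}

// resolvePoolConfig returns a copy of global with any set overrides applied.
// It is a pure function: neither argument is mutated.
func resolvePoolConfig(global InstanceConfig, o PoolOverrides) InstanceConfig {
	out := global // struct copy; safe because the struct holds no reference types
	if o.TP != nil {
		out.TP = *o.TP
	}
	if o.MaxModelLen != nil {
		out.MaxModelLen = *o.MaxModelLen
	}
	if o.TotalKVBlocks != nil {
		out.TotalKVBlocks = *o.TotalKVBlocks
	}
	if o.Hardware != "" {
		out.Hardware = o.Hardware
	}
	if o.LatencyModel != "" {
		out.LatencyModel = o.LatencyModel
	}
	return out
}

func intPtr(v int) *int { return &v }

func main() {
	global := InstanceConfig{TP: 2, MaxModelLen: 8192, TotalKVBlocks: 4000, Hardware: "H100", LatencyModel: "roofline"}
	prefill := resolvePoolConfig(global, PoolOverrides{TP: intPtr(4)})
	decode := resolvePoolConfig(global, PoolOverrides{})
	fmt.Printf("%+v\n%+v\n", prefill, decode)
}
```

Because the resolver only copies and overwrites, applying the same overrides twice gives the same result as applying them once, which is the property an idempotence test such as TestResolvePoolConfig_Idempotent would check.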

* feat(sim/cluster): KV transfer contention model (PD Phase 2, PR2) Model shared-bandwidth effects when multiple KV transfers overlap. When --pd-transfer-contention is enabled, each concurrent transfer receives fair-share bandwidth: effective_bw = total_bw / max(1, N). - Active transfer counter (increment on start, decrement on completion) - Fair-share bandwidth division in KVTransferStartedEvent.Execute() - CLI flag --pd-transfer-contention (bool, default false) - INV-P2-2 invariant: effective_bandwidth = total_bandwidth / max(1, active) - Contention metrics: PeakConcurrentTransfers, MeanTransferQueueDepth - BC-P2-5: single transfer identical to Phase 1 - BC-P2-7: INV-PD-3 (transfer conservation) still holds - Backward compatible: feature off by default Closes #634 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs(standards): add INV-P2-2 to canonical invariants.md Convergence review (Round 1) identified that INV-P2-2 (transfer fair-share) was added to CLAUDE.md but not to the canonical source docs/contributing/standards/invariants.md. This fixes the DRY violation by adding the full invariant definition with statement, verification references, mechanism, and hypothesis family. 
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(sim/cluster): address PR review findings for KV transfer contention model C1: add activeTransfers underflow guard in KVTransferCompletedEvent.Execute() matching the inFlightRequests pattern (logrus.Errorf + clamp-to-zero, R1) C2: add post-simulation diagnostic when activeTransfers != 0 at Run() exit, making horizon-truncated contention state visible to operators I1: fix INV-P2-2 formula description — replace max(1, active_transfers) with accurate two-case description in pd_events.go, invariants.md, and CLAUDE.md I2: document split ownership in CollectPDMetrics — PeakConcurrentTransfers and MeanTransferQueueDepth must be attached by callers after CollectPDMetrics I3: change transferDepthSum/transferStartCount to int64 to prevent silent arithmetic overflow on 32-bit platforms I4: remove stale "Phase 2, PR2" locator comments (R15) — replace with stable functional descriptions referencing INV-P2-2 and the flag name S1: replace formula unit test with behavioral test that drives actual ClusterSimulator and measures observed TransferCompleteTime - TransferStartTime S2: change t.Skipf to t.Fatalf in BCP26_FairShareDivision — concurrent transfers must occur; skip silently hid coverage gaps Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * fix(sim/cluster): address convergence review findings for transfer contention model - invariants.md: fix stale Verification line for INV-P2-2 (claimed table-driven N=1,2,4 but test only verifies N=1; N>1 behavioral coverage is via BCP26) - deployment.go: remove stale PR1/PR2/Phase2 references from field group comments - cluster.go: fix activeTransfers warning comment to accurately describe when the condition is reachable (only via prior negative-guard correction, not horizon truncation as the old comment incorrectly stated) - transfer_contention_test.go: refactor MeanQueueDepthZeroTransfers from ClusterSimulator struct literal to production 
integration path (NewClusterSimulator + mustRun with 0 requests); add explanatory comment to MeanQueueDepthCalculation - docs/reference/configuration.md: add --pd-transfer-contention to PD flags table; clarify --pd-transfer-bandwidth "shared global fabric" semantics - docs/guide/results.md: document PeakConcurrentTransfers and MeanTransferQueueDepth metrics with accurate descriptions (including NOT-a-queue-depth caveat) - docs/guide/cluster.md: add "when to enable --pd-transfer-contention" guidance Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * fix(sim/cluster): address PR review findings for transfer contention model Critical fixes: - R3: add int64 overflow guard for BlockSizeTokens * PDKVBytesPerToken in NewClusterSimulator; overflow previously silently clamped transfer duration to 1 µs for pathologically large parameter combinations - Clarify activeTransfers warning comment: INV-PD-3 catches horizon truncation first, so the post-run warning fires only on undetected bookkeeping imbalance Important fixes: - Add contentionBookkeepingCorrupted bool field; set by negative guard in KVTransferCompletedEvent; Run() returns error instead of delivering silently invalid contention metrics to callers - Add TestTransferContention_INVP22_N2FormulaExact: direct formula verification for N=2 (17 µs), complementing the existing N=1 test (9 µs), covering the increment-before-calculate ordering invariant - Add TestTransferContention_NegativeGuard_SetsCorruptionFlag: exercises the negative-guard path via in-package state manipulation, verifying the flag is set and activeTransfers is reset to 0 - Strengthen TestTransferContention_BCP26_FairShareDivision assertion from >= mean-1 (lower-bound only) to > mean (strict), so the test fails if the contention branch is dead code Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * fix(sim/cluster): address convergence review findings for PD transfer model - fix(pd_events): use float64 arithmetic for 
transfer bytes to eliminate int64 silent overflow (R11 — numBlocks * blockSizeBytes could wrap for extreme configs; float64 handles any realistic block count safely) - fix(docs): replace remaining meta-llama examples with qwen/qwen3-14b in latency-models.md, cluster.md, configuration.md (4 instances) - feat(docs): add Context Window Enforcement section to cluster.md documenting --max-model-len, auto-derivation, and KV-feasible capping - fix(docs): add DroppedAtDecodeKV row to results.md PD metrics table, cross-referencing stdout label vs struct field name - fix(test): add R7 companion invariant test TestPrintPDMetrics_Invariant (nil/non-nil duality law) in cmd/kv_metrics_output_test.go - fix(test): add R7 companion invariant test TestCollectPDMetrics_Invariant (LoadImbalanceRatio ≥ 1.0 law) in sim/cluster/pd_metrics_test.go - docs(resolve): document PoolOverrides pointer field contract for library callers constructing DeploymentConfig directly Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * fix(sim/cluster): address PR review findings for PD transfer contention model - R3: add NaN/Inf guards to NewClusterSimulator for PDTransferBandwidthGBps and PDTransferBaseLatencyMs (library-level dual validation) - Compound contentionBookkeepingCorrupted into INV-PD-3 error when both conditions are true (prevents swallowing corruption signal) - Add logrus.Errorf in zero-bandwidth fallback branch (R1: no silent drops) - Add comment on contention/non-contention activeTransfers asymmetry - Test: end-to-end corruption flag → Run() error path - Test: duration floor for 0-block transfers (1 µs minimum) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(sim/cluster): address convergence review findings for PD transfer contention F1: Reject --pd-transfer-contention when PD disaggregation is not active (R3). F2: Always print contention metrics when feature is enabled, even if zero. 
F3: Add R7 companion invariant tests for formula golden tests (divisor law, duration floor property, monotonicity). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(cmd): address convergence review findings for PD transfer contention - Add logrus.Warnf when max_position_embeddings is absent/zero in HF config during analytical backend auto-derivation; previously silent - Add NaN/Inf validation for trained-roofline beta/alpha coefficient arrays loaded from defaults.yaml (R20: allZeros passes NaN) - Add TestRunCmd_MaxModelLen_FlagRegistered flag registration test Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * docs(guide): document cluster-wide bandwidth pool assumption for PD transfer contention Add a "Single shared bandwidth pool" admonition to the --pd-transfer-contention guidance in docs/guide/cluster.md. Clarifies that all concurrent KV transfers share one cluster-wide bandwidth budget regardless of which prefill/decode instance pair is involved, and advises users to set --pd-transfer-bandwidth to the aggregate shared capacity (not per-NIC bandwidth) for accurate contention modeling. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> --------- Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

#647) * docs: add design spec for PD interference model (#635) Specifies the InterferenceLatencyModel wrapper that applies multiplicative slowdown to StepTime() based on batch phase composition for break-even analysis between disaggregation transfer cost and co-location interference. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: add implementation plan for PD interference model (#635) 5-task plan: wrapper + tests, injection plumbing, CLI flags, integration test, documentation updates. TDD throughout. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(sim/cluster): add InterferenceLatencyModel wrapper (#635) Tier composition wrapper that applies multiplicative slowdown to StepTime() based on batch phase composition. Satisfies BC-P2-9 through BC-P2-12 and INV-P2-3 (multiplier >= 1.0). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(sim/cluster): wire interference model into instance construction (#635) Add PDInterferencePrefill/Decode to DeploymentConfig. Extract newInstanceSimulatorCore to wrap latency model when factors are non-zero. Public NewInstanceSimulator API unchanged (R4: 23 test call sites unaffected). Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * feat(cmd): add --pd-interference-prefill/decode CLI flags (#635) Wire interference factors through CLI → DeploymentConfig → instance construction. Validated as finite non-negative (R3). Default 0 = no interference (BC-P2-9). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * test(sim/cluster): add cluster integration test for interference model (#635) Verifies that non-zero interference factors produce longer simulation times and per-request E2E latencies compared to the zero-interference baseline. 
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: update CLAUDE.md and design guidelines for interference model (#635) Add INV-P2-3 (interference monotonicity), CLI flags, file organization entry, and module map entry for InterferenceLatencyModel. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix(sim/cluster): convergence review fixes for interference model (#635) - Strengthen TestNewInstanceSimulatorCore_WrapsLatencyModel with behavioral assertions (mixed batch slows down, phase-pure unchanged, INV-P2-3 holds) - Add R3 validation for PDInterferencePrefill/Decode in NewClusterSimulator - Add MaxInterferenceFactor=100.0 exported constant; add upper-bound validation at all three layers (CLI, cluster constructor, factory) to prevent silent int64 overflow on degenerate inputs (R20) - Use single exported constant from all validation sites (DRY — no drift risk) - Add tie-break comment explaining conservative max-factor choice at equal split - Expand interference.go doc to document no-op in PD disaggregated mode - Update CLI help text with formula example (factor=0.5 → 1.25x at even split) - Add "Co-Location Interference Model" section to docs/guide/cluster.md - Add --pd-interference-prefill/decode rows to docs/reference/configuration.md - Add PD disaggregation mention to README features list Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * fix(sim/cluster): address PR review findings for interference model - Move interference factor validation before instance construction in NewClusterSimulator so the authored error messages are reachable and no partial allocation occurs on invalid configs (Critical) - Add TestNewClusterSimulator_InvalidInterferenceFactors_Panics with 8 table cases (negative, NaN, ±Inf, above-max for each field) to verify ClusterSimulator panic messages fire before any allocation - Fix "no-op at zero" comment in newInstanceSimulatorCore to accurately describe || semantics: "no-op only when 
both are zero" - Add 4 asymmetric-factor cases to TestInterferenceLatencyModel_StepTime verifying one-factor-zero behavior for each dominant phase - Add TestInterferenceModel_ClusterIntegration_INV_P2_3 as R7 invariant companion: verifies SimEndedTime and per-request E2E are non-decreasing under interference, with non-vacuity assertion - Fix R20 → R3 misattribution on MaxInterferenceFactor and NewInterferenceLatencyModel (R20 is for anomaly detectors; these are numeric parameter range guards, which is R3) - Fix "at most 51×" → "exactly 51×" on MaxInterferenceFactor comment Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * fix(sim/cluster): convergence review fixes for interference model (#647) - Add non-vacuity guard to TestInterferenceModel_ClusterIntegration golden test (prevents vacuous pass when no requests complete) - Add INV-PD-1 defensive runtime check in DecodeRoutingEvent.Execute() (detects decode_enqueue < transfer_complete on priority-ordering regression) - Document request-count vs token-count approximation in computeMultiplier (calibration guidance for heterogeneous workloads) - Document intentional parentRequests map retention in cluster.go (clarifies never-pruned-by-design, bounded at <100K for typical sims) Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * fix(sim/cluster): apply PR review fixes for interference model - Fix MaxInterferenceFactor comment: int64 overflow (not float64) - Fix cluster.go comment: MaxInt64 (not MaxFloat64) - Add R20 warning in NewClusterSimulator when interference factors are non-zero but deployment is fully disaggregated (no-op scenario) - Fix INV-PD-2 doc: qualify phase-purity claim for PrefixThresholdDecider - Fix StepTime doc: int64(len(InputTokens)) matches util.Len64 usage - Fix approximation note: use decode-dominant example to match label - Add overflow guard in StepTime before result<1 clamp (R1) - Add compile-time sim.LatencyModel interface assertion - Fix makeBatch: decode 
ProgressIndex=15 (>len, not just ==len boundary) - Add tied-split symmetry test case (reversed factors) - Add LastAppliedMultiplier atomicity test (3 consecutive calls) - Add PD mode no-op cluster test: BC-P2-10 at cluster level Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> --------- Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

…R4) (#650) * feat(sim): add DirectToDecodeDecider for short-prompt bypass Routes short prompts (len(InputTokens) < threshold) directly to the decode pool, skipping disaggregation. Long prompts continue through the full prefill→transfer→decode pipeline. BC-P2-18: len(InputTokens) < threshold → Disaggregate=false BC-P2-19: empty input → Disaggregate=false Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(sim): register direct-to-decode in decider bundle Adds "direct-to-decode" to validDisaggregationDeciders map and adds factory panic case directing callers to use NewDirectToDecodeDecider. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(sim/cluster): wire DirectToDecodeDecider in cluster constructor Adds PDDirectDecodeThreshold field to DeploymentConfig, switch-based decider construction in NewClusterSimulator, and updates interference warning to only fire for fully-disaggregated deployments (decider=always). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(sim/cluster): pool-filtered routing for non-disaggregated requests (INV-P2-4a) When pools are configured and a request is not disaggregated, route it to the decode pool only via a new poolFilter field on RoutingDecisionEvent. Decode instances handle both phases with interference cost from PR3. BC-P2-14: non-disaggregated + pools → decode pool only BC-P2-15: non-disaggregated → no ParentRequest records Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * feat(cmd): add --pd-direct-decode-threshold CLI flag Registers the --pd-direct-decode-threshold flag (default 256) and wires it to DeploymentConfig. Includes validation (>= 0) and stale-flag warning when used with a different decider. 
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * test(sim/cluster): mixed-workload integration test for direct-to-decode Verifies short prompts route directly to decode pool while long prompts go through full PD pipeline (BC-P2-14, BC-P2-15, BC-P2-16, INV-1). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * test(sim/cluster): invariant and backward-compat tests for direct-to-decode - INV-P2-4a: non-disaggregated + pools → decode pool (tested with never decider) - INV-P2-4b/BC-P2-17: interference applied to mixed-phase decode batches - INV-6: determinism for mixed short/long workloads - BC-P2-13: always-disaggregate behavior unchanged by pool filter Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: update CLAUDE.md for direct-to-decode decider (PR4) - Add --pd-direct-decode-threshold to CLI flags listing - Add DirectToDecodeDecider to disaggregation.go file description - Add PDDirectDecodeThreshold to DeploymentConfig description - Add INV-P2-4 (decode-targeted routing) to invariants section - Update disaggregated data flow with direct-to-decode decider option - Update local path to show decode pool routing (INV-P2-4) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * docs: add INV-P2-4, direct-to-decode config reference, and design guidelines update - invariants.md: add INV-P2-3 (interference monotonicity) and INV-P2-4 (decode-targeted routing) with verification strategies - configuration.md: add direct-to-decode decider to table, semantics section, and --pd-direct-decode-threshold flag - design-guidelines.md: add DisaggregationDecider to Section 4.2 module map with all four variants Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * review: fix test quality, snapshot efficiency, and docstring gaps - TestDirectToDecodeDecider_ClusterConstruction: replace nil check with behavioral assertion (decode pool routing) per BDD/TDD principles - TestDirectToDecodeDecider_MixedWorkload: 
tie INV-1 check to request slice lengths instead of hardcoded constant; clarify sub-request vs parent-request counting in comment - buildPoolFilteredSnapshots: only construct snapshots for target pool members; eliminates O(N/2) wasted Snapshot() calls per routing event - DisaggregationDecisionEvent.Execute: document no-pools fallback path - DirectToDecodeDecider.Decide: add empty-input note to method docstring Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * review: safety guards, PoolOverrides validation, and test coverage gaps Addresses findings from comprehensive PR review: Safety (C1, C3, C6, C8): - C1: Add PDTransferBaseLatencyMs upper-bound guard (<=3.6e9 ms) in NewClusterSimulator to prevent int64 overflow in transfer duration calculation (silent 1 µs clamp bug) - C3: Add PoolOverrides.Validate(name) method enforcing *TP>0, *MaxModelLen>0, *TotalKVBlocks>0; call from NewClusterSimulator before instance construction (R3) - C6: Add BlockSizeTokens>0 check in NewClusterSimulator PD block so error appears at construction time, not as a panic inside Run() (R6) - C8: Add clarifying comments above unreachable panic sites in newInstanceSimulatorCore explaining why they're safe and what to do if the function is ever called outside NewClusterSimulator Documentation (C4, C5): - C4: Update NewDisaggregationDecider docstring to include "direct-to-decode" in panic description - C5: Update PDDecider field comment to include "direct-to-decode" in the valid-values enum Test coverage (GAP-1, GAP-2, GAP-3): - GAP-1: Add INV-P2-4 pool-membership assertion to TestPrefixThreshold_HighThresholdNoDisaggregation - GAP-2: Add TestCollectPDMetrics_DroppedAtDecodeKV table-driven unit test for the TransferCompleteTime>0 && DecodeInstanceID=="" condition - GAP-3: Add INV-P2-4 pool-membership assertion to TestDisaggregationDecisionEvent_SchedulesRouting Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]> * review: convergence review fixes — O(K) 
detection, docs, and test coverage

Performance: detectPrefillCompletions/detectDecodeCompletions O(N×M) → O(K) via pendingPrefillByInstance/pendingDecodeByInstance per-instance indexing. Pending maps cleared at simulation end to release memory for dropped/horizon-truncated requests.

Validation: warn when prefill+decode < total instances (unassigned idle).

Tests: TestDisaggregation_MaxModelLen_DropsOversizedRequests: exact drop count, CompletedRequests verification, and INV-1 conservation check.

Docs: length_capped_requests anomaly counter; README "Four latency modes"; quickstart PD examples; tutorial Step 8 PD break-even walkthrough; extension-recipes tier-composition pattern; pd-disaggregation-demo.yaml.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>

* review: nil guard, godoc hardening, validate tests, and boundary coverage

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>

* review: convergence review fixes - per-pool maxModelLen cap, O(K) snapshot, contention accuracy, and docs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>

* review: nil guard, spurious warning fix, dead comment, topology guard, and warning accuracy

- sim/disaggregation.go: add nil guard in DirectToDecodeDecider.Decide (panics with named message instead of NPE deep in event loop)
- sim/cluster/cluster_event.go: fix spurious "unhandled pool role" warning; known roles without custom policies fall through silently, and the warning is reserved for unrecognized roles only
- sim/cluster/cluster_event.go: replace dead-code-path comment "When pools are NOT configured" with accurate "Defensive: always true here" note
- cmd/root.go: fix misleading zero-threshold warning; it now qualifies "for non-empty inputs" and notes the empty-input exception
- cmd/root.go: fix topology guard (&&→||) to catch the one-pool-zero case with the friendly "requires both" message

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
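The direct-to-decode rule stated in these commits (BC-P2-18: prompts shorter than the threshold skip disaggregation; BC-P2-19: empty input never disaggregates; nil requests panic with a named message) can be sketched as below. The request and decision types are simplified stand-ins, not the simulator's real types, and the exact Decide signature of the DisaggregationDecider interface is an assumption.

```go
package main

import "fmt"

// request and decision are simplified stand-ins for the simulator's types.
type request struct{ InputTokens []int }
type decision struct{ Disaggregate bool }

// directToDecodeDecider is a hypothetical sketch of the short-prompt bypass.
type directToDecodeDecider struct{ threshold int }

// Decide returns Disaggregate=false for empty inputs and for prompts
// strictly shorter than the threshold; everything else goes through the
// full prefill → transfer → decode pipeline.
func (d directToDecodeDecider) Decide(req *request) decision {
	if req == nil {
		// Fail early with a named message instead of a nil dereference later.
		panic("directToDecodeDecider.Decide: nil request")
	}
	n := len(req.InputTokens)
	if n == 0 || n < d.threshold {
		return decision{Disaggregate: false}
	}
	return decision{Disaggregate: true}
}

func main() {
	d := directToDecodeDecider{threshold: 256} // 256 is the CLI default cited above
	short := &request{InputTokens: make([]int, 10)}
	long := &request{InputTokens: make([]int, 300)}
	fmt.Println(d.Decide(short).Disaggregate, d.Decide(long).Disaggregate)
}
```

With a threshold of 0, the strict comparison never matches, so every non-empty prompt is disaggregated while empty inputs still bypass; this is the exception the zero-threshold warning describes.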

… PD mode (#687)

In PD disaggregation, a decode sub-request with 1 output token enters with ProgressIndex == inputLen. After one decode step, ProgressIndex becomes inputLen+1, which overshot the == completion check and allowed a second erroneous decode step that panicked in AllocateKVBlocks with index out of range.

Two fixes:
- Use >= instead of == in completion check (catches overshoot)
- Add bounds guard on final-token AllocateKVBlocks (prevents OOB access)

Adds regression test TestPDDisagg_OneOutputToken_NoPanic.

Co-authored-by: Claude Sonnet 4.6 <[email protected]>
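The overshoot described in this commit can be reproduced with a toy model. The sketch below assumes the completion target is inputLen + outputLen - 1, which is inferred from the description (a 1-output-token sub-request entering at ProgressIndex == inputLen) rather than confirmed from the source. All names are hypothetical.

```go
package main

import "fmt"

// isCompleteStrict models the old check: it matches only when the
// progress index lands exactly on the (assumed) completion target.
func isCompleteStrict(progressIndex, inputLen, outputLen int) bool {
	return progressIndex == inputLen+outputLen-1
}

// isComplete models the fixed check: >= also catches an index that has
// already stepped past the target.
func isComplete(progressIndex, inputLen, outputLen int) bool {
	return progressIndex >= inputLen+outputLen-1
}

func main() {
	inputLen, outputLen := 10, 1
	progress := inputLen // decode sub-request enters with prefill already done
	progress++           // one decode step
	fmt.Println("strict:", isCompleteStrict(progress, inputLen, outputLen))
	fmt.Println("fixed: ", isComplete(progress, inputLen, outputLen))
}
```

Under these assumptions the strict check reports the request as unfinished after its only decode step, which is what would allow a second step to index past the end of the token slice.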

Commits on Mar 11, 2026

Commits on Mar 12, 2026

Commits on Mar 13, 2026

Commits on Mar 14, 2026

Commits on Mar 16, 2026
