
Comparing changes

base repository: inference-sim/inference-sim
base: main
head repository: inference-sim/inference-sim
compare: pd
  • 12 commits
  • 117 files changed
  • 5 contributors

Commits on Mar 11, 2026

  1. feat(sim): pool topology and disaggregation decision pipeline (#599)

    * feat(sim): add pool topology and disaggregation decision pipeline (#591)
    
    Establishes PD (Prefill-Decode) disaggregation foundation: pool topology
    as a first-class cluster concept and the disaggregation decision point in
    the event pipeline. When pools are not configured (--prefill-instances=0
    --decode-instances=0), output is byte-identical to pre-PR1 behavior
    (BC-PD-1 verified at seeds 42, 123, 999).
    
    - Add DisaggregationDecider interface with NeverDisaggregate and
      AlwaysDisaggregate implementations (sim/disaggregation.go)
    - Add PoolRole type, ValidatePoolTopology, BuildPoolMembership
      (sim/cluster/pool.go)
    - Add DisaggregationDecisionEvent (priority 3) to cluster event pipeline
    - Conditional branch in AdmissionDecisionEvent.Execute: pools configured
      → DisaggregationDecisionEvent, otherwise → RoutingDecisionEvent
    - Add PrefillInstances, DecodeInstances, PDDecider fields to
      DeploymentConfig (zero-value = disabled, all existing construction
      sites backward-compatible per R4 audit)
    - Add --prefill-instances, --decode-instances, --pd-decider CLI flags
    - Update CLAUDE.md with new files and flags
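A minimal sketch of the decider interface and the admission branch described above. The `Request` type, the `Decide` signature, and the `afterAdmission` helper are assumptions made for illustration; only the type names and the `DisaggregationDecision{Disaggregate: ...}` shape appear in this log.

```go
package main

import "fmt"

// Request is a hypothetical stand-in for the simulator's request type.
type Request struct {
	ID string
}

// DisaggregationDecision uses the struct-literal shape referenced in this log.
type DisaggregationDecision struct {
	Disaggregate bool
}

// DisaggregationDecider is sketched with an assumed single-method signature.
type DisaggregationDecider interface {
	Decide(req Request) DisaggregationDecision
}

// NeverDisaggregate always keeps the request on the standard routing path.
type NeverDisaggregate struct{}

func (NeverDisaggregate) Decide(Request) DisaggregationDecision {
	return DisaggregationDecision{Disaggregate: false}
}

// AlwaysDisaggregate always sends the request down the PD path.
type AlwaysDisaggregate struct{}

func (AlwaysDisaggregate) Decide(Request) DisaggregationDecision {
	return DisaggregationDecision{Disaggregate: true}
}

// afterAdmission (hypothetical helper) names the event that follows admission:
// with no pools configured the pipeline goes straight to routing.
func afterAdmission(prefillInstances, decodeInstances int) string {
	if prefillInstances == 0 && decodeInstances == 0 {
		return "RoutingDecisionEvent"
	}
	return "DisaggregationDecisionEvent"
}

func main() {
	deciders := []DisaggregationDecider{NeverDisaggregate{}, AlwaysDisaggregate{}}
	for _, d := range deciders {
		fmt.Printf("%T -> %+v\n", d, d.Decide(Request{ID: "r1"}))
	}
	fmt.Println(afterAdmission(0, 0))
	fmt.Println(afterAdmission(2, 2))
}
```

Keeping the zero-value configuration on the original routing path is what lets the disabled case stay byte-identical to earlier behavior.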
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * fix(sim): convergence review fixes for pool topology PR
    
    Address 4 IMPORTANT findings from pr-code convergence review:
    
    1. PoolMembership() now returns a defensive copy instead of the
       internal map, complying with R8 (no exported mutable maps)
    2. CLI now calls ValidatePoolTopology at the boundary for clean
       error messages instead of relying on library panic
    3. Factory test verifies concrete type via fmt.Sprintf("%T"),
       fixing unused wantType field
    4. Integration tests verify INV-1 (request conservation) and
       INV-5 (causality) after simulation with disaggregation enabled
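Findings 1 and 2 can be illustrated with a short sketch. The lowercase names, the `PoolRole` constants, and the exact validation rules are assumptions; the `prefill + decode <= num-instances` constraint is the one documented elsewhere in this log, while rejecting a half-configured topology is a guess.

```go
package main

import "fmt"

// PoolRole values are hypothetical; only the type name comes from the log.
type PoolRole int

const (
	RolePrefill PoolRole = iota + 1
	RoleDecode
)

// validatePoolTopology is an illustrative boundary check returning an error
// instead of panicking, so a CLI can print a clean message.
func validatePoolTopology(prefill, decode, total int) error {
	if prefill < 0 || decode < 0 {
		return fmt.Errorf("pool sizes must be non-negative, got prefill=%d decode=%d", prefill, decode)
	}
	if prefill == 0 && decode == 0 {
		return nil // pools disabled
	}
	// Assumption: a topology with only one pool configured is rejected.
	if prefill == 0 || decode == 0 {
		return fmt.Errorf("both pools must be set, got prefill=%d decode=%d", prefill, decode)
	}
	if prefill+decode > total {
		return fmt.Errorf("prefill (%d) + decode (%d) exceeds instances (%d)", prefill, decode, total)
	}
	return nil
}

type cluster struct {
	poolMembership map[string]PoolRole
}

// PoolMembership returns a defensive copy so callers cannot mutate internal state.
func (c *cluster) PoolMembership() map[string]PoolRole {
	out := make(map[string]PoolRole, len(c.poolMembership))
	for id, role := range c.poolMembership {
		out[id] = role
	}
	return out
}

func main() {
	fmt.Println(validatePoolTopology(3, 2, 4))

	c := &cluster{poolMembership: map[string]PoolRole{"inst-0": RolePrefill}}
	m := c.PoolMembership()
	m["inst-0"] = RoleDecode // mutates only the copy
	fmt.Println(c.PoolMembership()["inst-0"] == RolePrefill)
}
```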
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    ---------
    
    Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
    namasl and claude authored Mar 11, 2026
    fe7f465
  2. feat(sim): end-to-end disaggregated request flow (#592) (#617)

    * feat(cluster): add ParentRequest type and FilterSnapshotsByPool (BC-PD-5, BC-PD-7, BC-PD-9)
    
    - Add ParentRequest struct for tracking disaggregated request lifecycle
    - Add FilterSnapshotsByPool for pool-scoped routing
    - Foundation types for PR2 disaggregated request flow
    
    Co-Authored-By: Claude <[email protected]>
    
    * feat(cluster): add PD transfer config fields and CLI flags (EC-PD-2, BC-PD-14)
    
    - Add PDTransferBandwidthGBps, PDTransferBaseLatencyMs, PDKVBytesPerToken to DeploymentConfig
    - Add PrefillScorerConfigs, DecodeScorerConfigs for per-pool routing
    - Add CLI flags: --pd-transfer-bandwidth, --pd-transfer-base-latency, --pd-kv-bytes-per-token
    - Add CLI flags: --prefill-routing-scorers, --decode-routing-scorers
    - Validation: bandwidth finite positive, base latency finite non-negative, bytes-per-token > 0 (R3, R11)
    
    Co-Authored-By: Claude <[email protected]>
    
    * feat(cluster): implement end-to-end disaggregated request flow (BC-PD-5 through BC-PD-14)
    
    - Bifurcate DisaggregationDecisionEvent: disaggregate=true → PrefillRoutingEvent, false → RoutingDecisionEvent
    - Add 4 new event types: PrefillRoutingEvent (4), KVTransferStartedEvent (5), KVTransferCompletedEvent (6), DecodeRoutingEvent (7)
    - Add pool-filtered snapshot building with buildPoolFilteredSnapshots helper
    - Add prefill completion detection in cluster event loop via pendingPrefillCompletions map
    - Implement KV transfer pipeline: compute duration from block count and bandwidth, schedule events
    - Add AllocateTransferredKV and InjectDecodeOnline to InstanceSimulator
    - Add EnqueueDecodeSubRequest to Simulator (bypasses oversized guard and TotalInputTokens counting)
    - Add decode-only batch formation path in VLLMBatchFormation Phase 2
    - Initialize per-pool routing policies with separate RNG partitions
    - Update PR1 test for disaggregated conservation counting
    
    Co-Authored-By: Claude <[email protected]>
    
    * test(cluster): comprehensive disaggregation invariant and integration tests (BC-PD-5 through BC-PD-15)
    
    - Pool exclusivity: prefill/decode sub-requests routed to correct pools (BC-PD-7)
    - Full path completion: requests complete through disaggregated pipeline (BC-PD-5)
    - Phase causality: timestamps form valid causal chain (INV-PD-4)
    - Transfer conservation: initiated == completed (INV-PD-3)
    - Pool stability: membership unchanged after simulation (INV-PD-5)
    - Determinism: same seed produces identical metrics (INV-6)
    - Backward compatibility: non-disaggregated path unchanged (BC-PD-13)
    - Per-pool scorer configs: separate routing policies work end-to-end (BC-PD-15)
    - AllocateTransferredKV: success and insufficient capacity cases
    
    Co-Authored-By: Claude <[email protected]>
    
    * docs: update CLAUDE.md and add PR2 implementation plan
    
    - Add disaggregated data flow diagram to Key Data Flow section
    - Add INV-PD-1 through INV-PD-5 to Key Invariants
    - Add new CLI flags to cmd/root.go description
    - Add pd_events.go and parent_request.go to file organization
    - Include PR2 micro plan in docs/plans/
    
    Co-Authored-By: Claude <[email protected]>
    
    ---------
    
    Co-authored-by: Claude <[email protected]>
    namasl and claude authored Mar 11, 2026
    c426dda
  3. feat(sim/cluster): add disaggregation-aware metrics (PR3) (#619)

    * feat(cluster): add PDMetrics struct and CollectPDMetrics (BC-1,BC-2,BC-4,BC-5,BC-6,BC-7,BC-10,BC-11)
    
    - PDMetrics: DisaggregatedCount, ParentTTFT, TransferDuration,
      PrefillThroughput, DecodeThroughput, LoadImbalanceRatio
    - CollectPDMetrics: pure function, R2-compliant sort, BC-11 TTFT filter
    - collectPoolThroughput: R11-guarded division, math.MaxFloat64 sentinel
    - 10 unit + integration tests covering all BC edge cases
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * feat(cluster): add ParentRequests and PerInstanceMetricsByID accessors (BC-11,BC-12)
    
    - ParentRequests(): returns sorted-by-ID slice after Run() (R2, BC-11)
    - PerInstanceMetricsByID(): returns copy map after Run() (R8, BC-12)
    - Fix ParentTTFT to use PrefillSubReqID (decode sub-reqs don't record TTFT
      as ProgressIndex starts at inputLen; prefill sub-req TTFT = first token time)
    - Add 4 integration+invariant tests: accessor coverage, pool conservation
      (BC-3), BC-1 causality invariant
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * feat(cluster): add PD *PDMetrics field to RawMetrics (BC-9)
    
    - RawMetrics.PD: nil when disaggregation inactive (Go zero value)
    - Named-field construction at all sites remains safe (R4)
    - Test BC-9: PD == nil in non-disaggregated CollectRawMetrics path
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * feat(cmd): wire CollectPDMetrics and add printPDMetrics output (PR3)
    
    - Wire rawMetrics.PD = CollectPDMetrics(...) after CollectRawMetrics
    - Add printPDMetrics: prints === PD Metrics === section when pd != nil
    - Outputs DisaggregatedCount, throughput, LoadImbalanceRatio (inf sentinel),
      ParentTTFT percentiles, KV Transfer Duration percentiles
    - 2 output tests: nil no-op, section content verification
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * docs(claude): update file organization for PR3 pd_metrics.go
    
    - Add pd_metrics.go entry to sim/cluster/ tree
    - Update cluster.go entry with new accessor methods
    - Update metrics.go entry to mention PD *PDMetrics field
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(sim): convergence review Round 1 fixes for PR3 pd-metrics
    
    Critical fixes:
    - pd_events.go: increment droppedAtDecodeKV on KV allocation failure (R1/INV-1)
    - cluster.go: account for droppedAtDecodeKV in aggregated DroppedUnservable after Run()
    - cluster.go: panic on nil parent in detectPrefillCompletions (R1, was silent continue)
    - docs/contributing/standards/invariants.md: add INV-PD-1 through INV-PD-5 (cross-doc fix)
    
    Important fixes:
    - pd_metrics.go: nil guard for aggregated param in CollectPDMetrics
    - pd_metrics.go: nil guard for *sim.Metrics entries in collectPoolThroughput
    - pd_metrics_test.go: fix BC-1 comment (was DecodeSubReqID, correct is PrefillSubReqID)
    - disaggregation_test.go: t.Skip → t.Fatal in BC-1 causality test (masks regressions)
    - disaggregation_test.go: remove internal field access from ParentRequests/PerInstanceMetrics tests
    - disaggregation_test.go: add TestClusterSimulator_DisaggregatedINV1_Conservation (R7)
    - docs/guide/results.md: document === PD Metrics === output section
    - docs/reference/configuration.md: document PD disaggregation flags
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(sim): convergence review round 2 fixes for PR3 PD metrics
    
    - Fix R1: pd_metrics.go collectPoolThroughput panics on nil *sim.Metrics
      instead of silently skipping (was: silent continue, no diagnostic)
    - Fix R1: CollectPDMetrics panics when aggregated==nil with non-empty
      parents instead of silently returning nil (undocumented data loss path)
    - Fix INV-1 test: TestClusterSimulator_DisaggregatedINV1_Conservation
      now asserts the conservation identity (completed+queued+running+dropped
      == 2*N injected sub-requests) rather than a bare value check
    - Fix plan doc: update Section A "Key insight" and Section E "THE TRICKY
      PART" to describe PrefillSubReqID mechanism (not DecodeSubReqID) — the
      plan shipped with the original incorrect design that was corrected during
      implementation
    - Add PD disaggregation section to docs/guide/cluster.md with usage note
      warning that --pd-decider always is required to activate PD mode
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(sim): convergence review round 3 fixes for PR3 PD metrics
    
    Critical bug fix — INV-PD-4 stale instance clock:
    - sim/simulator.go: add currentTime int64 parameter to EnqueueDecodeSubRequest;
      idle StepEvent now uses cluster clock (not stale instance sim.Clock), preventing
      decode completions from firing before decode enqueue time
    - sim/cluster/instance.go: propagate currentTime through InjectDecodeOnline
    - sim/cluster/pd_events.go: pass e.time to InjectDecodeOnline in DecodeRoutingEvent
    
    Decode completion tracking (required to set parent.CompletionTime):
    - sim/cluster/cluster.go: add pendingDecodeCompletions map; add detectDecodeCompletions;
      extend event loop to call detectDecodeCompletions on decode pool instances
    - sim/cluster/cluster.go: fix detectPrefillCompletions to collect+sort keys before
      processing (R2/INV-6 determinism: seqIDs must be assigned in stable order)
    
    Orphaned parent record fix (IMPORTANT from PC-8/PC-1):
    - sim/cluster/pd_events.go: set e.parentReq.CompletionTime = e.time when decode
      KV allocation fails, preventing ParentRequests() from returning records in limbo
      (TransferCompleteTime set but CompletionTime = 0)
    
    Test quality fixes (IMPORTANT from PC-3):
    - disaggregation_test.go: replace structural nil-check assertions in
      TestDisaggregation_PerPoolScorerConfigs with behavioral comment; rely on
      TotalOutputTokens > 0 as observable behavioral assertion (BDD/TDD refactor survival)
    - disaggregation_test.go: fix zero-timestamp check in TestDisaggregation_PhaseCausality
      to use CompletionTime > 0 (latency model always produces ≥1 step duration)
    - disaggregation_test.go: replace internal field access with public accessors
      (cs.parentRequests → cs.ParentRequests(), cs.poolMembership → cs.PoolMembership())
    - disaggregation_test.go: add TestDisaggregation_NoCrossPoolRouting (NC-PD-1)
    
    Documentation (IMPORTANT from PC-4/PC-7):
    - docs/guide/cluster.md: add PD Troubleshooting section and Known Simplifications
      subsection documenting atomic transfer, no cross-instance preemption, no transfer
      retry, and fixed pool sizes as Phase 1 limitations
    - docs/guide/results.md: clarify PD metrics are stdout-only, not in JSON file
    - docs/contributing/extension-recipes.md: expand disaggregation decider recipe step 3
      with full CLI parameter wiring code example
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(sim/cluster): convergence review fixes for PD metrics PR (Round 3)
    
    - docs/contributing/extension-recipes.md: fix DisaggregationDecider recipe
      - Wrong enum constants (DisaggregationDecisionDisaggregate/Local) replaced with
        correct DisaggregationDecision{Disaggregate: true/false} struct literals
      - Factory wiring now shows correct pattern: extend NewDisaggregationDecider
        signature to accept extra parameter (avoids import cycle) and update call site
        in cluster.go
    - cmd/root.go: use math.IsInf(pd.LoadImbalanceRatio, 1) instead of fragile
      >= math.MaxFloat64/2 comparison for infinity sentinel detection
    - sim/cluster/disaggregation_test.go: add TestDisaggregation_INV_PD_1_DecodeEnqueueAfterTransfer
      standalone R7 companion invariant test for INV-PD-1 (DecodeEnqueueTime >= TransferCompleteTime)
    - docs/guide/results.md: clarify DisaggregatedCount includes requests subsequently
      dropped at decode KV allocation; add aggregate-vs-realtime caveat to Load Imbalance
      Ratio note; add 2N JSON rows note explaining PD mode produces two sub-request rows
      per original request
    - sim/cluster/parent_request.go: document CompletionTime dual semantics (actual
      decode completion vs. decode routing event time for dropped parents)
    - docs/guide/cluster.md: add sizing guidance for high DroppedUnservable in PD mode
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(sim/cluster): convergence review fixes for PD metrics PR (Round 4)
    
    - docs/guide/cluster.md: correct misleading "fails silently" text; topology
      validation calls logrus.Fatalf and exits loudly, not silently
    - sim/cluster/disaggregation_test.go:190: use cs.ParentRequests() public
      accessor instead of cs.parentRequests unexported field (BDD consistency)
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(cmd): fix LoadImbalanceRatio sentinel check in printPDMetrics
    
    The sentinel for "one pool idle" in pd_metrics.go is math.MaxFloat64
    (finite), not math.Inf(1). math.IsInf(math.MaxFloat64, 1) returns false,
    so the "inf (one pool idle)" display branch was unreachable after the
    Round 3 fix.
    
    Replace math.IsInf with pd.LoadImbalanceRatio == math.MaxFloat64 to
    precisely match the sentinel and restore correct output formatting.
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * test(cmd): add TestPrintPDMetrics_LoadImbalanceRatio_OnePoolIdle for sentinel check
    
    Adds behavioral test verifying that printPDMetrics outputs "inf (one pool idle)"
    when LoadImbalanceRatio == math.MaxFloat64 (the BC-10 sentinel). Ensures the
    sentinel check introduced in aa6a7b9 (== math.MaxFloat64) is exercised and will
    catch regressions.
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * test(sim/cluster): tighten LoadImbalanceRatio sentinel invariant test to use exact == math.MaxFloat64
    
    The ZeroMinGuard test used `< math.MaxFloat64/2` as the failure condition,
    which accepts any large float >= MaxFloat64/2. Changed to `!= math.MaxFloat64`
    so the test precisely matches the sentinel check in printPDMetrics (BC-10).
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    ---------
    
    Co-authored-by: Claude Sonnet 4.6 <[email protected]>
    namasl and claude authored Mar 11, 2026
    8f7d856

Commits on Mar 12, 2026

  1. feat(trace): PD disaggregation trace instrumentation (PR4) (#618)

    * feat(trace): add PD disaggregation trace record types (BC-PD-17)
    
    - Add DisaggregationRecord, PrefillRoutingRecord, DecodeRoutingRecord, KVTransferRecord
    - Add Disaggregations, PrefillRoutings, DecodeRoutings, KVTransfers slices to SimulationTrace
    - Add RecordDisaggregation, RecordPrefillRouting, RecordDecodeRouting, RecordKVTransfer methods
    - Update NewSimulationTrace constructor to initialize all new slices
    
    Co-Authored-By: Claude <[email protected]>
    
    * feat(cluster): instrument DisaggregationDecisionEvent with trace recording (BC-PD-18)
    
    - Record DisaggregationRecord for every disaggregation decision when trace enabled
    - Non-disaggregated mode: no disaggregation records (BC-PD-18)
    - Update ClusterEvent.Priority() comment to include full PD priority range (4-7)
    - Add TestPDTrace_NonDisaggMode_NoDisaggRecords and TestPDTrace_DisaggMode_DisaggDecisionRecorded
    
    Co-Authored-By: Claude <[email protected]>
    
    * feat(cluster): instrument PD event handlers with trace recording (BC-PD-17, BC-PD-19)
    
    - Instrument PrefillRoutingEvent with PrefillRoutingRecord + counterfactual support
    - Instrument DecodeRoutingEvent with DecodeRoutingRecord + counterfactual + KVTransferRecord
    - KVTransferRecord recorded in DecodeRoutingEvent so DecodeInstanceID is fully populated
    - Add TestPDTrace_DisaggMode_AllRecordTypesPresent (BC-PD-17)
    - Add TestPDTrace_DisaggMode_Counterfactual (BC-PD-19)
    
    Co-Authored-By: Claude <[email protected]>
    
    * docs: update CLAUDE.md trace/ descriptions for PR4 PD trace records
    
    Co-Authored-By: Claude <[email protected]>
    
    * fix(cluster,trace): address convergence review findings (Round 1)
    
    R1: increment droppedKVAllocations counter when AllocateTransferredKV
    fails in DecodeRoutingEvent — silent drop violated INV-PD-3.
    
    R5: move DecodeInstanceID/DecodeEnqueueTime assignment to after
    successful KV allocation — eliminates stale state on failure path.
    
    trace: extend TraceSummary with PD fields (DisaggregationCount,
    DisaggregatedCount, KVTransferCount, MeanTransferDuration) so
    --summarize-trace output is meaningful for disaggregated runs.
    
    test: add TestPDTrace_DisaggMode_Cardinality invariant test (R7)
    asserting len(PrefillRoutings)==len(KVTransfers)==len(DecodeRoutings).
    
    test: remove fragile hardcoded count from TransferConservation test —
    assert conservation law only, not absolute count.
    
    docs: update extension-recipes.md to mention pd_events.go as second
    hook site for PD-specific trace records.
    
    docs: add PD trace record descriptions to docs/guide/cluster.md
    Decision Tracing section.
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(cluster,trace): address convergence review findings (Round 2)
    
    - Print PD summary fields in --summarize-trace output (cmd/root.go)
    - Add DroppedKVAllocations() accessor to ClusterSimulator (R1 testability)
    - Add INV-PD-1..5 to docs/contributing/standards/invariants.md
    - Fix cardinality invariant test comment: general law uses DisaggregatedCount not len(Disaggregations)
    - Add R7 invariant test for DroppedKVAllocations counter (zero under ample capacity)
    - Add R7 invariant tests for new TraceSummary PD fields in summary_test.go
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * docs: fix extension recipe for trace records (Round 3 fixes)
    
    - Correct KVTransferRecord hook location: recorded in DecodeRoutingEvent
      (not KVTransferStartedEvent) because DecodeInstanceID is only known at
      decode routing time
    - Add missing step 6: update --summarize-trace output block in cmd/root.go
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * docs: add PR4 micro-plan for PD trace instrumentation
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(traces): address post-submission review findings
    
    - Fix INV-PD-1 verification citation: phase-causality tests live in
      disaggregation_test.go, not pd_traces_test.go
    - Surface DroppedKVAllocations in Anomaly Counters CLI output (R1: no
      silent data loss; visible even without --summarize-trace)
    - Add TransferStartTime > 0 assertion in TestPDTrace_DisaggMode_AllRecordTypesPresent
      (R7: invariant test for timestamp completeness)
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(traces): address convergence review findings (Round 4)
    
    - Add DroppedKVAllocations field to RawMetrics and update CollectRawMetrics
      signature to propagate PD decode-OOM counter through the standard metrics
      API (was only accessible via cs.DroppedKVAllocations() accessor)
    - Add TestCollectRawMetrics_DroppedKVAllocations invariant test
    - Add defensive clamp for TransferDuration (INV-PD-4 guarantees non-negative,
      but clamp to 0 if ordering invariant is ever violated)
    - Document Scores map semantics in PrefillRoutingRecord/DecodeRoutingRecord
      (higher=more preferred, raw weighted-scorer output, not normalized)
    - Document PD trace activation conditions in extension-recipes.md
      (both --trace-level and pool flags required)
    - Document pool-filtered snapshot requirement for counterfactual computation
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(traces): address convergence review findings (Round 5)
    
    - Add logrus.Warnf when defensive TransferDuration clamp triggers, making
      INV-PD-4 ordering violations detectable in logs (R1: no silent data loss)
    - Add PD trace usage example with expected output to docs/guide/cluster.md,
      including guidance on interpreting Disaggregation Decisions, KV Transfers,
      and Mean Transfer Duration metrics
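The defensive clamp with a warning might look like the sketch below. It uses the standard `log` package rather than logrus to stay self-contained, and the function name and message text are illustrative.

```go
package main

import (
	"fmt"
	"log"
)

// transferDuration returns complete-start, clamped to 0 if the ordering
// invariant (start <= complete) is ever violated. The warning makes the
// violation visible instead of silently hiding it.
func transferDuration(start, complete int64) int64 {
	d := complete - start
	if d < 0 {
		log.Printf("warning: negative transfer duration %d (start=%d, complete=%d); clamping to 0",
			d, start, complete)
		return 0
	}
	return d
}

func main() {
	fmt.Println(transferDuration(100, 160))
	fmt.Println(transferDuration(200, 150))
}
```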
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(traces): address PR review findings (comment accuracy, doc invariants)
    
    - Fix INV-PD-4 comment in pd_events.go: transfer_start ≤ transfer_complete
      is guaranteed by timestamp sequencing (duration >= 1µs), not priority ordering
    - Fix R15 stale PR reference in cluster_event_test.go test docstring
    - Add per-request ordering note to DisaggregationDecisionEvent docblock
    - Add cross-record invariant doc to DisaggregationRecord (paired records guarantee)
    - Add Regret >= 0 invariant annotation to PrefillRoutingRecord and DecodeRoutingRecord
    - Add TransferDuration enforcement-location note to KVTransferRecord
    - Clarify TargetDistribution scope in TraceSummary (standard routing only, not PD pool routing)
    - Update INV-PD-4 verification in invariants.md to correctly describe timestamp sequencing
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(pd): address PR review findings — validation, diagnostics, tests, docs
    
    Critical fixes:
    - Fix stale DisaggregationDecisionLocal reference in extension-recipes.md
    - Fix incorrect pool topology troubleshooting (== → <=) in cluster.md
    - Fix misattributed population site comments in cluster.go
    
    Important improvements:
    - Add R3 validation for PD transfer parameters (PDKVBytesPerToken,
      PDTransferBandwidthGBps, PDTransferBaseLatencyMs) in NewClusterSimulator
    - Add pool-filtered snapshot guards with specific panic messages (I3)
    - Add post-simulation INV-PD-3 transfer conservation check
    - Add post-simulation diagnostics for orphaned pending completions
    - Add DroppedAtDecodeKV field to PDMetrics for mid-pipeline drop visibility
    - Surface DroppedAtDecodeKV in CLI output when > 0
    - Add test for decode KV allocation failure path with INV-1 conservation
    - Add test for negative transfer duration clamp (INV-PD-4 defensive path)
    - Fix RawMetrics and bundle.go var block alignment inconsistencies
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    ---------
    
    Co-authored-by: Claude <[email protected]>
    namasl and claude authored Mar 12, 2026
    4b4e734
  2. feat(sim): add PrefixThresholdDecider for prefix-aware PD disaggregation (PR5) (#620)
    
    * feat(sim): add PrefixThresholdDecider and DisaggregationObserver
    
    Adds PrefixThresholdDecider: disaggregates when non-cached token count
    exceeds threshold. Maintains a router-side PrefixCacheIndex under a
    single globalVirtualInstance key to track cluster-wide prefix knowledge.
    
    Also adds DisaggregationObserver interface for stateful deciders that
    learn from routing decisions (ObserveRouting called synchronously by
    ClusterSimulator after each routing decision).
    
    Implements cachedHashes/cachedReqID pattern (mirrors routing_prefix_scorer.go)
    to avoid double-hashing between Decide() and ObserveRouting().
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * feat(sim): register prefix-threshold in validDisaggregationDeciders
    
    Adds "prefix-threshold" to the valid disaggregation decider names in
    bundle.go, enabling IsValidDisaggregationDecider() and
    ValidDisaggregationDeciderNames() to recognize the new decider.
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * feat(sim/cluster,cmd): add PDPrefixThreshold config field and CLI flag
    
    Adds PDPrefixThreshold int to DeploymentConfig for the prefix-threshold
    decider's non-cached token threshold. Adds --pd-prefix-threshold CLI flag
    (default 512) with >= 0 validation. Updates --pd-decider description to
    include "prefix-threshold". Wires PDPrefixThreshold into the config
    construction in cmd/root.go.
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * feat(sim/cluster): wire PrefixThresholdDecider and DisaggregationObserver
    
    - cluster.go: branch on PDDecider=="prefix-threshold" to construct
      PrefixThresholdDecider(PDPrefixThreshold, BlockSizeTokens) directly;
      add notifyDisaggregationObserver helper
    - cluster_event.go: call notifyDisaggregationObserver after RoutingDecisionEvent
      injection (standard routing path, BC-PD-28)
    - pd_events.go: call notifyDisaggregationObserver after PrefillRoutingEvent
      injection (disaggregated path, BC-PD-28)
    - disaggregation_test.go: add integration tests verifying wiring, high/low
      threshold behavior, observer call, and transfer conservation
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * chore: update CLAUDE.md for PrefixThresholdDecider and --pd-prefix-threshold
    
    Documents DisaggregationObserver interface, PrefixThresholdDecider, and
    PDPrefixThreshold in the file organization table. Adds --pd-prefix-threshold
    to the CLI flags list and the disaggregated data flow CLI flags summary.
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(sim): convergence review fixes for prefix-threshold decider PR
    
    Round 1 fixes:
    - Update DisaggregationObserver docstring with R13 note and R17 signal
      freshness guarantee
    - Add noopDisaggregationObserver to satisfy R13 (>=2 implementations)
    - Add explicit comment in DecodeRoutingEvent explaining why observer
      is intentionally not called on the decode sub-path
    - Add 'Adding New Disaggregation Deciders' section to extension-recipes.md
      covering both stateless and stateful patterns
    - Fix Decide() docstring: clarify hash-reuse fires only on non-disaggregated
      path (disaggregated path receives prefill sub-request with different ID)
    
    Round 2 fixes:
    - Remove structural TestPrefixThreshold_DeciderWiredCorrectly; replace
      with compile-time interface assertion (behavioral coverage exists in
      ZeroThresholdAlwaysDisaggregates and HighThresholdNoDisaggregation)
    - Add PDPrefixThreshold field to newTestDisaggDeploymentConfig (R4
      construction site audit)
    - Add PD disaggregation row to CLI Flag Summary table in configuration.md
      including --pd-prefix-threshold and all PD flags introduced in PR1-PR4
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(sim): convergence review fixes for prefix-threshold decider PR (Round 3)
    
    - Add logrus.Warn when --pd-prefix-threshold is explicitly set but
      --pd-decider is not "prefix-threshold" (silent flag ignored, R3 spirit)
    - Add PD Disaggregation section to docs/reference/configuration.md
      explaining decider options, prefix-threshold semantics (non-cached
      tokens vs total tokens), default value meaning, and all PD flags
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(sim/cluster): strengthen BC-PD-28 test to verify observer cache-warming effect
    
    TestPrefixThreshold_ObserverWarmsCache replaces the previous
    TestPrefixThreshold_ObserverCalledAfterRouting which only checked
    causality invariants (TransferCompleteTime != 0) and did not verify
    that the DisaggregationObserver actually warmed the prefix cache.
    
    The new test uses two requests with a shared 192-token prefix
    (threshold=150): req1 disaggregates (192 > 150 with empty cache),
    observer records 12 blocks, req2 arrives later with the same prefix
    (58 non-cached tokens ≤ 150) and is routed locally — proving the
    observer was called and the cache was warmed.
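The two-request scenario can be reproduced with a simplified decider. Everything below is a sketch: the chained FNV block hashing, the set-based index, the method signatures, and the block size of 16 (inferred from 192 tokens mapping to 12 blocks) are assumptions, not the project's implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// prefixIndex is a hypothetical stand-in for the router-side prefix cache index.
type prefixIndex map[uint64]struct{}

// blockHashes returns one chained hash per full block of tokens.
func blockHashes(tokens []int, blockSize int) []uint64 {
	var out []uint64
	var prev uint64
	for start := 0; start+blockSize <= len(tokens); start += blockSize {
		h := fnv.New64a()
		fmt.Fprintf(h, "%d|", prev)
		for _, t := range tokens[start : start+blockSize] {
			fmt.Fprintf(h, "%d,", t)
		}
		prev = h.Sum64()
		out = append(out, prev)
	}
	return out
}

type prefixThresholdDecider struct {
	threshold, blockSize int
	index                prefixIndex
}

// nonCached counts tokens not covered by the longest cached block prefix.
func (d *prefixThresholdDecider) nonCached(tokens []int) int {
	cached := 0
	for _, h := range blockHashes(tokens, d.blockSize) {
		if _, ok := d.index[h]; !ok {
			break
		}
		cached++
	}
	return len(tokens) - cached*d.blockSize
}

// Decide disaggregates when the non-cached token count exceeds the threshold.
func (d *prefixThresholdDecider) Decide(tokens []int) bool {
	return d.nonCached(tokens) > d.threshold
}

// ObserveRouting records the request's blocks so later requests see them as cached.
func (d *prefixThresholdDecider) ObserveRouting(tokens []int) {
	for _, h := range blockHashes(tokens, d.blockSize) {
		d.index[h] = struct{}{}
	}
}

// makeTokens returns n consecutive token IDs starting at offset.
func makeTokens(n, offset int) []int {
	out := make([]int, n)
	for i := range out {
		out[i] = offset + i
	}
	return out
}

func main() {
	d := &prefixThresholdDecider{threshold: 150, blockSize: 16, index: prefixIndex{}}
	shared := makeTokens(192, 0)

	fmt.Println(d.Decide(shared)) // true: 192 > 150 with an empty cache
	d.ObserveRouting(shared)      // records 12 blocks

	req2 := append(append([]int{}, shared...), makeTokens(58, 1000)...)
	fmt.Println(d.nonCached(req2)) // 58
	fmt.Println(d.Decide(req2))    // false: 58 <= 150
}
```

If the observer were never called, the second request would see 250 non-cached tokens and disaggregate, which is why the routing outcome of req2 serves as evidence of cache warming.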
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * docs: fix extension-recipes step count and remove broken configuration.md cross-link
    
    - extension-recipes.md: add missing step 4 (update configuration.md table) and
      fix touch-point count from 3 to 4 for stateless disaggregation deciders
    - configuration.md: replace broken cross-link to non-existent
      architecture.md#pd-disaggregation with descriptive text documenting
      the pool topology constraint (prefill + decode <= num-instances)
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix(sim): restore R1 compliance and state mutation order in DecodeRoutingEvent
    
    - Restore droppedKVAllocations counter and DroppedKVAllocations() accessor
      (R1: count dropped requests — never silent)
    - Move AssignedInstance/DecodeInstanceID/DecodeEnqueueTime assignment to
      after successful AllocateTransferredKV check to prevent inconsistent state
      on failure
    - Restore DroppedKVAllocations() in cmd/root.go anomaly counter output
    - Add INV-PD-1 through INV-PD-5 back to invariants.md for DRY compliance
      (these were in CLAUDE.md but not the canonical standards doc)
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * fix: address review findings — silent config failure, stale docs, structural tests
    
    Critical: Add CLI validation that --pd-decider other than "never" requires
    --prefill-instances and --decode-instances to be set (R1: no silent config
    failure).
    
    Important:
    - Remove superseded disaggregation decider recipe (incorrect field names,
      wrong flag names, stale purity contract) from extension-recipes.md
    - Add prefix-threshold to first PD flags table in configuration.md
    - Replace internal field accesses (cs.parentRequests, cs.transfersInitiated,
      cs.trace, cs.droppedAtDecodeKV) with public accessors (cs.ParentRequests(),
      cs.Trace(), cs.DroppedKVAllocations()) in disaggregation_test.go for
      refactor survival (BDD/TDD principle #5)
    
    Also includes minor comment condensations from code-simplifier agent.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * docs(guide): document Dropped KV Allocations anomaly counter
    
    Add the new PD-mode anomaly counter to the results guide so users
    understand what it means when decode KV allocation fails.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    ---------
    
    Co-authored-by: Claude Sonnet 4.6 <[email protected]>
    namasl and claude authored Mar 12, 2026
    db058d3
  3. merge: bring pd branch up to date with main (#626)

    * feat(sim): MaxModelLen enforcement and MaxOutputLen budget (#567) (#579)
    
    Add vLLM-equivalent max_model_len enforcement at three layers:
    
    1. Startup validation: ceil(MaxModelLen/BlockSize) <= TotalKVBlocks
    2. Enqueue guard: input >= MaxModelLen rejected (matching vLLM serving.py:1542);
       input + MaxOutputLen > MaxModelLen rejected when client declares budget
    3. Runtime stop: force-complete at ProgressIndex >= MaxModelLen (defense-in-depth)
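
The three layers above can be sketched as plain predicates. This is a minimal illustration, assuming enforcement is enabled (`maxModelLen > 0`) and that `maxOutputLen == 0` means the client declared no budget; the function names are hypothetical and do not mirror the simulator's actual signatures.

```go
package main

import "fmt"

// ceilDiv computes ceil(a/b) for a >= 0, b > 0 without forming a+b-1,
// so it cannot overflow near the top of the int64 range.
func ceilDiv(a, b int64) int64 {
	q := a / b
	if a%b != 0 {
		q++
	}
	return q
}

// fitsInKV sketches layer 1: one max-length request must fit in the KV cache.
func fitsInKV(maxModelLen, blockSize, totalKVBlocks int64) bool {
	return ceilDiv(maxModelLen, blockSize) <= totalKVBlocks
}

// admit sketches layer 2: reject on input length alone, or on
// input + declared output budget when a budget is present.
func admit(inputLen, maxOutputLen, maxModelLen int64) bool {
	if inputLen >= maxModelLen {
		return false
	}
	if maxOutputLen > 0 && inputLen+maxOutputLen > maxModelLen {
		return false
	}
	return true
}

// mustStop sketches layer 3: force-complete once progress reaches the limit.
func mustStop(progressIndex, maxModelLen int64) bool {
	return progressIndex >= maxModelLen
}

func main() {
	fmt.Println(fitsInKV(4096, 16, 256)) // true
	fmt.Println(fitsInKV(4097, 16, 256)) // false
	fmt.Println(admit(4096, 0, 4096))    // false
	fmt.Println(admit(4000, 96, 4096))   // true
	fmt.Println(admit(4000, 97, 4096))   // false
	fmt.Println(mustStop(4096, 4096))    // true
}
```

Note that the control-plane checks read only the input length and the declared budget, never the actual output, which is consistent with the INV-9 boundary described below.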
    
    Key design decisions:
    - Oracle Knowledge Boundary (INV-9): control plane never reads OutputTokens.
      Uses MaxOutputLen (client budget) or input-only check. Runtime stop handles
      output growth. Verified by behavioral + structural grep tests.
    - Auto-derive from HF max_position_embeddings for roofline/crossmodel backends,
      with rope_scaling blacklist (excludes su/longrope/llama3 per vLLM), yarn
      special-case using original_max_position_embeddings, and KV-feasible capping.
    - Overflow-safe ceiling division in startup validation (R11).
    - R3 validation at CLI (logrus.Fatalf) and constructor (panic).
    
    New tests (12): BC-1 through BC-5, BC-7 conservation with drops, boundary tests
    (input==MaxModelLen, exact fit), R3 constructor panic, INV-9 structural enforcement.
    
    Partially addresses #529 (reasoning workload livelock) for roofline/crossmodel.
    Blackbox gap tracked in #578.
    
    Closes: #567
    
    Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(latency): MoE-aware roofline latency model (#559) (#561)
    
    * feat(sim): add MoEExpertFFNDim and SharedExpertFFNDim to ModelConfig
    
    Two new fields for MoE-aware roofline: per-routed-expert FFN dimension
    and total shared-expert FFN dimension. Both default to 0 (dense model).
    Zero-value safe for all existing construction sites (R4 audit: all
    dense model configs use zero-valued MoE fields).
    
    Part of #559
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * feat(latency): parse MoE per-expert and shared-expert dims from HF config
    
    Extends GetModelConfigFromHF to parse moe_intermediate_size,
    shared_expert_intermediate_size, and n_shared_experts. Expert count
    resolution chain extended to include num_routed_experts (DeepSeek-V3).
    
    Implements BC-15 through BC-18 from the MoE roofline design.
    Part of #559
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * feat(latency): add MoE consistency validation to ValidateRooflineConfig
    
    Validates: experts>0 requires active>0, active<=total, non-negative
    MoE dimensions. Catches inconsistent MoE configs at construction time.
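
The three rules listed here could look roughly like the following. This is a sketch only: the struct and the `NumExpertsPerTok` field name are assumptions for illustration, and the real `ValidateRooflineConfig` covers more than these checks.

```go
package main

import (
	"errors"
	"fmt"
)

// moeConfig is a hypothetical subset of the model config.
type moeConfig struct {
	NumLocalExperts    int // total routed experts (0 = dense)
	NumExpertsPerTok   int // active experts per token (assumed field name)
	MoEExpertFFNDim    int
	SharedExpertFFNDim int
}

// validateMoE applies the three consistency rules from the commit message.
func validateMoE(c moeConfig) error {
	if c.MoEExpertFFNDim < 0 || c.SharedExpertFFNDim < 0 {
		return errors.New("MoE dimensions must be non-negative")
	}
	if c.NumLocalExperts > 0 && c.NumExpertsPerTok <= 0 {
		return errors.New("experts > 0 requires active experts > 0")
	}
	if c.NumExpertsPerTok > c.NumLocalExperts {
		return errors.New("active experts must not exceed total experts")
	}
	return nil
}

func main() {
	fmt.Println(validateMoE(moeConfig{}))                                         // dense: <nil>
	fmt.Println(validateMoE(moeConfig{NumLocalExperts: 8, NumExpertsPerTok: 2})) // <nil>
	fmt.Println(validateMoE(moeConfig{NumLocalExperts: 8}))                      // error
	fmt.Println(validateMoE(moeConfig{NumLocalExperts: 2, NumExpertsPerTok: 4})) // error
}
```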
    
    Implements BC-12, BC-13, BC-14 from MoE roofline design.
    Part of #559
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * fix(sim): address convergence review findings (I-1, I-2)
    
    I-1: Align SharedExpertFFNDim JSON tag to shared_expert_intermediate_size
         (matches HF config field name convention, consistent with other tags).
    I-2: Add negative NumLocalExperts validation in ValidateRooflineConfig
         (R3 compliance — all numeric parameters validated).
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * feat(latency): MoE-aware FLOPs, active weight bandwidth, and smoke tests
    
    MoE FLOPs (Task 4): calculateTransformerFlops now computes routed
    (top_k), shared, and gate MLP FLOPs for MoE models. Dense models
    use unchanged code path (NumLocalExperts=0 guard).
    
    Active weights (Task 5): calculateMemoryAccessBytes uses top_k
    (active experts) for per-step weight bandwidth, matching vLLM's
    fused_moe kernel behavior. Includes shared expert and gate weights.
    
    Smoke tests (Task 7): Mixtral-8x7B and DeepSeek-V3 step time smoke
    tests plus dense regression anchor (TP=1=12151µs, TP=2=6820µs).
    
    Implements BC-1 through BC-6, BC-10 from MoE roofline design.
    Part of #559
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * fix(latency): use per-expert FFN dim for MoE KV capacity weight estimation
    
    Fixes the critical bug where DeepSeek-V3's general intermediate_size
    (18432) was used as per-expert dim (should be 2048), overestimating
    MLP weights by ~9× and returning zero usable KV blocks.
    
    Changes:
    - KVCapacityParams gains MoEExpertFFNDim and SharedExpertFFNDim fields
    - NewKVCapacityParams gains 2 new positional args (R4 enforced)
    - computeModelWeightBytes uses per-expert dim when nonzero, falls back
      to IntermediateDim (Mixtral convention)
    - ExtractKVCapacityParams propagates new fields, extends expert count
      chain to include num_routed_experts (parity with GetModelConfigFromHF)
    
    Implements BC-7 (per-expert dim fix), BC-9 (param cross-validation),
    BC-11 (dense unchanged). Part of #559
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * fix(latency): convergence review round 2 — R23 parity, documentation, R15
    
    I-1: Align expert count resolution threshold between GetModelConfigFromHF
    and ExtractKVCapacityParams. Both now use >1 threshold (single-expert
    models are dense-equivalent). Fixes R23 code path parity violation.
    
    I-2: Add precondition comments to calculateTransformerFlops and
    calculateMemoryAccessBytes documenting ValidateRooflineConfig requirement.
    
    I-3: Document SharedExpertFFNDim "total dim" semantics — correct due to
    SwiGLU linearity (N × (3 × d × e) == 3 × d × (N × e)).
    
    I-4: Add R15 staleness notes to hardening-validation-cleanup-plan.md and
    pr2-kv-capacity-auto-calculate-plan.md (NewKVCapacityParams now 6-arg).
    
    I-5: Document active vs total weight distinction in calculateMemoryAccessBytes
    to prevent future R23 regression.
    
    Part of #559
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * fix(latency): align MoE threshold to > 1 across all consumption paths (R23)
    
    Parsing layer already used > 1 (single-expert models are
    dense-equivalent). Consumption paths (calculateTransformerFlops,
    calculateMemoryAccessBytes, crossmodel isMoE, ValidateRooflineConfig,
    computeModelWeightBytes) now use > 1 as well, matching the documented
    design intent and resolving the R23 code path parity violation.
    
    Also fixes stale doc comment in ExtractKVCapacityParams ("> 0" → "> 1").
    
    Round 3 convergence review fixes.
    Part of #559
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * fix(cmd): update stale MoE warning, gofmt alignment, R15 crossmodel plan
    
    - cmd/root.go: Replace misleading "assumes dense transformers" warning
      with accurate MoE info message (roofline now models per-expert FLOPs)
    - sim/model_hardware_config.go: Run gofmt to fix struct field alignment
    - docs/plans/pr472b-crossmodel-backend-plan.md: Add R15 staleness note
      for threshold change (> 0 → > 1)
    
    Round 4 convergence review fixes.
    Part of #559
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * refactor(latency): port llm-optimizer single-crossover roofline physics
    
    Replace dual-ceiling model (GEMM + vector ceilings) with single-crossover:
      step_time = max(total_flops / (peak * MFU), total_bytes / peak_bandwidth)
    
    Remove bandwidth haircut (BwEffConstant no longer used in step time).
    Remove all overhead terms (TOverheadMicros, PerLayerOverhead, AllReduceLatency).
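
As a minimal sketch of the single-crossover formula above: the step is either compute-bound or memory-bound, and whichever takes longer sets the step time. The function name, argument order, and the hardware numbers in `main` are illustrative assumptions, not values from the repository.

```go
package main

import (
	"fmt"
	"math"
)

// stepTimeSeconds implements
//   step_time = max(total_flops / (peak * MFU), total_bytes / peak_bandwidth)
// with no bandwidth haircut and no additive overhead terms.
func stepTimeSeconds(totalFlops, totalBytes, peakFlops, mfu, peakBandwidth float64) float64 {
	computeTime := totalFlops / (peakFlops * mfu)
	memoryTime := totalBytes / peakBandwidth
	return math.Max(computeTime, memoryTime)
}

func main() {
	// Illustrative hardware: 1e15 FLOP/s peak, 1e12 B/s bandwidth, MFU 0.5.
	const peak, bw, mfu = 1e15, 1e12, 0.5

	// Small batch: memory time (1e10 / 1e12) dominates compute time.
	fmt.Println("memory-bound step:", stepTimeSeconds(1e12, 1e10, peak, mfu, bw))

	// Ten times the FLOPs over the same bytes: compute time now dominates.
	fmt.Println("compute-bound step:", stepTimeSeconds(1e13, 1e10, peak, mfu, bw))
}
```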
    
    Keeps BLIS's superior model-awareness: actual IntermediateDim, SwiGLU
    3-matrix MLP, MoE support, FlashAttention-aware memory model.
    
    Motivation: BLIS roofline has 215% ITL MAPE vs llm-optimizer's 36.5%.
    The dual ceiling + bandwidth haircut + overhead stacking caused ~3x
    systematic over-prediction for memory-bound decode steps.
    
    Design: docs/plans/2026-03-09-roofline-llm-optimizer-port-design.md
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * config: update MFU values to llm-optimizer defaults (0.45/0.30)
    
    MfuPrefill: 0.65 → 0.45, MfuDecode: 0.12 → 0.30 for all GPU entries.
    These values match llm-optimizer's defaults which achieve 36.5% ITL MAPE
    on the sim-to-real evaluation (discussion #522).
    
    Other HardwareCalib fields (BwEffConstant, overheads) remain unchanged
    for backward compatibility — they are no longer used by rooflineStepTime()
    but may be consumed by other callers.
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * docs: add roofline llm-optimizer port design and implementation plan
    
    Design doc: decision record for porting llm-optimizer's single-crossover
    roofline physics into BLIS.
    Implementation plan: 3 tasks (physics rewrite, MFU update, verification).
    
    Motivation: discussion #522 sim-to-real accuracy validation.
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * fix(latency): load weights once per step in roofline (unified forward pass)
    
    vLLM chunked prefill processes all tokens (prefill + decode) in a single
    forward pass — weights are loaded from HBM once per step, not once per
    phase. The previous implementation loaded weights independently for
    prefill and decode phases, doubling the memory-bound term for mixed
    batches (~2x over-prediction).
    
    Sources: vLLM V1 blog ("all selected requests are flattened and
    concatenated into one long super-sequence for that single forward pass"),
    Sarathi-Serve OSDI'24 ("cost of loading model weights from HBM is
    amortized across all prompts in a batch").
    
    Adds TestRooflineStepTime_MixedBatch_WeightsLoadedOnce which verifies
    the overhead of adding prefill to a decode step is much less than a
    full weight load (7µs vs 4166µs).
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * fix(latency): use 2-matrix MLP in roofline FLOPs and weight calculation
    
    Change MLP factor from 3 (SwiGLU gate+up+down) to 2 (up+down) in both
    calculateTransformerFlops and calculateMemoryAccessBytes, matching
    llm-optimizer's formulation.
    
    For models like Llama-2-70B where IntermediateDim=28672, the 3-matrix
    formula produced 31% more MLP weight bytes than llm-optimizer's
    2-matrix formula, directly inflating memory-bound decode predictions.
    
    Applies to both dense and MoE paths (routed + shared expert FLOPs/weights).
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * config: bump MFU values to 0.55/0.35 to reduce roofline over-prediction
    
    MfuPrefill: 0.45 → 0.55 (reduces compute-bound prefill/TTFT predictions ~18%)
    MfuDecode: 0.30 → 0.35 (reduces near-crossover decode predictions ~14%)
    
    Motivation: after porting llm-optimizer single-crossover physics, BLIS
    roofline still over-predicts by ~50% MAPE. Higher MFU reflects observed
    H100 tensor core utilization for large prefill GEMMs and batched decode.
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * fix(latency): restore SwiGLU 3-matrix MLP, revert MFU bump
    
    Revert MFU values to llm-optimizer defaults (0.45/0.30) — the bump
    to 0.55/0.35 went the wrong direction (both models under-predict).
    
    Restore 3-matrix MLP (gate + up + down) for SwiGLU, replacing the
    2-matrix formula copied from llm-optimizer. SwiGLU actually has 3
    weight matrices that all need HBM loading: this is the physically
    correct formula and increases weight bytes by ~37%, which reduces
    the under-prediction from ~50% toward the target.
    
    Dense and MoE paths both updated consistently (R23).
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * feat(latency): conditional SwiGLU detection via HiddenAct field
    
    Add mlpMatrixCount() helper that returns 3 for SwiGLU (silu/swiglu/geglu)
    or 2 for standard (gelu/relu) MLP. Parsed from HF config's hidden_act
    field. Empty defaults to SwiGLU since most modern LLMs use it.
    
    Both calculateTransformerFlops and calculateMemoryAccessBytes now use
    nMat instead of hardcoded 3, correctly handling non-SwiGLU models.
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * fix(latency): revert to 2-matrix MLP convention matching llm-optimizer
    
    3-matrix with raw intermediate_size over-predicts for models like
    Llama2-70B whose intermediate_size (28672) exceeds the standard SwiGLU
    (2/3 × 4d) convention. Using nMat=2 matches llm-optimizer's approach
    where 2 × d × intermediate ≈ physical weight count for most models.
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * fix(latency): remove MoE-specific branches from roofline step time
    
    Roofline now treats MoE models identically to dense (matching
    llm-optimizer which has no MoE-specific handling). MoE fields
    (NumLocalExperts, MoEExpertFFNDim, SharedExpertFFNDim) are still
    used by KV capacity (kv_capacity.go) for GPU memory budgeting.
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * fix(latency): MoE roofline scales weights by E, FLOPs by top_k
    
    Mixtral was under-predicted by ~10x because the dense treatment loaded
    1 expert's MLP weights instead of all 8. Fix:
    - Weight bandwidth: E × MLP weights (all experts loaded from HBM per step)
    - FLOPs: top_k × MLP FLOPs (only active experts compute per token)
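
The asymmetry in this fix (all experts contribute to weight traffic, only the active ones to compute) can be sketched per layer as below. The formulas are a simplified assumption: they count only MLP matrices of shape `d × dFF`, use 2 FLOPs per multiply-add, and ignore attention, gating, and shared experts. Function and parameter names are hypothetical.

```go
package main

import "fmt"

// mlpWeightBytes: every expert's MLP weights are read from HBM each step,
// so the byte count scales with the total expert count E.
func mlpWeightBytes(numExperts, nMat, d, dFF, bytesPerParam int64) int64 {
	return numExperts * nMat * d * dFF * bytesPerParam
}

// mlpFlopsPerToken: each token is processed by only topK experts,
// so FLOPs scale with topK rather than E.
func mlpFlopsPerToken(topK, nMat, d, dFF int64) int64 {
	return topK * 2 * nMat * d * dFF
}

func main() {
	// Mixtral-8x7B-like shapes: 8 experts, top-2 routing, 2-byte weights.
	const e, topK, nMat, d, dFF, bpp = 8, 2, 2, 4096, 14336, 2

	dense := mlpWeightBytes(1, nMat, d, dFF, bpp)
	moe := mlpWeightBytes(e, nMat, d, dFF, bpp)
	fmt.Println(moe / dense) // 8: the dense treatment missed 7 of 8 experts

	fmt.Println(mlpFlopsPerToken(topK, nMat, d, dFF) / mlpFlopsPerToken(1, nMat, d, dFF)) // 2
}
```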
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * fix(latency): use MoEExpertFFNDim in roofline when set
    
    For DeepSeek-V3 style models where intermediate_size (18432) differs
    from per-expert dim (2048), use MoEExpertFFNDim for MoE weight and
    FLOP calculations. Falls back to IntermediateDim when unset (Mixtral).
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * fix(latency): address PR #561 review — revert crossmodel scope, fix docs
    
    - Revert crossmodel MoE threshold from > 1 back to > 0 (scope violation:
      crossmodel behavioral change doesn't belong in a roofline PR)
    - Fix design doc table and CLI comment claiming roofline models shared
      experts and gate FLOPs (it doesn't — only KV capacity does)
    - Fix HiddenAct comments that incorrectly claim it selects 3-matrix vs
      2-matrix MLP (mlpMatrixCount always returns 2)
    - Document intentional 2-matrix (roofline) vs 3-matrix (KV capacity)
      design choice with cross-references in both files
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    ---------
    
    Co-authored-by: Claude Opus 4.6 <[email protected]>
    Co-authored-by: Srinivasan Parthasarathy <[email protected]>
    
    * fix(sim): PR #567 follow-up — validation gaps, LengthCappedRequests counter, INV-9 extension (#580) (#587)
    
    - Add negative MaxModelLen validation in NewSimulator (BC-1: defense-in-depth for struct literal bypass)
    - Add LengthCappedRequests metric counter across 5-file pattern (BC-2, BC-3, BC-4)
    - Add end-to-end sim.Run() test for BC-5 runtime length cap path
    - Extend INV-9 structural test to scan sim/cluster/ control-plane files (BC-6)
    - Add negative MaxOutputLen validation in EnqueueRequest (BC-7: R3 gap)
    - Add gemma3 model_type exclusion for rope_scaling (BC-9: matches vLLM)
    - Add rope_scaling parse-failure warnings for malformed HF configs (BC-8)
    - Fix kvFeasibleMax comment accuracy (blockSizeTokens is configurable, not 16)
    
    Fixes #580
    
    Co-authored-by: Claude <[email protected]>
    
    * refactor(sim): remove dead HardwareCalib fields — BwEffConstant, TOverheadMicros, PerLayerOverhead, AllReduceLatency (#596)
    
    These fields became dead code after the roofline physics port (llm-optimizer
    single-crossover model). No runtime code path reads them; ValidateRooflineConfig
    enforced BwEffConstant > 0 on a value nothing consumed. Removing them eliminates
    config-file clutter and prevents future contributors from assuming they're active.
    
    Fixes #590
    
    Co-authored-by: Claude Opus 4.6 <[email protected]>
    
    * Configure claude on GH Actions (#600)
    
    Signed-off-by: Jing Chen <[email protected]>
    
    * Enable claude on PRs (#601)
    
    Signed-off-by: Jing Chen <[email protected]>
    
    * ignore training and actions runner (#607)
    
    Signed-off-by: Srinivasan Parthasarathy <[email protected]>
    
    * fix(sim): PR #580 deferred items — rope_scaling extraction, MaxModelLen int64, tests, docs (#606)
    
    Complete 7 deferred hardening items from issue #580 (PR #587 handoff):
    
    1. Extract applyRopeScaling as a pure function with 26 table-driven test
       cases covering blacklist (su/longrope/llama3), mrope fall-through,
       gemma3 substring match (handles text_config pivot), yarn original base,
       overflow guards, NaN/Inf defense, degenerate inputs.
    
    2. Change MaxModelLen from int to int64 for consistency with ProgressIndex,
       TotalKVBlocks, BlockSizeTokens. Updates 6 type sites, removes redundant
       int64() casts, adds int64() widening at EnqueueRequest comparison sites.
    
    3. Add cluster-mode MaxModelLen drop test (BC-6): Guard 1a (input >= limit)
       and Guard 1b (input + budget > limit), INV-1 conservation, inFlightRequests
       drain, Metrics.Requests map cleanup.
    
    4. Add chunked prefill + MaxModelLen interaction test (BC-7): verifies no
       spurious force-completion during multi-chunk prefill (TotalOutputTokens=49,
       LengthCappedRequests=0, TTFT recorded).
    
    5. Add glossary entries for MaxModelLen and Oracle Knowledge Boundary (INV-9).
    
    6. Refine rope_scaling documentation with explicit blacklist details.
    
    7. Fix pre-existing gemma3 bug: ParseHFConfig's text_config pivot overwrites
       model_type from "gemma3" to "gemma3_text", making the exact-match check
       dead code. Changed to strings.Contains to match vLLM's substring semantics.
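
Item 7 is easy to reproduce in isolation. The sketch below assumes only what the message states: after the text_config pivot the model type reads "gemma3_text", so an exact comparison never matches while a substring check does. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// isGemma3Exact is the pre-fix check: it misses "gemma3_text".
func isGemma3Exact(modelType string) bool {
	return modelType == "gemma3"
}

// isGemma3 is the post-fix check using substring semantics.
func isGemma3(modelType string) bool {
	return strings.Contains(modelType, "gemma3")
}

func main() {
	pivoted := "gemma3_text" // value after the text_config pivot
	fmt.Println(isGemma3Exact(pivoted)) // false
	fmt.Println(isGemma3(pivoted))      // true
}
```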
    
    Related to #580. Discovered issues: #602, #603, #604, #605.
    
    Co-authored-by: Claude <[email protected]>
    
    * feat(latency): add trained-roofline backend with roofline basis functions × learned corrections (#616)
    
    * feat(latency): register trained-roofline backend name (BC-1)
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(latency): add PostDecodeFixedOverhead to interface + implement TrainedRooflineLatencyModel (BC-3,6,7,8,9,11,15)
    
    - Add PostDecodeFixedOverhead() int64 to LatencyModel interface
    - Existing backends (blackbox, roofline, crossmodel) return 0
    - Simulator recordRequestCompletion adds PostDecodeFixedOverhead to E2E
    - TrainedRooflineLatencyModel: 6 roofline basis functions with learned corrections
    - Zero heap allocations in StepTime (19ns/op, 0 allocs/op)
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(latency): wire trained-roofline factory in NewLatencyModel (BC-2,13,14)
    
    - Full validation: TP, NumLayers, NumHeads, HiddenDim, IntermediateDim,
      TFlopsPeak, BwPeakTBs, NumHeads%TP, NumKVHeads%TP, 7 beta coefficients
    - Derives architecture features at construction: headDim, dKV, dFF, kEff
    - Table-driven error tests for all validation paths
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * test(latency): add monotonicity behavioral tests for trained-roofline (BC-4,5)
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(latency): add trained-roofline defaults + CLI loading (BC-10,12)
    
    - Add trained_roofline_defaults section to defaults.yaml with 7 betas + 3 alphas
    - Add TrainedRooflineDefaults struct to cmd/default_config.go
    - CLI handling: 4 sites in cmd/root.go (loading block, zero-coefficients guard,
      HFConfig parsing, help text)
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * docs: add trained-roofline to latency model documentation
    
    - CLAUDE.md: "Four modes", file tree, Key Data Flow
    - sim/config.go: Backend field comment
    - sim/latency/latency.go: package doc
    - docs/concepts/core-engine.md: "four latency model backends"
    - docs/concepts/glossary.md: "Four modes" + trained-roofline description
    - Plan committed alongside implementation
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * fix(sim): guard PostDecodeFixedOverhead for zero-output requests + fix ITL contamination
    
    - PostDecodeFixedOverhead only applied when len(OutputTokens) > 0
    - RequestITLs computed from itlSum directly (not lat-FirstTokenTime) to
      avoid contaminating per-token average ITL with fixed overhead
    - Add zero-alpha warning for trained-roofline CLI path
    
    Caught by code review Step 4.5.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * docs(guide): comprehensive trained-roofline section in latency models guide
    
    - Add trained-roofline section with formula, alpha model, accuracy caveats
    - Update comparison table to 4 backends
    - Update recommendation: trained-roofline is now the default for new models
    - Update pluggable architecture to show 4 interface methods
    - Fix cross-model description accuracy
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * fix: convergence Round 1 fixes — extension recipe, slice copy, zero-alloc test, config ref
    
    - Extension recipe: 3→4 methods, added bundle.go + CLI wiring touch points,
      added trained-roofline as 4th example
    - Factory: defensive copy of beta/alpha slices to enforce "frozen" contract
    - Test: add TestTrainedRoofline_StepTime_ZeroAllocs using testing.AllocsPerRun
    - Configuration reference: add trained-roofline to --latency-model flag description
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * docs: add trained-roofline to quickstart + document non-blocking overhead pattern
    
    - Quickstart: add trained-roofline example (recommended for new models)
    - recordRequestCompletion: document that E2E includes non-blocking
      PostDecodeFixedOverhead and OutputTokenProcessingTime, explaining why
      RequestCompletionTimes exceeds RequestLeftEvent timestamp
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * docs: self-audit — update models.md, roofline.md, tutorial.md for trained-roofline
    
    All documentation working copies now mention trained-roofline consistently.
    Source-of-truth map: 12/12 working copies updated.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    ---------
    
    Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(sim): populate MaxOutputLen on all workload paths + engine auto-fill (#621)
    
    * feat(sim): add MaxOutputLen auto-fill in EnqueueRequest (BC-1..BC-4)
    
    - Auto-fill MaxOutputLen = maxModelLen - len(InputTokens) when client
      omits budget (MaxOutputLen==0) and maxModelLen > 0
    - Mirrors vLLM input_processor.py:554 safety cap
    - No auto-fill when client sets budget (BC-2), unlimited mode (BC-3),
      or input exceeds context (BC-4)
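
The four cases (BC-1..BC-4) reduce to a short decision function. This is a sketch under the assumptions stated in the bullets: `maxOutputLen == 0` means "no client budget" and `maxModelLen <= 0` means unlimited mode. The function name is hypothetical and the real logic lives inside EnqueueRequest.

```go
package main

import "fmt"

// autoFillMaxOutputLen returns the output budget to use for a request.
func autoFillMaxOutputLen(inputLen, maxOutputLen, maxModelLen int64) int64 {
	if maxOutputLen != 0 {
		return maxOutputLen // BC-2: client-set budget is left untouched
	}
	if maxModelLen <= 0 {
		return 0 // BC-3: unlimited mode, nothing to derive from
	}
	if inputLen > maxModelLen {
		return 0 // BC-4: input exceeds context, no auto-fill
	}
	return maxModelLen - inputLen // BC-1: remaining context becomes the budget
}

func main() {
	fmt.Println(autoFillMaxOutputLen(1000, 0, 4096))   // 3096
	fmt.Println(autoFillMaxOutputLen(1000, 256, 4096)) // 256
	fmt.Println(autoFillMaxOutputLen(1000, 0, 0))      // 0
	fmt.Println(autoFillMaxOutputLen(5000, 0, 4096))   // 0
}
```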
    
    Refs: #572
    
    Co-Authored-By: Claude <[email protected]>
    
    * feat(workload): set MaxOutputLen on all request construction sites (BC-5..BC-7)
    
    - generator.go: MaxOutputLen = len(outputTokens) (synthetic/multimodal)
    - replay.go: MaxOutputLen = len(outputTokens) (trace v2 replay)
    - reasoning.go: MaxOutputLen = len(outputTokens) (multi-turn reasoning)
    - Matches inference-perf pattern: max_tokens = sampled output length
    
    Fixes #572
    
    Co-Authored-By: Claude <[email protected]>
    
    * docs(sim): update EnqueueRequest doc comment for auto-fill preprocessing
    
    Co-Authored-By: Claude <[email protected]>
    
    * docs(test): update stale MaxOutputLen=0 comments for auto-fill semantics
    
    - Three tests referenced 'input-only check' for MaxOutputLen=0
    - After auto-fill, MaxOutputLen is set to maxModelLen - input
    - Tests still pass numerically; comments now reflect actual behavior
    
    Co-Authored-By: Claude <[email protected]>
    
    ---------
    
    Co-authored-by: Claude <[email protected]>
    
    * docs: switch default example model to public Qwen/Qwen3-14B (#608)
    
    * docs: switch default example model to public qwen/qwen2.5-7b-instruct
    
    Replace gated meta-llama/llama-3.1-8b-instruct with publicly available
    qwen/qwen2.5-7b-instruct in all user-facing docs (README, quickstart,
    tutorial, guides, reference, CLAUDE.md, CONTRIBUTING.md). Roofline/crossmodel
    examples now work without HF authentication.
    
    Set qwen default TP=1 in defaults.yaml so examples use the default without
    explicit --tp flags. Update KV block count, coefficient examples, and prose
    references to match TP=1 values.
    
    Fixes #545
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * chore(defaults): update vllm version to v0.11.0 for 4 models (H100 TP=1)
    
    Update default and trained-coefficient vllm_version for
    qwen2.5-7b-instruct, qwen3-14b, llama-3.1-8b-instruct, and
    qwen2.5-3b-instruct to vllm/vllm-openai:v0.11.0.
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * docs: switch default example model from qwen2.5-7b to qwen3-14b
    
    Qwen3-14B (Qwen/Qwen3-14B) is a newer, publicly available model with
    pre-trained coefficients already in defaults.yaml. Update all
    documentation examples and references accordingly.
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * fix: address review comments — stale refs, tutorial throughput
    
    - Fix "LLaMA 3.1 8B" comment in experimentation.md (issue #3)
    - Update stale llama-3.1-8b/132,139 refs in configuration.md (issue #4)
    - Recalibrate tutorial for qwen3-14b throughput: ~2.5 req/s per
      instance, target 20 req/s (was 57 req/s / 500 req/s for llama)
    - Scale experimentation.md example to match (20 req/s, not 400)
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    * docs: add HF_TOKEN tip to quickstart and README for gated models
    
    Roofline/trained-roofline/crossmodel modes auto-fetch from HuggingFace,
    which fails for gated models without authentication. Add a lightweight
    tip after the first roofline example in both files recommending HF_TOKEN
    for gated model access and rate limit avoidance.
    
    Co-Authored-By: Claude Opus 4.6 <[email protected]>
    
    ---------
    
    Co-authored-by: Claude Opus 4.6 <[email protected]>
    
    ---------
    
    Signed-off-by: Jing Chen <[email protected]>
    Signed-off-by: Srinivasan Parthasarathy <[email protected]>
    Co-authored-by: Srinivasan Parthasarathy <[email protected]>
    Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
    Co-authored-by: Dipanwita Guhathakurta <[email protected]>
    Co-authored-by: Jing Chen <[email protected]>
    5 people authored Mar 12, 2026
    ec8a48e
  4. fix(sim): guard decode-only batch path against zero-input requests (#628)
    
    The PD decode-only path in VLLMBatchFormation.FormBatch() used the
    condition `ProgressIndex >= inputLen` to detect decode sub-requests
    with pre-allocated KV. For zero-input requests (len(InputTokens)==0),
    this condition is satisfied since ProgressIndex(0) >= inputLen(0),
    causing non-PD requests to incorrectly take the decode-only path.
    
    Add `ProgressIndex > 0` guard so the path only fires for PD decode
    sub-requests where AllocateTransferredKV explicitly set ProgressIndex
    to inputLen (which is > 0 for real requests). This ensures the pd
    branch is a fully transparent drop-in replacement for non-PD users.
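
The before/after conditions can be compared directly. The sketch below isolates just the predicate; the function names are hypothetical and the surrounding FormBatch logic is omitted.

```go
package main

import "fmt"

// isDecodeOnlyOld is the pre-fix condition: true for any request whose
// progress has reached its input length, including zero-input requests.
func isDecodeOnlyOld(progressIndex, inputLen int64) bool {
	return progressIndex >= inputLen
}

// isDecodeOnly adds the ProgressIndex > 0 guard from this commit.
func isDecodeOnly(progressIndex, inputLen int64) bool {
	return progressIndex > 0 && progressIndex >= inputLen
}

func main() {
	// Zero-input, non-PD request: ProgressIndex == inputLen == 0.
	fmt.Println(isDecodeOnlyOld(0, 0), isDecodeOnly(0, 0)) // true false

	// PD decode sub-request: transferred KV set ProgressIndex to inputLen.
	fmt.Println(isDecodeOnlyOld(128, 128), isDecodeOnly(128, 128)) // true true

	// Fresh request that still needs prefill.
	fmt.Println(isDecodeOnlyOld(0, 128), isDecodeOnly(0, 128)) // false false
}
```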
    
    Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
    namasl and claude authored Mar 12, 2026
    1172a5a

Commits on Mar 13, 2026

  1. feat(sim/cluster): per-pool hardware configuration (PD Phase 2, PR1) (#637)
    
    * feat(sim/cluster): add PoolOverrides type and ResolvePoolConfig function
    
    Pure config resolver for per-pool hardware overrides (BC-P2-1, BC-P2-2).
    Pointer types for TP/MaxModelLen/TotalKVBlocks (R9).
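
Pointer-typed override fields let a resolver distinguish "not set" (nil) from an explicit value, including an explicit zero. A minimal sketch of that pattern follows; the lowercase type and function names are hypothetical stand-ins and do not reproduce the actual PoolOverrides or ResolvePoolConfig definitions.

```go
package main

import "fmt"

// instanceConfig is a hypothetical subset of the global instance settings.
type instanceConfig struct {
	TP            int
	MaxModelLen   int64
	TotalKVBlocks int64
}

// poolOverrides uses pointers so nil means "inherit the global value".
type poolOverrides struct {
	TP            *int
	MaxModelLen   *int64
	TotalKVBlocks *int64
}

// resolvePoolConfig is pure: it copies the global config and applies only
// the overrides that are set, leaving its inputs unmodified.
func resolvePoolConfig(global instanceConfig, o poolOverrides) instanceConfig {
	resolved := global
	if o.TP != nil {
		resolved.TP = *o.TP
	}
	if o.MaxModelLen != nil {
		resolved.MaxModelLen = *o.MaxModelLen
	}
	if o.TotalKVBlocks != nil {
		resolved.TotalKVBlocks = *o.TotalKVBlocks
	}
	return resolved
}

func main() {
	global := instanceConfig{TP: 1, MaxModelLen: 8192, TotalKVBlocks: 1000}
	tp := 4
	prefill := resolvePoolConfig(global, poolOverrides{TP: &tp})
	fmt.Printf("%+v\n", prefill) // {TP:4 MaxModelLen:8192 TotalKVBlocks:1000}
	fmt.Printf("%+v\n", global)  // {TP:1 MaxModelLen:8192 TotalKVBlocks:1000}
}
```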
    
    Part of #633
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(sim/cluster): add per-pool override fields and resolveConfigForRole
    
    DeploymentConfig gains PrefillOverrides/DecodeOverrides (PoolOverrides).
    resolveConfigForRole dispatches to ResolvePoolConfig per role.
    
    Part of #633
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(sim/cluster): add BuildPoolMembershipFromIndices for pre-construction use
    
    Index-based variant of BuildPoolMembership that generates instance IDs
    using the same naming convention without requiring constructed instances.
    Existing function retained for backward compatibility (BC-P2-5).
    
    Part of #633
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(sim/cluster): per-pool config in NewClusterSimulator instance construction
    
    Refactor instance construction to resolve per-pool config using
    resolveConfigForRole. Pool membership computed from indices before
    instances (INV-P2-1). Reuses prePoolMembership to avoid redundant
    BuildPoolMembership call.
    
    Part of #633
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(cmd): add per-pool CLI flags for hardware overrides
    
    8 new flags: --prefill-tp, --decode-tp, --prefill-hardware,
    --decode-hardware, --prefill-latency-model, --decode-latency-model,
    --prefill-max-model-len, --decode-max-model-len.
    Uses cmd.Flags().Changed() for R18 flag precedence.
    
    Part of #633
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(cmd): per-pool KV auto-calculation for analytical backends
    
    When per-pool TP/GPU differs from global and analytical backend is active,
    CalculateKVBlocks is called per-pool with pool-specific parameters (BC-P2-4).
    Override construction moved before analytical backend block for scope access.
    
    Part of #633
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * test(sim/cluster): INV-P2-1 invariant test and heterogeneous cluster helper
    
    Verifies pool-config consistency with heterogeneous KV capacity.
    Adds newHeterogeneousDeploymentConfig helper for future PR consumption.
    
    Part of #633
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * docs: update CLAUDE.md with per-pool config files and CLI flags
    
    Add resolve.go to file organization, per-pool CLI flags to root.go
    description, INV-P2-1 to invariants section, per-pool hardware flags
    to disaggregated data flow section.
    
    Part of #633
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * fix: address convergence review findings for per-pool hardware config
    
    - R3: validate per-pool TP (>0), MaxModelLen (>0), and latency backend
      names at CLI boundary before passing to cluster construction
    - Fix comment typo: decodeLatencyModel was labeled "prefill pool"
    - Move ValidatePoolTopology before BuildPoolMembershipFromIndices to
      fail fast before instance allocation
    - Warn when per-pool flags are set but PD mode is not active
    - Enhance TestINV_P2_1 to verify per-instance KV capacity via observable
      FreeKVBlocks() before simulation (not just post-sim completion)
    - Add TestResolvePoolConfig_Idempotent as R7 companion invariant test
    - Document struct-copy safety and latency backend constraint in resolve.go
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * docs: address convergence review findings for per-pool hardware config
    
    - Add INV-P2-1 (Pool-Config Consistency) to canonical invariants.md
    - Add Per-Pool Hardware Overrides section to configuration.md with flag
      table, KV auto-calc explanation, known limitation, and CLI example
    - Update CLI flag summary table to include all 8 per-pool flags
    - Fix stale "homogeneous instances" note in cluster.md guide
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * fix: address PR review findings — error returns, stale comments, R23/R11 guards
    
    - C1/C2: Run() now returns error on INV-PD-3 violation instead of warn-and-continue;
      negative inFlightRequests logged at Error level
    - C3: DeploymentConfig docstring updated to reflect per-pool overrides
    - C4: NewLatencyModel docstring includes trained-roofline; stale backend
      enumerations and priority comments fixed
    - I1: MoE threshold in trained-roofline factory changed from > 0 to > 1 (R23 parity)
    - I5: KVUtilization() division guard added (R11)
    - I12-I14: Fixed misleading DisaggregationDecisionEvent ordering comment,
      incorrect R9 rationale, and wrong ParentRequest distinction method
    - S7: Compile-time DisaggregationObserver interface check for PrefixThresholdDecider
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * fix: R23 crossmodel MoE threshold parity (> 0 → > 1)
    
    Crossmodel backend used NumLocalExperts > 0 for MoE detection while
    roofline, trained-roofline, and kv_capacity all used > 1. Single-expert
    models (NumLocalExperts=1) are dense-equivalent and should not trigger
    MoE dispatch overhead in step time estimation.
    
    Align crossmodel with all other backends and update stale docs reference
    in docs/guide/latency-models.md.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
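The threshold change described above can be illustrated with a small sketch. The helper name isMoE is hypothetical; only the NumLocalExperts field name and the > 0 versus > 1 comparison come from the commit message.

```go
package main

import "fmt"

// isMoE reports whether a model should be treated as mixture-of-experts.
// A single-expert model (numLocalExperts == 1) is dense-equivalent,
// so the comparison is > 1 rather than > 0.
func isMoE(numLocalExperts int) bool {
	return numLocalExperts > 1
}

func main() {
	for _, n := range []int{0, 1, 8} {
		fmt.Printf("NumLocalExperts=%d moe=%v\n", n, isMoE(n))
	}
}
```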
    
    ---------
    
    Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
    namasl and claude authored Mar 13, 2026
    b123658
  2. feat(sim/cluster): KV transfer contention model (PD Phase 2, PR2) (#639)

    * feat(sim/cluster): KV transfer contention model (PD Phase 2, PR2)
    
    Model shared-bandwidth effects when multiple KV transfers overlap.
    When --pd-transfer-contention is enabled, each concurrent transfer
    receives fair-share bandwidth: effective_bw = total_bw / max(1, N).
    
    - Active transfer counter (increment on start, decrement on completion)
    - Fair-share bandwidth division in KVTransferStartedEvent.Execute()
    - CLI flag --pd-transfer-contention (bool, default false)
    - INV-P2-2 invariant: effective_bandwidth = total_bandwidth / max(1, active)
    - Contention metrics: PeakConcurrentTransfers, MeanTransferQueueDepth
    - BC-P2-5: single transfer identical to Phase 1
    - BC-P2-7: INV-PD-3 (transfer conservation) still holds
    - Backward compatible: feature off by default
    
    Closes #634
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
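The fair-share rule stated above, effective_bw = total_bw / max(1, N), can be sketched as follows. The function name effectiveBandwidth and its parameters are hypothetical; the repository applies the division inside KVTransferStartedEvent.Execute(), whose actual code is not shown here.

```go
package main

import "fmt"

// effectiveBandwidth splits the total bandwidth evenly across the
// currently active transfers. The divisor is clamped to at least 1,
// so a single transfer (or a zero count) sees the full bandwidth.
func effectiveBandwidth(totalGBps float64, activeTransfers int) float64 {
	n := activeTransfers
	if n < 1 {
		n = 1
	}
	return totalGBps / float64(n)
}

func main() {
	for _, n := range []int{0, 1, 2, 4} {
		fmt.Printf("N=%d effective=%.2f GB/s\n", n, effectiveBandwidth(100, n))
	}
}
```

With N = 1 the result equals the total bandwidth, which matches the BC-P2-5 claim that a single transfer behaves as in Phase 1.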
    
    * docs(standards): add INV-P2-2 to canonical invariants.md
    
    Convergence review (Round 1) identified that INV-P2-2 (transfer
    fair-share) was added to CLAUDE.md but not to the canonical source
    docs/contributing/standards/invariants.md. This fixes the DRY
    violation by adding the full invariant definition with statement,
    verification references, mechanism, and hypothesis family.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * fix(sim/cluster): address PR review findings for KV transfer contention model
    
    C1: add activeTransfers underflow guard in KVTransferCompletedEvent.Execute()
        matching the inFlightRequests pattern (logrus.Errorf + clamp-to-zero, R1)
    
    C2: add post-simulation diagnostic when activeTransfers != 0 at Run() exit,
        making horizon-truncated contention state visible to operators
    
    I1: fix INV-P2-2 formula description — replace max(1, active_transfers) with
        accurate two-case description in pd_events.go, invariants.md, and CLAUDE.md
    
    I2: document split ownership in CollectPDMetrics — PeakConcurrentTransfers and
        MeanTransferQueueDepth must be attached by callers after CollectPDMetrics
    
    I3: change transferDepthSum/transferStartCount to int64 to prevent silent
        arithmetic overflow on 32-bit platforms
    
    I4: remove stale "Phase 2, PR2" locator comments (R15) — replace with stable
        functional descriptions referencing INV-P2-2 and the flag name
    
    S1: replace formula unit test with behavioral test that drives actual
        ClusterSimulator and measures observed TransferCompleteTime - TransferStartTime
    
    S2: change t.Skipf to t.Fatalf in BCP26_FairShareDivision — concurrent transfers
        must occur; skip silently hid coverage gaps
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    * fix(sim/cluster): address convergence review findings for transfer contention model
    
    - invariants.md: fix stale Verification line for INV-P2-2 (claimed table-driven
      N=1,2,4 but test only verifies N=1; N>1 behavioral coverage is via BCP26)
    - deployment.go: remove stale PR1/PR2/Phase2 references from field group comments
    - cluster.go: fix activeTransfers warning comment to accurately describe when
      the condition is reachable (only via prior negative-guard correction, not
      horizon truncation as the old comment incorrectly stated)
    - transfer_contention_test.go: refactor MeanQueueDepthZeroTransfers from
      ClusterSimulator struct literal to production integration path (NewClusterSimulator
      + mustRun with 0 requests); add explanatory comment to MeanQueueDepthCalculation
    - docs/reference/configuration.md: add --pd-transfer-contention to PD flags
      table; clarify --pd-transfer-bandwidth "shared global fabric" semantics
    - docs/guide/results.md: document PeakConcurrentTransfers and MeanTransferQueueDepth
      metrics with accurate descriptions (including NOT-a-queue-depth caveat)
    - docs/guide/cluster.md: add "when to enable --pd-transfer-contention" guidance
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    * fix(sim/cluster): address PR review findings for transfer contention model
    
    Critical fixes:
    - R3: add int64 overflow guard for BlockSizeTokens * PDKVBytesPerToken in
      NewClusterSimulator; overflow previously silently clamped transfer duration
      to 1 µs for pathologically large parameter combinations
    - Clarify activeTransfers warning comment: INV-PD-3 catches horizon truncation
      first, so the post-run warning fires only on undetected bookkeeping imbalance
    
    Important fixes:
    - Add contentionBookkeepingCorrupted bool field; set by negative guard in
      KVTransferCompletedEvent; Run() returns error instead of delivering silently
      invalid contention metrics to callers
    - Add TestTransferContention_INVP22_N2FormulaExact: direct formula verification
      for N=2 (17 µs), complementing the existing N=1 test (9 µs), covering the
      increment-before-calculate ordering invariant
    - Add TestTransferContention_NegativeGuard_SetsCorruptionFlag: exercises the
      negative-guard path via in-package state manipulation, verifying the flag is
      set and activeTransfers is reset to 0
    - Strengthen TestTransferContention_BCP26_FairShareDivision assertion from
      >= mean-1 (lower-bound only) to > mean (strict), so the test fails if the
      contention branch is dead code
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    * fix(sim/cluster): address convergence review findings for PD transfer model
    
    - fix(pd_events): use float64 arithmetic for transfer bytes to eliminate
      int64 silent overflow (R11 — numBlocks * blockSizeBytes could wrap for
      extreme configs; float64 handles any realistic block count safely)
    - fix(docs): replace remaining meta-llama examples with qwen/qwen3-14b
      in latency-models.md, cluster.md, configuration.md (4 instances)
    - feat(docs): add Context Window Enforcement section to cluster.md
      documenting --max-model-len, auto-derivation, and KV-feasible capping
    - fix(docs): add DroppedAtDecodeKV row to results.md PD metrics table,
      cross-referencing stdout label vs struct field name
    - fix(test): add R7 companion invariant test TestPrintPDMetrics_Invariant
      (nil/non-nil duality law) in cmd/kv_metrics_output_test.go
    - fix(test): add R7 companion invariant test TestCollectPDMetrics_Invariant
      (LoadImbalanceRatio ≥ 1.0 law) in sim/cluster/pd_metrics_test.go
    - docs(resolve): document PoolOverrides pointer field contract for library
      callers constructing DeploymentConfig directly
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    * fix(sim/cluster): address PR review findings for PD transfer contention model
    
    - R3: add NaN/Inf guards to NewClusterSimulator for PDTransferBandwidthGBps
      and PDTransferBaseLatencyMs (library-level dual validation)
    - Compound contentionBookkeepingCorrupted into INV-PD-3 error when both
      conditions are true (prevents swallowing corruption signal)
    - Add logrus.Errorf in zero-bandwidth fallback branch (R1: no silent drops)
    - Add comment on contention/non-contention activeTransfers asymmetry
    - Test: end-to-end corruption flag → Run() error path
    - Test: duration floor for 0-block transfers (1 µs minimum)
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
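A NaN/Inf guard of the kind mentioned for PDTransferBandwidthGBps and PDTransferBaseLatencyMs might look like the sketch below. The helper name checkFinite and its error text are assumptions; only the standard library calls math.IsNaN and math.IsInf are certain.

```go
package main

import (
	"fmt"
	"math"
)

// checkFinite returns an error when v is NaN or ±Inf.
// Range checks (for example v > 0) would be separate and are not shown.
func checkFinite(name string, v float64) error {
	if math.IsNaN(v) || math.IsInf(v, 0) {
		return fmt.Errorf("%s must be a finite number, got %v", name, v)
	}
	return nil
}

func main() {
	for _, v := range []float64{25.0, math.NaN(), math.Inf(1)} {
		fmt.Println(checkFinite("PDTransferBandwidthGBps", v))
	}
}
```

A plain comparison such as v <= 0 does not catch NaN, because every ordered comparison with NaN is false. That is why the explicit check is needed.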
    
    * fix(sim/cluster): address convergence review findings for PD transfer contention
    
    F1: Reject --pd-transfer-contention when PD disaggregation is not active (R3).
    F2: Always print contention metrics when feature is enabled, even if zero.
    F3: Add R7 companion invariant tests for formula golden tests (divisor law,
        duration floor property, monotonicity).
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * fix(cmd): address convergence review findings for PD transfer contention
    
    - Add logrus.Warnf when max_position_embeddings is absent/zero in HF
      config during analytical backend auto-derivation; previously silent
    - Add NaN/Inf validation for trained-roofline beta/alpha coefficient
      arrays loaded from defaults.yaml (R20: allZeros passes NaN)
    - Add TestRunCmd_MaxModelLen_FlagRegistered flag registration test
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    * docs(guide): document cluster-wide bandwidth pool assumption for PD transfer contention
    
    Add a "Single shared bandwidth pool" admonition to the --pd-transfer-contention
    guidance in docs/guide/cluster.md. Clarifies that all concurrent KV transfers share
    one cluster-wide bandwidth budget regardless of which prefill/decode instance pair
    is involved, and advises users to set --pd-transfer-bandwidth to the aggregate shared
    capacity (not per-NIC bandwidth) for accurate contention modeling.
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    ---------
    
    Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
    namasl and claude authored Mar 13, 2026
    ffe4540

Commits on Mar 14, 2026

  1. feat(sim/cluster): prefill-decode interference model (PD Phase 2, PR3) (#647)
    
    * docs: add design spec for PD interference model (#635)
    
    Specifies the InterferenceLatencyModel wrapper that applies multiplicative
    slowdown to StepTime() based on batch phase composition for break-even
    analysis between disaggregation transfer cost and co-location interference.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * docs: add implementation plan for PD interference model (#635)
    
    5-task plan: wrapper + tests, injection plumbing, CLI flags,
    integration test, documentation updates. TDD throughout.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(sim/cluster): add InterferenceLatencyModel wrapper (#635)
    
    Tier composition wrapper that applies multiplicative slowdown to StepTime()
    based on batch phase composition. Satisfies BC-P2-9 through BC-P2-12 and
    INV-P2-3 (multiplier >= 1.0).
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(sim/cluster): wire interference model into instance construction (#635)
    
    Add PDInterferencePrefill/Decode to DeploymentConfig. Extract
    newInstanceSimulatorCore to wrap latency model when factors are non-zero.
    Public NewInstanceSimulator API unchanged (R4: 23 test call sites unaffected).
    
    Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    
    * feat(cmd): add --pd-interference-prefill/decode CLI flags (#635)
    
    Wire interference factors through CLI → DeploymentConfig → instance construction.
    Validated as finite non-negative (R3). Default 0 = no interference (BC-P2-9).
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * test(sim/cluster): add cluster integration test for interference model (#635)
    
    Verifies that non-zero interference factors produce longer simulation times
    and per-request E2E latencies compared to the zero-interference baseline.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * docs: update CLAUDE.md and design guidelines for interference model (#635)
    
    Add INV-P2-3 (interference monotonicity), CLI flags, file organization
    entry, and module map entry for InterferenceLatencyModel.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * fix(sim/cluster): convergence review fixes for interference model (#635)
    
    - Strengthen TestNewInstanceSimulatorCore_WrapsLatencyModel with behavioral
      assertions (mixed batch slows down, phase-pure unchanged, INV-P2-3 holds)
    - Add R3 validation for PDInterferencePrefill/Decode in NewClusterSimulator
    - Add MaxInterferenceFactor=100.0 exported constant; add upper-bound validation
      at all three layers (CLI, cluster constructor, factory) to prevent silent
      int64 overflow on degenerate inputs (R20)
    - Use single exported constant from all validation sites (DRY — no drift risk)
    - Add tie-break comment explaining conservative max-factor choice at equal split
    - Expand interference.go doc to document no-op in PD disaggregated mode
    - Update CLI help text with formula example (factor=0.5 → 1.25x at even split)
    - Add "Co-Location Interference Model" section to docs/guide/cluster.md
    - Add --pd-interference-prefill/decode rows to docs/reference/configuration.md
    - Add PD disaggregation mention to README features list
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    * fix(sim/cluster): address PR review findings for interference model
    
    - Move interference factor validation before instance construction in
      NewClusterSimulator so the authored error messages are reachable and
      no partial allocation occurs on invalid configs (Critical)
    
    - Add TestNewClusterSimulator_InvalidInterferenceFactors_Panics with
      8 table cases (negative, NaN, ±Inf, above-max for each field) to
      verify ClusterSimulator panic messages fire before any allocation
    
    - Fix "no-op at zero" comment in newInstanceSimulatorCore to accurately
      describe || semantics: "no-op only when both are zero"
    
    - Add 4 asymmetric-factor cases to TestInterferenceLatencyModel_StepTime
      verifying one-factor-zero behavior for each dominant phase
    
    - Add TestInterferenceModel_ClusterIntegration_INV_P2_3 as R7 invariant
      companion: verifies SimEndedTime and per-request E2E are non-decreasing
      under interference, with non-vacuity assertion
    
    - Fix R20 → R3 misattribution on MaxInterferenceFactor and
      NewInterferenceLatencyModel (R20 is for anomaly detectors; these are
      numeric parameter range guards, which is R3)
    
    - Fix "at most 51×" → "exactly 51×" on MaxInterferenceFactor comment
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
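The figures quoted in these commit messages (factor=0.5 gives 1.25x at an even split, and MaxInterferenceFactor=100 gives exactly 51x) are consistent with a multiplier of the form 1 + factor × minority_fraction. The sketch below assumes that form. It also assumes that the factor of the dominant phase is the one applied, and that a tie takes the larger factor. The repository's computeMultiplier may be written differently.

```go
package main

import "fmt"

// MaxInterferenceFactor is the upper bound named in the commit message.
const MaxInterferenceFactor = 100.0

// computeMultiplier returns a slowdown multiplier >= 1.0 based on the
// request-count mix of a batch. ASSUMED formula: 1 + factor * minorityFraction.
// ASSUMED pairing: the dominant phase selects which factor applies.
func computeMultiplier(prefillReqs, decodeReqs int, prefillFactor, decodeFactor float64) float64 {
	if prefillReqs <= 0 || decodeReqs <= 0 {
		return 1.0 // phase-pure (or empty) batch: no slowdown
	}
	total := prefillReqs + decodeReqs
	// Start from the prefill-dominant case.
	minority, factor := decodeReqs, prefillFactor
	switch {
	case decodeReqs > prefillReqs:
		minority, factor = prefillReqs, decodeFactor
	case decodeReqs == prefillReqs && decodeFactor > prefillFactor:
		factor = decodeFactor // equal split: conservative choice, larger factor
	}
	return 1.0 + factor*float64(minority)/float64(total)
}

func main() {
	fmt.Println(computeMultiplier(2, 2, 0.5, 0.5))
	fmt.Println(computeMultiplier(1, 1, MaxInterferenceFactor, MaxInterferenceFactor))
	fmt.Println(computeMultiplier(4, 0, 0.5, 0.5))
}
```

Since the minority fraction never exceeds one half, the multiplier under this form is bounded by 1 + factor/2, which is where the 51x ceiling comes from.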
    
    * fix(sim/cluster): convergence review fixes for interference model (#647)
    
    - Add non-vacuity guard to TestInterferenceModel_ClusterIntegration golden
      test (prevents vacuous pass when no requests complete)
    - Add INV-PD-1 defensive runtime check in DecodeRoutingEvent.Execute()
      (detects decode_enqueue < transfer_complete on priority-ordering regression)
    - Document request-count vs token-count approximation in computeMultiplier
      (calibration guidance for heterogeneous workloads)
    - Document intentional parentRequests map retention in cluster.go
      (clarifies never-pruned-by-design, bounded at <100K for typical sims)
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    * fix(sim/cluster): apply PR review fixes for interference model
    
    - Fix MaxInterferenceFactor comment: int64 overflow (not float64)
    - Fix cluster.go comment: MaxInt64 (not MaxFloat64)
    - Add R20 warning in NewClusterSimulator when interference factors are
      non-zero but deployment is fully disaggregated (no-op scenario)
    - Fix INV-PD-2 doc: qualify phase-purity claim for PrefixThresholdDecider
    - Fix StepTime doc: int64(len(InputTokens)) matches util.Len64 usage
    - Fix approximation note: use decode-dominant example to match label
    - Add overflow guard in StepTime before result<1 clamp (R1)
    - Add compile-time sim.LatencyModel interface assertion
    - Fix makeBatch: decode ProgressIndex=15 (>len, not just ==len boundary)
    - Add tied-split symmetry test case (reversed factors)
    - Add LastAppliedMultiplier atomicity test (3 consecutive calls)
    - Add PD mode no-op cluster test: BC-P2-10 at cluster level
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    ---------
    
    Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
    namasl and claude authored Mar 14, 2026
    6cb9a41
  2. feat: direct-to-decode decider for short-prompt bypass (PD Phase 2, PR4) (#650)
    
    * feat(sim): add DirectToDecodeDecider for short-prompt bypass
    
    Routes short prompts (len(InputTokens) < threshold) directly to the
    decode pool, skipping disaggregation. Long prompts continue through
    the full prefill→transfer→decode pipeline.
    
    BC-P2-18: len(InputTokens) < threshold → Disaggregate=false
    BC-P2-19: empty input → Disaggregate=false
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
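The two behaviors BC-P2-18 and BC-P2-19 can be sketched as below. The Request and Decision types, the threshold field, and the Decide signature are simplified assumptions; only the names DirectToDecodeDecider, InputTokens and Disaggregate come from the commit message.

```go
package main

import "fmt"

// Request and Decision are hypothetical, trimmed-down stand-ins.
type Request struct {
	InputTokens []int
}

type Decision struct {
	Disaggregate bool
}

// DirectToDecodeDecider bypasses disaggregation for short prompts.
type DirectToDecodeDecider struct {
	threshold int
}

// Decide returns Disaggregate=false for empty inputs (BC-P2-19) and for
// inputs shorter than the threshold (BC-P2-18). Everything else is sent
// through the prefill, transfer, decode pipeline.
func (d DirectToDecodeDecider) Decide(r Request) Decision {
	n := len(r.InputTokens)
	if n == 0 || n < d.threshold {
		return Decision{Disaggregate: false}
	}
	return Decision{Disaggregate: true}
}

func main() {
	d := DirectToDecodeDecider{threshold: 256}
	short := Request{InputTokens: make([]int, 100)}
	long := Request{InputTokens: make([]int, 1000)}
	fmt.Println(d.Decide(short), d.Decide(long), d.Decide(Request{}))
}
```

Note that with a threshold of 0 every non-empty input is disaggregated, while an empty input still is not.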
    
    * feat(sim): register direct-to-decode in decider bundle
    
    Adds "direct-to-decode" to validDisaggregationDeciders map and adds
    factory panic case directing callers to use NewDirectToDecodeDecider.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(sim/cluster): wire DirectToDecodeDecider in cluster constructor
    
    Adds PDDirectDecodeThreshold field to DeploymentConfig, switch-based
    decider construction in NewClusterSimulator, and updates interference
    warning to only fire for fully-disaggregated deployments (decider=always).
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(sim/cluster): pool-filtered routing for non-disaggregated requests (INV-P2-4a)
    
    When pools are configured and a request is not disaggregated, route it
    to the decode pool only via a new poolFilter field on RoutingDecisionEvent.
    Decode instances handle both phases with interference cost from PR3.
    
    BC-P2-14: non-disaggregated + pools → decode pool only
    BC-P2-15: non-disaggregated → no ParentRequest records
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * feat(cmd): add --pd-direct-decode-threshold CLI flag
    
    Registers the --pd-direct-decode-threshold flag (default 256) and wires
    it to DeploymentConfig. Includes validation (>= 0) and stale-flag
    warning when used with a different decider.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * test(sim/cluster): mixed-workload integration test for direct-to-decode
    
    Verifies short prompts route directly to decode pool while long prompts
    go through full PD pipeline (BC-P2-14, BC-P2-15, BC-P2-16, INV-1).
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * test(sim/cluster): invariant and backward-compat tests for direct-to-decode
    
    - INV-P2-4a: non-disaggregated + pools → decode pool (tested with never decider)
    - INV-P2-4b/BC-P2-17: interference applied to mixed-phase decode batches
    - INV-6: determinism for mixed short/long workloads
    - BC-P2-13: always-disaggregate behavior unchanged by pool filter
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * docs: update CLAUDE.md for direct-to-decode decider (PR4)
    
    - Add --pd-direct-decode-threshold to CLI flags listing
    - Add DirectToDecodeDecider to disaggregation.go file description
    - Add PDDirectDecodeThreshold to DeploymentConfig description
    - Add INV-P2-4 (decode-targeted routing) to invariants section
    - Update disaggregated data flow with direct-to-decode decider option
    - Update local path to show decode pool routing (INV-P2-4)
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * docs: add INV-P2-4, direct-to-decode config reference, and design guidelines update
    
    - invariants.md: add INV-P2-3 (interference monotonicity) and INV-P2-4
      (decode-targeted routing) with verification strategies
    - configuration.md: add direct-to-decode decider to table, semantics
      section, and --pd-direct-decode-threshold flag
    - design-guidelines.md: add DisaggregationDecider to Section 4.2 module
      map with all four variants
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    
    * review: fix test quality, snapshot efficiency, and docstring gaps
    
    - TestDirectToDecodeDecider_ClusterConstruction: replace nil check with
      behavioral assertion (decode pool routing) per BDD/TDD principles
    - TestDirectToDecodeDecider_MixedWorkload: tie INV-1 check to request
      slice lengths instead of hardcoded constant; clarify sub-request
      vs parent-request counting in comment
    - buildPoolFilteredSnapshots: only construct snapshots for target pool
      members; eliminates O(N/2) wasted Snapshot() calls per routing event
    - DisaggregationDecisionEvent.Execute: document no-pools fallback path
    - DirectToDecodeDecider.Decide: add empty-input note to method docstring
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    * review: safety guards, PoolOverrides validation, and test coverage gaps
    
    Addresses findings from comprehensive PR review:
    
    Safety (C1, C3, C6, C8):
    - C1: Add PDTransferBaseLatencyMs upper-bound guard (<=3.6e9 ms) in
      NewClusterSimulator to prevent int64 overflow in transfer duration
      calculation (silent 1 µs clamp bug)
    - C3: Add PoolOverrides.Validate(name) method enforcing *TP>0,
      *MaxModelLen>0, *TotalKVBlocks>0; call from NewClusterSimulator
      before instance construction (R3)
    - C6: Add BlockSizeTokens>0 check in NewClusterSimulator PD block so
      error appears at construction time, not as a panic inside Run() (R6)
    - C8: Add clarifying comments above unreachable panic sites in
      newInstanceSimulatorCore explaining why they're safe and what to do
      if the function is ever called outside NewClusterSimulator
    
    Documentation (C4, C5):
    - C4: Update NewDisaggregationDecider docstring to include
      "direct-to-decode" in panic description
    - C5: Update PDDecider field comment to include "direct-to-decode"
      in the valid-values enum
    
    Test coverage (GAP-1, GAP-2, GAP-3):
    - GAP-1: Add INV-P2-4 pool-membership assertion to
      TestPrefixThreshold_HighThresholdNoDisaggregation
    - GAP-2: Add TestCollectPDMetrics_DroppedAtDecodeKV table-driven unit
      test for the TransferCompleteTime>0 && DecodeInstanceID=="" condition
    - GAP-3: Add INV-P2-4 pool-membership assertion to
      TestDisaggregationDecisionEvent_SchedulesRouting
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    * review: convergence review fixes — O(K) detection, docs, and test coverage
    
    Performance: detectPrefillCompletions/detectDecodeCompletions O(N×M) → O(K)
    via pendingPrefillByInstance/pendingDecodeByInstance per-instance indexing.
    Pending maps cleared at simulation end to release memory for dropped/horizon-
    truncated requests.
    
    Validation: warn when prefill+decode < total instances (unassigned idle).
    
    Tests: TestDisaggregation_MaxModelLen_DropsOversizedRequests — exact drop count,
    CompletedRequests verification, and INV-1 conservation check.
    
    Docs: length_capped_requests anomaly counter; README "Four latency modes";
    quickstart PD examples; tutorial Step 8 PD break-even walkthrough;
    extension-recipes tier-composition pattern; pd-disaggregation-demo.yaml.
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    * review: nil guard, godoc hardening, validate tests, and boundary coverage
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    * review: convergence review fixes — per-pool maxModelLen cap, O(K) snapshot, contention accuracy, and docs
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    * review: nil guard, spurious warning fix, dead comment, topology guard, and warning accuracy
    
    - sim/disaggregation.go: add nil guard in DirectToDecodeDecider.Decide (panics with
      named message instead of NPE deep in event loop)
    - sim/cluster/cluster_event.go: fix spurious "unhandled pool role" warning — known
      roles without custom policies fall through silently; warning reserved for
      unrecognized roles only
    - sim/cluster/cluster_event.go: replace dead-code-path comment "When pools are NOT
      configured" with accurate "Defensive: always true here" note
    - cmd/root.go: fix misleading zero-threshold warning — qualifies "for non-empty
      inputs" and notes the empty-input exception
    - cmd/root.go: fix topology guard (&&→||) to catch the one-pool-zero case with the
      friendly "requires both" message
    
    Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    
    ---------
    
    Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
    namasl and claude authored Mar 14, 2026
    ce68bd4

Commits on Mar 16, 2026

  1. fix(sim): guard processCompletions against ProgressIndex overshoot in PD mode (#687)
    
    In PD disaggregation, a decode sub-request with 1 output token enters
    with ProgressIndex == inputLen. After one decode step, ProgressIndex
    becomes inputLen+1, which overshot the == completion check and allowed
    a second erroneous decode step that panicked in AllocateKVBlocks with
    index out of range.
    
    Two fixes:
    - Use >= instead of == in completion check (catches overshoot)
    - Add bounds guard on final-token AllocateKVBlocks (prevents OOB access)
    
    Adds regression test TestPDDisagg_OneOutputToken_NoPanic.
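The difference between the two completion checks can be shown with a small sketch. The helper names and the idea of a single completion index are assumptions; the example further assumes the completion index for a one-output-token request equals inputLen, which is what the description above implies.

```go
package main

import "fmt"

// doneStrict is the old-style check: it only fires when the progress
// index lands exactly on the completion index.
func doneStrict(progressIndex, completionIndex int) bool {
	return progressIndex == completionIndex
}

// doneGuarded also fires when the progress index has moved past the
// completion index, so an overshoot cannot trigger an extra step.
func doneGuarded(progressIndex, completionIndex int) bool {
	return progressIndex >= completionIndex
}

func main() {
	inputLen := 8
	completionIndex := inputLen // ASSUMED for a request with 1 output token

	// A decode sub-request enters with ProgressIndex == inputLen,
	// then one decode step advances it by one.
	progress := inputLen + 1

	fmt.Println("strict:", doneStrict(progress, completionIndex))
	fmt.Println("guarded:", doneGuarded(progress, completionIndex))
}
```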
    
    Co-authored-by: Claude Sonnet 4.6 <[email protected]>
    namasl and claude authored Mar 16, 2026
    3ac11fb