
fix(autoscaler): termination guard for request-bounded runs (Bug 2) #1024

Merged
sriumcp merged 10 commits into inference-sim:main from vishakha-ramani:fix/autoscaler-bug2-termination-guard
Apr 14, 2026

Conversation


@vishakha-ramani vishakha-ramani commented Apr 10, 2026

Summary

Fixes #1022 — an autoscaler-enabled blis run with any request-bounded workload hung indefinitely at 100%+ CPU.

Part of tracking issue #994.

Merge order: This PR should land first. PR for Bug 3 (#1023) is stacked on top of this branch.


What this PR contains

Wiring (prerequisite, shared with other bug PRs)

Three commits that wire the autoscaler pipeline into the blis run CLI:

  • feat(bundle): add node_pools and autoscaler sections to PolicyBundle
  • feat(cluster): wire default autoscaler pipeline in NewClusterSimulator
  • feat(cmd): wire --model-autoscaler-interval-us flag and policy-config → cluster config translation

Bug 2 fix

scheduleNextTick in sim/cluster/autoscaler.go unconditionally pushed the next ScalingTickEvent at nowUs + intervalUs. On request-bounded runs (Horizon == math.MaxInt64), the DES loop terminates only when the event queue is empty — but ScalingTickEvent kept the queue alive forever.

Fix: Stop self-scheduling when Horizon == math.MaxInt64 && pendingArrivals == 0 && inFlight == 0.

New field pendingArrivals int on ClusterSimulator tracks the count of ClusterArrivalEvents in the queue at any moment:

  • Incremented in Run() per initial arrival push
  • Decremented in ClusterArrivalEvent.Execute()

Note: This fix alone is sufficient for workloads without batch/background SLO-class requests. The companion Bug 3 PR handles the case where promoteDeferred() corrupts the counter for batch workloads.

Wiring improvements (reviewed once here, shared basis for Bug 3 and Bug 1 PRs)

  • bundle.Validate(): add range validation for AnalyzerBundleConfig thresholds — catches invalid scale_up_threshold, scale_down_boundary, avg_input_tokens, kv_cache_threshold at the CLI boundary (R3) before they reach NewV2SaturationAnalyzer panics
  • resolvePolicies() refactored to return (*PolicyBundle, []ScorerConfig) — eliminates redundant second LoadPolicyBundle call in runCmd.RunE
  • .gitignore: add docs/superpowers/ entry

Test

TestAutoscaler_RequestBoundedRun_Terminates — creates a cluster with ModelAutoscalerIntervalUs = 100_000, runs 10 requests, asserts Run() completes within 2s (goroutine + time.After guard).

TestPolicyBundle_Validate_AnalyzerThresholds — 9-case table test for analyzer config validation.


Test plan

  • go test ./sim/... ./cmd/... passes
  • go build ./... succeeds
  • TestAutoscaler_RequestBoundedRun_Terminates passes and completes in < 2s
  • TestPolicyBundle_Validate_AnalyzerThresholds — all 9 cases pass

🤖 Generated with Claude Code

vishakha-ramani and others added 4 commits April 10, 2026 18:19
Add AutoscalerAnalyzerConfig to DeploymentConfig and effectiveAnalyzerConfig()
helper for WVA reference defaults. NewClusterSimulator now wires
DefaultCollector+V2SaturationAnalyzer+UnlimitedEngine+DirectActuator when
ModelAutoscalerIntervalUs > 0, so cmd/ only sets config fields.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
… bundle conversion

Reads node_pools and autoscaler sections from --policy-config YAML and
converts them into DeploymentConfig. CLI flag --model-autoscaler-interval-us
overrides the YAML value when explicitly set.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…improvements

Bug 2 — scheduleNextTick had no termination guard (autoscaler.go):
ScalingTickEvent unconditionally rescheduled itself. On request-bounded
runs (Horizon == MaxInt64), the cluster event queue never drained →
infinite event loop at 100%+ CPU. Added guard: stop self-scheduling when
Horizon == MaxInt64 && pendingArrivals == 0 && inFlight == 0.

Added pendingArrivals int field to ClusterSimulator, incremented in
Run() per arrival push, decremented in ClusterArrivalEvent.Execute().
Invariant: pendingArrivals == count of ClusterArrivalEvents in event queue.

Closes inference-sim#1022

Also includes:
- AnalyzerBundleConfig range validation in bundle.Validate() (R3: catch
  invalid thresholds at CLI boundary before NewV2SaturationAnalyzer panics)
- resolvePolicies() refactored to return (*PolicyBundle, []ScorerConfig)
  eliminating redundant second LoadPolicyBundle call in runCmd.RunE
- .gitignore: add docs/superpowers/ entry (agent-local artifacts)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
vishakha-ramani and others added 2 commits April 10, 2026 18:47
Fix comment: assertINV1Conservation checks 5 of 8 INV-1 terms (not 9 —
DeferredHorizonInterrupted is not a Metrics field). Keep autoscaler
wiring tests and Bug 2 regression test added by this branch.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@vishakha-ramani (Contributor Author):

@claude /pr-review-toolkit:review-pr Please perform a thorough review of this PR with respect to both the original issue and its tracking parent issue.

Evaluate whether the PR fully and correctly addresses all requirements and reviewer concerns. In particular, assess:

Correctness and preservation of invariants
Separation of concerns and overall design discipline
Modularity and clarity of API boundaries/contracts
Behavioral integrity, including both behavioral and non-structural tests
Test coverage and quality (not just structure, but meaningful validation of behavior)
Performance implications and potential regressions
Adherence to our @docs/contributing/standards,
Documentation quality, completeness, and accuracy (both user-facing and developer-facing)
All reviews and comments in this PR are addressed
Also identify any risks, edge cases, or missing considerations.

Finally, provide a clear verdict: Is this PR ready to merge? If not, what specific changes are required?


claude Bot commented Apr 11, 2026

Claude finished @vishakha-ramani's task in 14m 13s


PR Review: fix(autoscaler): termination guard for request-bounded runs (Bug 2)

Tasks

  • Initialize review comment
  • Gather context (read changed files, related issues, standards docs)
  • Run code review agent
  • Run test coverage agent
  • Run error handling / silent failure agent
  • Synthesize findings and post verdict

Overview

The core termination guard logic is sound for the stated scope (simple request-bounded, non-session, non-PD workloads), and the bundle validation, default pipeline wiring, and resolvePolicies refactor are all well-structured. However, there are two correctness bugs that must be resolved before merging, one test gap that should be addressed, and several smaller issues.


Critical — Must Fix Before Merge

1. pendingArrivals counter underflows for session / PD / drain-redirect workloads

pendingArrivals is incremented only in Run() at cluster.go:516, one increment per initial request. However there are six additional sites that push a ClusterArrivalEvent without a matching increment:

| File | Line | Context |
|---|---|---|
| cluster.go | 444 | Session follow-up requests from OnRequestDone closure |
| cluster.go | 895 | PD session follow-ups in detectDecodeCompletions |
| cluster.go | 920 | PD decode-timeout session follow-ups |
| cluster.go | 1073 | promoteDeferred() re-injection |
| infra_lifecycle_event.go | 113 | NodeReadyEvent follow-ups |
| infra_lifecycle_event.go | 334 | Drain-redirect re-injection |

Every execution of these ClusterArrivalEvents decrements pendingArrivals at cluster_event.go:108 without a preceding increment. In session-mode runs the counter goes negative; the guard cs.pendingArrivals == 0 is never satisfied; the original infinite-loop bug reappears for any workload combining a request-bounded horizon with sessions, PD disaggregation, or drain-redirects and an enabled autoscaler.

The PR description scopes this fix to "workloads without batch/background SLO-class requests" but does not mention this broader class of affected workloads. This is a silent regression for those combinations.

Minimal fix — change the guard from == 0 to <= 0, or add a warn-and-clamp to Execute() (matching the existing inFlightRequests underflow guard pattern):

// cluster_event.go:Execute
cs.pendingArrivals--
if cs.pendingArrivals < 0 {
    logrus.Warnf("[cluster] pendingArrivals went negative (%d) — session/PD/drain re-injection without increment; autoscaler guard will use <= 0 semantics", cs.pendingArrivals)
}

and update the guard in autoscaler.go:287 to cs.pendingArrivals <= 0.

Full fix — add cs.pendingArrivals++ at all six push sites for complete bookkeeping correctness.


2. Bundle cross-validation gap allows runtime panic from one-sided threshold override (R3 violation)

bundle.go:263–265 cross-checks the threshold pair only when both are explicitly non-zero:

if a.ScaleUpThreshold != 0 && a.ScaleDownBoundary != 0 && a.ScaleDownBoundary >= a.ScaleUpThreshold {

A user who writes scale_down_boundary: 0.9 and leaves scale_up_threshold unset (zero = use default 0.8) passes Validate(). Then effectiveAnalyzerConfig fills in ScaleUpThreshold = 0.8, and NewV2SaturationAnalyzer panics because 0.9 >= 0.8. This violates R3 (validate at CLI boundary). The same problem exists in the reverse direction (scale_up_threshold: 0.3 with default scale_down_boundary = 0.4).

Fix: in Validate(), substitute known defaults before the cross-check:

effectiveUp := a.ScaleUpThreshold
if effectiveUp == 0 { effectiveUp = 0.8 }
effectiveDown := a.ScaleDownBoundary
if effectiveDown == 0 { effectiveDown = 0.4 }
if effectiveDown >= effectiveUp {
    return fmt.Errorf("autoscaler.analyzer.scale_down_boundary (%v) must be < effective scale_up_threshold (%v)", effectiveDown, effectiveUp)
}

Also add a test case to TestPolicyBundle_Validate_AnalyzerThresholds:

{"scale_down_boundary above default scale_up_threshold", AnalyzerBundleConfig{ScaleDownBoundary: 0.9}, true},
{"scale_up_threshold below default scale_down_boundary", AnalyzerBundleConfig{ScaleUpThreshold: 0.3}, true},

Important — Should Fix

3. TestAutoscaler_RequestBoundedRun_Terminates does not verify INV-1 conservation (cluster_test.go:2601)

The test confirms Run() exits without error within 2 s but does not assert that all 10 requests completed. A future off-by-one in pendingArrivals bookkeeping could cause premature termination with silent data loss while the test still passes. Add after <-done:

agg := cs.AggregatedMetrics()
if agg.CompletedRequests != 10 {
    t.Errorf("INV-1: expected 10 completed, got %d (queued=%d, running=%d)",
        agg.CompletedRequests, agg.StillQueued, agg.StillRunning)
}

4. Stale comment in cluster.go (~line 348–350) says pipeline components "are injected by tests or cmd/ after construction." That is no longer true — the default WVA pipeline is now wired inside the constructor. Update the comment to reflect that tests needing custom components replace cs.autoscaler after construction.

5. Stale newAutoscalerPipeline doc comment (autoscaler.go:197) says "Components may be nil at construction time — they are injected by tests or cmd/ before Run()." Same issue as above; misleads tests that rely on the old nil-at-construction contract.

6. Bundle node pool validation is incomplete (bundle.go:267–284)

NodePoolConfig.IsValid() in infra_config.go checks several conditions that the bundle validator omits:

  • min_nodes >= 0
  • min_nodes <= max_nodes (bundle only checks max >= initial, not max >= min)
  • initial_nodes >= 0
  • NaN/Inf guards on gpu_memory_gib, provisioning_delay.mean/.stddev, cost_per_hour
  • cost_per_hour >= 0

A user passing min_nodes: -5 or cost_per_hour: -10 in a bundle YAML passes Validate() and then hits an uncontrolled panic in NewPlacementManager rather than a clean logrus.Fatalf. Align the bundle validator with the full IsValid() check set.

7. No warning when replay silently drops autoscaler/node_pools bundle sections (replay.go:166)

parsedScorerConfigs, _ := resolvePolicies(cmd)

If a user passes --policy-config containing an autoscaler or node_pools section to blis replay, those sections are silently ignored. Since resolvePolicies does apply admission, routing, priority, scheduler, tenant_budgets, and slo_priorities from the same bundle, the user sees partial application with no diagnostic. A logrus.Warnf when bundle.Autoscaler.IntervalUs > 0 or len(bundle.NodePools) > 0 would surface the mismatch.


Minor — Suggestions

8. Double blank line (cluster.go:383–384) — two consecutive empty lines between the autoscaler wiring block and TenantTracker init. Go convention allows at most one.

9. Structural wiring test (cluster_test.go:2574–2585) accesses unexported autoscaler.collector/.analyzer/.engine/.actuator fields directly. Per CLAUDE.md BDD/TDD "refactor survival" principle, this test breaks if the pipeline struct is reorganized even when behavior is preserved. The behavioral companion TestAutoscaler_RequestBoundedRun_Terminates is the authoritative test; the wiring test is acceptable as a same-package smoke-test but should be recognized as structural.

10. Zero cooldowns silently used when the autoscaler is enabled via CLI flag without a bundle. Passing --model-autoscaler-interval-us without a bundle YAML results in ScaleUpCooldownUs == 0 and ScaleDownCooldownUs == 0, which means every tick fires a scale decision with no cooldown. A log line noting the zero-cooldown state would help users diagnose oscillating scale behavior.

11. Missing effectiveAnalyzerConfig unit test — the function fills zero fields with WVA reference defaults (KvCacheThreshold=0.8, ScaleUpThreshold=0.8, ScaleDownBoundary=0.4, AvgInputTokens=512). A change to any of these constants would silently pass bundle.Validate() but panic inside NewV2SaturationAnalyzer. A TestEffectiveAnalyzerConfig_ZeroFields_GetDefaults in cluster_test.go would close this.


Positive Observations

  • The termination guard logic in scheduleNextTick is correct and well-commented for the primary use case.
  • effectiveAnalyzerConfig cleanly separates "configure from zero" from "validate" responsibilities.
  • The TestLoadPolicyBundle_AutoscalerAbsent_IsZero backward-compatibility test is exactly right — existing YAML files without autoscaler sections must parse cleanly.
  • TestPolicyBundle_Validate_AnalyzerThresholds with 9 table-driven cases is idiomatic Go and covers the main boundary conditions.
  • TestLoadPolicyBundle_NodePoolsSection validating nested ProvisioningDelay deserialization is good defensive testing.
  • resolvePolicies refactor cleanly eliminates a duplicate LoadPolicyBundle call.
  • The goroutine + time.After pattern in TestAutoscaler_RequestBoundedRun_Terminates is the correct way to catch infinite loops without hanging CI.
  • R6 compliant — no logrus.Fatalf calls added to sim/ library code.

Verdict: Not Ready to Merge

Two correctness bugs and one test gap must be fixed:

  1. pendingArrivals underflow (cluster_event.go:108, six push sites without increment) — causes the original infinite-loop bug to reappear for autoscaler-enabled session/PD/drain workloads. Fix: <= 0 guard or warn-and-clamp.
  2. Bundle cross-validation gap (bundle.go:263) — one-sided threshold overrides can pass Validate() but cause a NewV2SaturationAnalyzer panic at runtime. Fix: substitute effective defaults before the cross-check.
  3. INV-1 not verified in regression test (cluster_test.go:2601) — required to prove the guard fires correctly, not just that the run terminates early.

Items 4–7 (stale comments, incomplete node pool validation, silent replay drop) are strong recommendations. Items 8–11 can follow as cleanup.

Critical fixes:
- Add pendingArrivals++ at 5 missing ClusterArrivalEvent push sites
  (session follow-ups in cluster.go, PD callbacks in detectDecodeCompletions,
  NodeReadyEvent callbacks in infra_lifecycle_event.go, drain-redirect
  re-injection) — prevents counter underflow and autoscaler infinite loop
  for session/PD/drain workloads
- Change scheduleNextTick guard from == 0 to <= 0 as safety net
- Add INV-1 (CompletedRequests) assertion to TestAutoscaler_RequestBoundedRun_Terminates
- Fix bundle cross-validation: substitute effective defaults before
  scale_down_boundary vs scale_up_threshold cross-check (R3 — one-sided
  override with default on the other side would panic at runtime)

Important fixes:
- Delegate node pool bundle validation to full inline check set mirroring
  NodePoolConfig.IsValid() (min_nodes, initial_nodes, NaN/Inf guards,
  cost_per_hour, min_nodes <= max_nodes)
- Warn in replay when bundle has autoscaler/node_pools sections (silent drop)
- Update stale cluster.go comment: default WVA pipeline wired in constructor
- Update stale newAutoscalerPipeline doc comment

Minor:
- Remove double blank line in cluster.go
- Add two one-sided-override test cases to TestPolicyBundle_Validate_AnalyzerThresholds

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@vishakha-ramani (Contributor Author):

@claude /pr-review-toolkit:review-pr Please perform a thorough review of this PR with respect to both the original issue and its tracking parent issue.

Evaluate whether the PR fully and correctly addresses all requirements and reviewer concerns. In particular, assess:

Correctness and preservation of invariants
Separation of concerns and overall design discipline
Modularity and clarity of API boundaries/contracts
Behavioral integrity, including both behavioral and non-structural tests
Test coverage and quality (not just structure, but meaningful validation of behavior)
Performance implications and potential regressions
Adherence to our @docs/contributing/standards,
Documentation quality, completeness, and accuracy (both user-facing and developer-facing)
All reviews and comments in this PR are addressed
Also identify any risks, edge cases, or missing considerations.

Finally, provide a clear verdict: Is this PR ready to merge? If not, what specific changes are required?


claude Bot commented Apr 12, 2026

Claude finished @vishakha-ramani's task in 6m 0s


PR Review: fix(autoscaler): termination guard for request-bounded runs (Bug 2)

Tasks

  • Initialize review comment
  • Check what changed since previous review (git log, diff)
  • Review the changed files for correctness
  • Verify CI status
  • Synthesize findings and post verdict

Context

This review covers the full PR diff against origin/main, with particular attention to the fix(autoscaler): address PR review comments commit (a3529ee) which addressed all findings from the prior review. CI is green (1 run, success).


Previous Critical Issues: All Resolved ✓

1. pendingArrivals underflow for session/PD/drain workloads

All 5 active push sites now have cs.pendingArrivals++:

  • cluster.go:443 — session follow-ups in OnRequestDone closure
  • cluster.go:895 — PD session follow-ups in detectDecodeCompletions
  • cluster.go:921 — PD decode-timeout follow-ups
  • infra_lifecycle_event.go:113 — NodeReadyEvent session re-injection
  • infra_lifecycle_event.go:334 — drain-redirect re-injection

The promoteDeferred() function at cluster.go:1071 still lacks the increment but is //nolint:unused with no call sites — dead code, not a concern.

The guard in scheduleNextTick now uses cs.pendingArrivals <= 0 (changed from == 0), providing an additional safety net for any future missed push site.

2. Bundle cross-validation gap (one-sided threshold override)

bundle.go:263–272 now substitutes effective defaults (0.8 for ScaleUpThreshold, 0.4 for ScaleDownBoundary) before the cross-check — matching effectiveAnalyzerConfig in cluster.go. The two previously-missing one-sided test cases are present:

{"scale_down_boundary above default scale_up_threshold (one-sided)", AnalyzerBundleConfig{ScaleDownBoundary: 0.9}, true},
{"scale_up_threshold below default scale_down_boundary (one-sided)", AnalyzerBundleConfig{ScaleUpThreshold: 0.3}, true},

3. INV-1 not verified in regression test

TestAutoscaler_RequestBoundedRun_Terminates now asserts agg.CompletedRequests == len(reqs) after Run() exits, preventing false-pass on premature termination.


Previous Important Issues: All Resolved ✓

| Issue | Status |
|---|---|
| Stale cluster.go comment (pipeline wired in constructor, not injected after) | Fixed |
| Stale newAutoscalerPipeline doc comment | Fixed |
| Node pool bundle validation incomplete (min_nodes, initial_nodes, NaN/Inf, cost) | Fixed — full inline check set mirrors NodePoolConfig.IsValid() |
| Replay silent drop of autoscaler/node_pools bundle sections | Fixed — logrus.Warnf for both cases in replay.go |
| Double blank line in cluster.go | Fixed |

Remaining Minor Observations

1. No regression test for autoscaler + session-mode + request-bounded horizon

TestAutoscaler_RequestBoundedRun_Terminates uses 10 plain (non-session) requests. The new pendingArrivals++ in the OnRequestDone closure (cluster.go:443) is not exercised. A future push site added without an increment would go undetected. The <= 0 guard is a safety net, not a test. Considered low risk for this PR but worth a follow-up test.

2. Missing test case for min_nodes > max_nodes in node pool validation

bundle.go:304 validates MinNodes > MaxNodes, but no test covers this path. TestPolicyBundle_Validate_NodePool_MaxLessThanInitial covers the sibling check (MaxNodes < InitialNodes) — a mirror test for the min/max relationship would close the gap cleanly.

3. Duplicate default constants between bundle.go and cluster.go

Both files hardcode 0.8/0.4 as the effective defaults for ScaleUpThreshold/ScaleDownBoundary. The PR adds comments marking these as "mirrors effectiveAnalyzerConfig," which is good. A named constant shared between the two packages would be cleaner but requires an API decision. Acceptable for now.

4. No log line when autoscaler is enabled via CLI flag without a bundle

When --model-autoscaler-interval-us > 0 is passed without --policy-config, cooldowns default to 0 with no diagnostic. The previous review noted this; it was not addressed. Very low severity since 0-cooldown is valid behavior (every tick fires), but could confuse users debugging oscillating scale. Nice-to-have for follow-up.

5. No unit test for effectiveAnalyzerConfig zero→default behavior

If the default constants in effectiveAnalyzerConfig are changed, NewV2SaturationAnalyzer could panic for zero-config callers while all tests pass. Low risk since the function is 8 lines, but TestEffectiveAnalyzerConfig_ZeroFields_GetDefaults would close this. Previously raised; still not added.


Positive Observations

  • The <= 0 guard in scheduleNextTick is good defensive engineering — it absorbs any missed push site rather than hanging indefinitely.
  • effectiveAnalyzerConfig cleanly separates zero→default filling from validation; the caller contract (zero = use default) is clearly documented.
  • Default magic numbers in bundle.go are commented with // WVA reference default (mirrors effectiveAnalyzerConfig) — this documents the synchronization requirement explicitly.
  • The TestNewClusterSimulator_AutoscalerWiredWhenEnabled test verifies all four pipeline components are non-nil after construction — good structural test to confirm the wiring refactor.
  • All 11 new bundle tests (autoscaler section, node pools, backward-compat, analyzer thresholds with one-sided cases) follow idiomatic table-driven Go patterns.
  • R6 compliant: no new logrus.Fatalf in sim/ library code.
  • resolvePolicies refactor cleanly eliminates the duplicate LoadPolicyBundle call without changing any observable behavior.

Verdict: Ready to Merge

All critical issues from the previous review are resolved. All important issues are resolved. CI is green. The remaining items (observations 1–5 above) are genuine improvements but none are blockers. They are appropriate candidates for follow-up issues or the Bug 3 PR if already in context.


sriumcp commented Apr 13, 2026

@claude What is this PR fixing? Will the fix work for both non-PD-disaggregated and PD-disaggregated scenarios? Is it respecting the module boundaries and separation of concerns in the project?


claude Bot commented Apr 13, 2026

Claude finished @sriumcp's task in 2m 19s


Analysis: PR #1024 — What It Fixes, PD Coverage, and Module Boundaries

Tasks

  • Initialize comment
  • Read key changed files (autoscaler.go, cluster.go, cluster_event.go, infra_lifecycle_event.go, bundle.go)
  • Analyze PD-disaggregated paths
  • Assess module boundary compliance
  • Post findings

1. What is this PR fixing?

Root cause: In sim/cluster/autoscaler.go, the scheduleNextTick method unconditionally pushed the next ScalingTickEvent onto the cluster event queue after every autoscaler tick — it had no exit condition. On a request-bounded run (no explicit --horizon, so Horizon == math.MaxInt64), the DES event loop terminates only when the event queue is empty. But the autoscaler's self-scheduling ScalingTickEvent kept the queue alive indefinitely after all requests finished, causing an infinite loop at 100%+ CPU.

The fix (autoscaler.go:290–298) adds a termination guard to scheduleNextTick:

if cs.config.Horizon == math.MaxInt64 && cs.pendingArrivals <= 0 {
    var inFlight int
    for _, v := range cs.inFlightRequests { inFlight += v }
    if inFlight == 0 {
        return // no more work; don't self-schedule
    }
}

To make this work, a new pendingArrivals int field tracks the number of ClusterArrivalEvents currently in the queue at any moment. It is:

  • Incremented in Run() for each initial arrival pushed (cluster.go:516)
  • Decremented in ClusterArrivalEvent.Execute() when each arrival is processed (cluster_event.go:108)

The PR also bundles prerequisite wiring (policy bundle + autoscaler fields, cluster constructor, CLI flag), bundle validation for autoscaler/node-pool config, and a resolvePolicies refactor.


2. Does the fix work for both non-PD and PD-disaggregated scenarios?

Short answer: Yes, correctly for both paths. Here is how each path is covered:

Non-PD (standard) path

Requests arrive via ClusterArrivalEvent → admission → routing → instance. Session follow-ups are pushed from the OnRequestDone closure in NewClusterSimulator. This push site has pendingArrivals++ (cluster.go:446), so the counter stays consistent.

PD-disaggregated path

When a request is disaggregated, the flow is:

  1. ClusterArrivalEvent.Execute() fires → pendingArrivals-- (one decrement for the arrival)
  2. Request flows to DisaggregationDecisionEvent → PrefillRoutingEvent (not a new ClusterArrivalEvent — no second decrement)
  3. Prefill completes → decode sub-request routed within the instance layer
  4. Decode completes → detectDecodeCompletions() may push follow-up ClusterArrivalEvents for session workloads

The two PD-specific push sites in detectDecodeCompletions both have matching increments:

  • cluster.go:898 — PD session follow-ups: cs.pendingArrivals++
  • cluster.go:924 — PD decode-timeout session follow-ups: cs.pendingArrivals++

For non-session PD workloads (most common), detectDecodeCompletions pushes no new ClusterArrivalEvents, so pendingArrivals correctly drains to ≤0 after all initial arrivals execute. The inFlight half of the guard (tracking both prefill and decode sub-requests routed through RoutingDecisionEvent) keeps the autoscaler ticking while PD work is actually in progress, then stops it cleanly when both conditions are met.

Infrastructure events (drain, NodeReadyEvent)

  • infra_lifecycle_event.go:116 — session follow-ups from newly-ready nodes: pendingArrivals++
  • infra_lifecycle_event.go:338 — drain-redirected requests: pendingArrivals++

Both covered.

Dead code note

promoteDeferred() at cluster.go:1071 pushes ClusterArrivalEvents without a pendingArrivals++, but it is marked //nolint:unused with zero call sites — it's dead code that carries no execution risk. If it is ever reactivated, the increment would need to be added.

Safety net

The guard uses <= 0 (not == 0), which absorbs any future push site that is accidentally added without an increment rather than looping forever.


3. Does the PR respect module boundaries and separation of concerns?

Yes, with one minor observation.

The project's module boundary rule is cmd/ → sim/cluster/ → sim/ (no reverse dependencies). Here is how each changed component aligns:

| Component | Location | Verdict |
|---|---|---|
| scheduleNextTick guard + pendingArrivals | sim/cluster/ | Correctly cluster-internal; not exposed to cmd/ |
| autoscalerPipeline struct + newAutoscalerPipeline | sim/cluster/autoscaler.go | Private, correct constructor pattern (R4) |
| Default WVA pipeline wiring | NewClusterSimulator in sim/cluster/cluster.go | Wired at construction, not injected post-hoc — correct (R4) |
| bundle.Validate() for autoscaler/node-pool thresholds | sim/bundle.go | Validates at the sim/ layer boundary (R3); available to both cmd/ and tests |
| resolvePolicies() refactor + autoscaler config translation | cmd/root.go | CLI-to-config translation lives entirely in cmd/ — correct |
| --model-autoscaler-interval-us flag | cmd/root.go | CLI entry point only; config passed down as ClusterConfig.ModelAutoscalerIntervalUs — no cmd/ dependency in sim/ |
| Replay warning for ignored autoscaler sections | cmd/replay.go | Diagnostic in cmd/; sim/cluster/ never knows replay dropped the sections — correct |

Minor observation: The pendingArrivals++ increments are scattered across 6 sites in two files (cluster.go lines 446, 898, 924, and infra_lifecycle_event.go lines 116, 338) plus Run() at line 516. This is a necessary consequence of the event-driven architecture — every push site must maintain the invariant locally. The pattern is consistent and each site has a comment explaining the co-change requirement (e.g., // mirror Run() — session follow-ups must be tracked). The <= 0 guard provides a backstop. This is a reasonable design tradeoff given the DES architecture.


Summary

This PR correctly fixes the hang for all request-bounded workloads, including PD-disaggregated and session workloads. The autoscaler terminates when there are no more pending arrivals and no in-flight requests on any instance — covering both simple and disaggregated request lifecycles. Module boundaries are respected: the termination logic is fully encapsulated in sim/cluster/, validation at the sim/ boundary, CLI wiring in cmd/.


sriumcp commented Apr 13, 2026

@claude you mentioned

Minor observation: The pendingArrivals++ increments are scattered across 6 sites in two files (cluster.go lines 446, 898, 924, and infra_lifecycle_event.go lines 116, 338) plus Run() at line 516. This is a necessary consequence of the event-driven architecture — every push site must maintain the invariant locally. The pattern is consistent and each site has a comment explaining the co-change requirement (e.g., // mirror Run() — session follow-ups must be tracked). The <= 0 guard provides a backstop. This is a reasonable design tradeoff given the DES architecture.

Is there a design fix for this, and also other things like this so that variables like this don't get scattered throughout the codebase but can be updated and maintained in a cleaner way.


claude Bot commented Apr 13, 2026

Claude finished @sriumcp's task in 2m 30s


Design Fix for Scattered Counter Sites

Tasks

  • Initialize comment
  • Read relevant files and trace all scatter sites
  • Analyze the scattered counter pattern and identify the canonical fix
  • Propose clean design alternatives with tradeoffs
  • Generalize to other BLIS patterns

The Core Problem

The pendingArrivals++ co-requirement is an example of a co-invariant operation: every time operation X happens (push a ClusterArrivalEvent), side-effect Y must also happen (increment the counter). The current code enforces this with comments at each call site rather than with structure. Comments don't compile.

There are currently 6 push sites across 2 files, each with a "// mirror Run() —" reminder. Every future push site (e.g., if promoteDeferred is reactivated, or a new session variant is added) requires the developer to know the co-requirement exists and remember to apply it.


The Fix: A Canonical Push Method

Add a single method to ClusterSimulator that owns the push-and-increment invariant:

// pushArrival enqueues a ClusterArrivalEvent and increments pendingArrivals atomically.
// All ClusterArrivalEvent pushes MUST go through this method — it is the single
// enforcement point for the pendingArrivals invariant used by scheduleNextTick.
func (cs *ClusterSimulator) pushArrival(req *sim.Request, timeUs int64) {
    heap.Push(&cs.clusterEvents, clusterEventEntry{
        event: &ClusterArrivalEvent{time: timeUs, request: req},
        seqID: cs.nextSeqID(),
    })
    cs.pendingArrivals++
}

Every push site in cluster.go, infra_lifecycle_event.go, and any future file collapses to:

// before (6 sites, each with a co-change comment):
heap.Push(&cs.clusterEvents, clusterEventEntry{
    event: &ClusterArrivalEvent{time: next.ArrivalTime, request: next},
    seqID: cs.nextSeqID(),
})
cs.pendingArrivals++ // mirror Run() — session follow-ups must be tracked

// after (1 method, no comment needed):
cs.pushArrival(next, next.ArrivalTime)

The decrement stays in ClusterArrivalEvent.Execute() — that's already a single, canonical location (every arrival executes exactly once).

This is the direct application of R4 (canonical constructor) extended to operations: just as R4 says "struct literals in exactly one place," the canonical operation pattern says "co-invariant mutations in exactly one method." Any future push site that bypasses pushArrival will be obviously wrong during code review, without relying on the reviewer knowing about pendingArrivals.


The General Principle

The pendingArrivals scatter is one instance of a recurring pattern in event-driven simulators. The general rule is:

When a state mutation must always accompany a data-structure operation, wrap both in a single method — never scatter the mutation across K call sites with co-change reminder comments.

Other places in BLIS where this already applies well or could be tightened:

| Pattern | Current state | Status |
| --- | --- | --- |
| pendingArrivals++ with heap.Push(ClusterArrivalEvent) | 6 scattered sites + comments | Fix with pushArrival |
| inFlightRequests[id]++ with routing dispatch | Single increment site (cluster_event.go:297), decrement in one place (cluster.go:578) | Already clean (only one increment site) |
| shedByTier[tier]++ on rejection | After this PR: correctly guarded in one block in AdmissionDecisionEvent.Execute | Already clean |
| cs.nextSeqID() required with every heap.Push | Repeated at every push site | pushArrival absorbs this too for arrival events |

Broader Applicability: When to Apply This Pattern

This is worth capturing as a standard. Any time you find yourself writing:

// (caller must also do X when doing Y)
// mirror <function> — <reason> must be tracked

...that is a signal to instead create a method that does both Y and X, and make the comment a doc comment on the method. The method becomes the contract; the comment becomes the documentation of the contract rather than the only enforcement of it.
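As a minimal illustration of this rule (toy types invented for this sketch, not BLIS code), wrapping the data-structure operation and its co-invariant mutation in one method turns the reminder comment into an enforced contract:

```go
package main

import "fmt"

// countedQueue is a toy queue whose pending counter must move in lockstep
// with every push and pop; the names here are illustrative only.
type countedQueue struct {
	events  []string
	pending int
}

// push is the single enforcement point: callers can no longer enqueue
// without the counter moving, so no "mirror push" comments are needed.
func (q *countedQueue) push(ev string) {
	q.events = append(q.events, ev)
	q.pending++
}

// pop is likewise the single decrement point.
func (q *countedQueue) pop() string {
	ev := q.events[0]
	q.events = q.events[1:]
	q.pending--
	return ev
}

func main() {
	q := &countedQueue{}
	q.push("arrival-1")
	q.push("arrival-2")
	q.pop()
	fmt.Println(q.pending) // 1
}
```

The doc comment on push becomes the contract's documentation; the method body is its enforcement.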

The <=0 guard in scheduleNextTick is a correct safety net against a missed call site, but it should be treated as a last-resort backstop, not the primary defense. The canonical method is the primary defense.


Verdict on This PR

This is a follow-up design improvement, not a blocker for this PR. The current implementation is correct (all sites are covered, <=0 guard provides a backstop). But pushArrival would be a clean incremental improvement — either as a follow-up issue or as an addition to the Bug 3 PR which is already stacked on this branch and touches promoteDeferred(), the one dead-code site that currently lacks the increment.

vishakha-ramani and others added 3 commits April 14, 2026 10:22
…rival

Introduce pushArrival(req, timeUs) as the single enforcement point for
the pendingArrivals invariant. Previously, pendingArrivals++ was scattered
across 6 call sites in two files with co-change reminder comments. Any
future push site that bypasses pushArrival is obviously wrong at review
time, without requiring the reviewer to know about pendingArrivals.

Also updates the dead-code promoteDeferred() path so it is correct if
ever reactivated.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@vishakha-ramani
Contributor Author

vishakha-ramani commented Apr 14, 2026

The pushArrival refactor suggested here will be handled in the Bug 3 fix (#1023), which already touches promoteDeferred() — the one dead-code site that currently lacks the increment. That PR will consolidate all ClusterArrivalEvent pushes into the single canonical method.

@sriumcp sriumcp merged commit d26cf40 into inference-sim:main Apr 14, 2026
1 of 2 checks passed


Development

Successfully merging this pull request may close these issues.

bug(autoscaler): scheduleNextTick had no termination guard — request-bounded runs hung indefinitely

2 participants