
fix(autoscaler): termination guard for request-bounded runs (Bug 2) #1024

Merged
sriumcp merged 10 commits into inference-sim:main from vishakha-ramani:fix/autoscaler-bug2-termination-guard
Apr 14, 2026

Conversation


@vishakha-ramani vishakha-ramani commented Apr 10, 2026

Summary

Fixes #1022 — an autoscaler-enabled blis run with any request-bounded workload hung indefinitely at 100%+ CPU.

Part of tracking issue #994.

Merge order: This PR should land first. PR for Bug 3 (#1023) is stacked on top of this branch.


What this PR contains

Wiring (prerequisite, shared with other bug PRs)

Three commits that wire the autoscaler pipeline into the blis run CLI:

  • feat(bundle): add node_pools and autoscaler sections to PolicyBundle
  • feat(cluster): wire default autoscaler pipeline in NewClusterSimulator
  • feat(cmd): wire --model-autoscaler-interval-us flag and policy-config → cluster config translation

Bug 2 fix

scheduleNextTick in sim/cluster/autoscaler.go unconditionally pushed the next ScalingTickEvent at nowUs + intervalUs. On request-bounded runs (Horizon == math.MaxInt64), the DES loop terminates only when the event queue is empty — but ScalingTickEvent kept the queue alive forever.

Fix: Stop self-scheduling when Horizon == math.MaxInt64 && pendingArrivals == 0 && inFlight == 0.

New field pendingArrivals int on ClusterSimulator tracks the count of ClusterArrivalEvents in the queue at any moment:

  • Incremented in Run() per initial arrival push
  • Decremented in ClusterArrivalEvent.Execute()

Note: This fix alone is sufficient for workloads without batch/background SLO-class requests. The companion Bug 3 PR handles the case where promoteDeferred() corrupts the counter for batch workloads.

Wiring improvements (reviewed once here, shared basis for Bug 3 and Bug 1 PRs)

  • bundle.Validate(): add range validation for AnalyzerBundleConfig thresholds — catches invalid scale_up_threshold, scale_down_boundary, avg_input_tokens, kv_cache_threshold at the CLI boundary (R3) before they reach NewV2SaturationAnalyzer panics
  • resolvePolicies() refactored to return (*PolicyBundle, []ScorerConfig) — eliminates redundant second LoadPolicyBundle call in runCmd.RunE
  • .gitignore: add docs/superpowers/ entry

Test

TestAutoscaler_RequestBoundedRun_Terminates — creates a cluster with ModelAutoscalerIntervalUs = 100_000, runs 10 requests, asserts Run() completes within 2s (goroutine + time.After guard).

TestPolicyBundle_Validate_AnalyzerThresholds — 9-case table test for analyzer config validation.


Test plan

  • go test ./sim/... ./cmd/... passes
  • go build ./... succeeds
  • TestAutoscaler_RequestBoundedRun_Terminates passes and completes in < 2s
  • TestPolicyBundle_Validate_AnalyzerThresholds — all 9 cases pass

🤖 Generated with Claude Code

vishakha-ramani and others added 4 commits April 10, 2026 18:19
Add AutoscalerAnalyzerConfig to DeploymentConfig and effectiveAnalyzerConfig()
helper for WVA reference defaults. NewClusterSimulator now wires
DefaultCollector+V2SaturationAnalyzer+UnlimitedEngine+DirectActuator when
ModelAutoscalerIntervalUs > 0, so cmd/ only sets config fields.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
… bundle conversion

Reads node_pools and autoscaler sections from --policy-config YAML and
converts them into DeploymentConfig. CLI flag --model-autoscaler-interval-us
overrides the YAML value when explicitly set.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…improvements

Bug 2 — scheduleNextTick had no termination guard (autoscaler.go):
ScalingTickEvent unconditionally rescheduled itself. On request-bounded
runs (Horizon == MaxInt64), the cluster event queue never drained →
infinite event loop at 100%+ CPU. Added guard: stop self-scheduling when
Horizon == MaxInt64 && pendingArrivals == 0 && inFlight == 0.

Added pendingArrivals int field to ClusterSimulator, incremented in
Run() per arrival push, decremented in ClusterArrivalEvent.Execute().
Invariant: pendingArrivals == count of ClusterArrivalEvents in event queue.

Closes inference-sim#1022

Also includes:
- AnalyzerBundleConfig range validation in bundle.Validate() (R3: catch
  invalid thresholds at CLI boundary before NewV2SaturationAnalyzer panics)
- resolvePolicies() refactored to return (*PolicyBundle, []ScorerConfig)
  eliminating redundant second LoadPolicyBundle call in runCmd.RunE
- .gitignore: add docs/superpowers/ entry (agent-local artifacts)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
vishakha-ramani and others added 2 commits April 10, 2026 18:47
Fix comment: assertINV1Conservation checks 5 of 8 INV-1 terms (not 9 —
DeferredHorizonInterrupted is not a Metrics field). Keep autoscaler
wiring tests and Bug 2 regression test added by this branch.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@vishakha-ramani (Contributor Author):

@claude /pr-review-toolkit:review-pr Please perform a thorough review of this PR with respect to both the original issue and its tracking parent issue.

Evaluate whether the PR fully and correctly addresses all requirements and reviewer concerns. In particular, assess:

Correctness and preservation of invariants
Separation of concerns and overall design discipline
Modularity and clarity of API boundaries/contracts
Behavioral integrity, including both behavioral and non-structural tests
Test coverage and quality (not just structure, but meaningful validation of behavior)
Performance implications and potential regressions
Adherence to our @docs/contributing/standards,
Documentation quality, completeness, and accuracy (both user-facing and developer-facing)
All reviews and comments in this PR are addressed
Also identify any risks, edge cases, or missing considerations.

Finally, provide a clear verdict: Is this PR ready to merge? If not, what specific changes are required?


claude Bot commented Apr 11, 2026

Claude finished @vishakha-ramani's task in 14m 13s


PR Review: fix(autoscaler): termination guard for request-bounded runs (Bug 2)

Tasks

  • Initialize review comment
  • Gather context (read changed files, related issues, standards docs)
  • Run code review agent
  • Run test coverage agent
  • Run error handling / silent failure agent
  • Synthesize findings and post verdict

Overview

The core termination guard logic is sound for the stated scope (simple request-bounded, non-session, non-PD workloads), and the bundle validation, default pipeline wiring, and resolvePolicies refactor are all well-structured. However, there are two correctness bugs that must be resolved before merging, one test gap that should be addressed, and several smaller issues.


Critical — Must Fix Before Merge

1. pendingArrivals counter underflows for session / PD / drain-redirect workloads

pendingArrivals is incremented only in Run() at cluster.go:516, one increment per initial request. However there are six additional sites that push a ClusterArrivalEvent without a matching increment:

| File | Line | Context |
|---|---|---|
| cluster.go | 444 | Session follow-up requests from OnRequestDone closure |
| cluster.go | 895 | PD session follow-ups in detectDecodeCompletions |
| cluster.go | 920 | PD decode-timeout session follow-ups |
| cluster.go | 1073 | promoteDeferred() re-injection |
| infra_lifecycle_event.go | 113 | NodeReadyEvent follow-ups |
| infra_lifecycle_event.go | 334 | Drain-redirect re-injection |

Every execution of these ClusterArrivalEvents decrements pendingArrivals at cluster_event.go:108 without a preceding increment. In session-mode runs the counter goes negative; the guard cs.pendingArrivals == 0 is never satisfied; the original infinite-loop bug reappears for any workload combining a request-bounded horizon with sessions, PD disaggregation, or drain-redirects and an enabled autoscaler.

The PR description scopes this fix to "workloads without batch/background SLO-class requests" but does not mention this broader class of affected workloads. This is a silent regression for those combinations.

Minimal fix — change the guard from == 0 to <= 0, or add a warn-and-clamp to Execute() (matching the existing inFlightRequests underflow guard pattern):

// cluster_event.go:Execute
cs.pendingArrivals--
if cs.pendingArrivals < 0 {
    logrus.Warnf("[cluster] pendingArrivals went negative (%d) — session/PD/drain re-injection without increment; autoscaler guard will use <= 0 semantics", cs.pendingArrivals)
}

and update the guard in autoscaler.go:287 to cs.pendingArrivals <= 0.

Full fix — add cs.pendingArrivals++ at all six push sites for complete bookkeeping correctness.


2. Bundle cross-validation gap allows runtime panic from one-sided threshold override (R3 violation)

bundle.go:263–265 cross-checks the threshold pair only when both are explicitly non-zero:

if a.ScaleUpThreshold != 0 && a.ScaleDownBoundary != 0 && a.ScaleDownBoundary >= a.ScaleUpThreshold {

A user who writes scale_down_boundary: 0.9 and leaves scale_up_threshold unset (zero = use default 0.8) passes Validate(). Then effectiveAnalyzerConfig fills in ScaleUpThreshold = 0.8, and NewV2SaturationAnalyzer panics because 0.9 >= 0.8. This violates R3 (validate at CLI boundary). The same problem exists in the reverse direction (scale_up_threshold: 0.3 with default scale_down_boundary = 0.4).

Fix: in Validate(), substitute known defaults before the cross-check:

effectiveUp := a.ScaleUpThreshold
if effectiveUp == 0 { effectiveUp = 0.8 }
effectiveDown := a.ScaleDownBoundary
if effectiveDown == 0 { effectiveDown = 0.4 }
if effectiveDown >= effectiveUp {
    return fmt.Errorf("autoscaler.analyzer.scale_down_boundary (%v) must be < effective scale_up_threshold (%v)", effectiveDown, effectiveUp)
}

Also add a test case to TestPolicyBundle_Validate_AnalyzerThresholds:

{"scale_down_boundary above default scale_up_threshold", AnalyzerBundleConfig{ScaleDownBoundary: 0.9}, true},
{"scale_up_threshold below default scale_down_boundary", AnalyzerBundleConfig{ScaleUpThreshold: 0.3}, true},

Important — Should Fix

3. TestAutoscaler_RequestBoundedRun_Terminates does not verify INV-1 conservation (cluster_test.go:2601)

The test confirms Run() exits without error within 2 s but does not assert that all 10 requests completed. A future off-by-one in pendingArrivals bookkeeping could cause premature termination with silent data loss while the test still passes. Add after <-done:

agg := cs.AggregatedMetrics()
if agg.CompletedRequests != 10 {
    t.Errorf("INV-1: expected 10 completed, got %d (queued=%d, running=%d)",
        agg.CompletedRequests, agg.StillQueued, agg.StillRunning)
}

4. Stale comment in cluster.go (~line 348–350) says pipeline components "are injected by tests or cmd/ after construction." That is no longer true — the default WVA pipeline is now wired inside the constructor. Update the comment to reflect that tests needing custom components replace cs.autoscaler after construction.

5. Stale newAutoscalerPipeline doc comment (autoscaler.go:197) says "Components may be nil at construction time — they are injected by tests or cmd/ before Run()." Same issue as above; misleads tests that rely on the old nil-at-construction contract.

6. Bundle node pool validation is incomplete (bundle.go:267–284)

NodePoolConfig.IsValid() in infra_config.go checks several conditions that the bundle validator omits:

  • min_nodes >= 0
  • min_nodes <= max_nodes (bundle only checks max >= initial, not max >= min)
  • initial_nodes >= 0
  • NaN/Inf guards on gpu_memory_gib, provisioning_delay.mean/.stddev, cost_per_hour
  • cost_per_hour >= 0

A user passing min_nodes: -5 or cost_per_hour: -10 in a bundle YAML passes Validate() and then hits an uncontrolled panic in NewPlacementManager rather than a clean logrus.Fatalf. Align the bundle validator with the full IsValid() check set.

7. No warning when replay silently drops autoscaler/node_pools bundle sections (replay.go:166)

parsedScorerConfigs, _ := resolvePolicies(cmd)

If a user passes --policy-config containing an autoscaler or node_pools section to blis replay, those sections are silently ignored. Since resolvePolicies does apply admission, routing, priority, scheduler, tenant_budgets, and slo_priorities from the same bundle, the user sees partial application with no diagnostic. A logrus.Warnf when bundle.Autoscaler.IntervalUs > 0 or len(bundle.NodePools) > 0 would surface the mismatch.


Minor — Suggestions

8. Double blank line (cluster.go:383–384) — two consecutive empty lines between the autoscaler wiring block and TenantTracker init. Go convention allows at most one.

9. Structural wiring test (cluster_test.go:2574–2585) accesses unexported autoscaler.collector/.analyzer/.engine/.actuator fields directly. Per CLAUDE.md BDD/TDD "refactor survival" principle, this test breaks if the pipeline struct is reorganized even when behavior is preserved. The behavioral companion TestAutoscaler_RequestBoundedRun_Terminates is the authoritative test; the wiring test is acceptable as a same-package smoke-test but should be recognized as structural.

10. Zero cooldowns silently used when the autoscaler is enabled via CLI flag without a bundle. Passing --model-autoscaler-interval-us without a bundle YAML results in ScaleUpCooldownUs == 0 and ScaleDownCooldownUs == 0, which means every tick fires a scale decision with no cooldown. A log line noting the zero-cooldown state would help users diagnose oscillating scale behavior.

11. Missing effectiveAnalyzerConfig unit test — the function fills zero fields with WVA reference defaults (KvCacheThreshold=0.8, ScaleUpThreshold=0.8, ScaleDownBoundary=0.4, AvgInputTokens=512). A change to any of these constants would silently pass bundle.Validate() but panic inside NewV2SaturationAnalyzer. A TestEffectiveAnalyzerConfig_ZeroFields_GetDefaults in cluster_test.go would close this.


Positive Observations

  • The termination guard logic in scheduleNextTick is correct and well-commented for the primary use case.
  • effectiveAnalyzerConfig cleanly separates "configure from zero" from "validate" responsibilities.
  • The TestLoadPolicyBundle_AutoscalerAbsent_IsZero backward-compatibility test is exactly right — existing YAML files without autoscaler sections must parse cleanly.
  • TestPolicyBundle_Validate_AnalyzerThresholds with 9 table-driven cases is idiomatic Go and covers the main boundary conditions.
  • TestLoadPolicyBundle_NodePoolsSection validating nested ProvisioningDelay deserialization is good defensive testing.
  • resolvePolicies refactor cleanly eliminates a duplicate LoadPolicyBundle call.
  • The goroutine + time.After pattern in TestAutoscaler_RequestBoundedRun_Terminates is the correct way to catch infinite loops without hanging CI.
  • R6 compliant — no logrus.Fatalf calls added to sim/ library code.

Verdict: Not Ready to Merge

Two correctness bugs and one test gap must be fixed:

  1. pendingArrivals underflow (cluster_event.go:108, six push sites without increment) — causes the original infinite-loop bug to reappear for autoscaler-enabled session/PD/drain workloads. Fix: <= 0 guard or warn-and-clamp.
  2. Bundle cross-validation gap (bundle.go:263) — one-sided threshold overrides can pass Validate() but cause a NewV2SaturationAnalyzer panic at runtime. Fix: substitute effective defaults before the cross-check.
  3. INV-1 not verified in regression test (cluster_test.go:2601) — required to prove the guard fires correctly, not just that the run terminates early.

Items 4–7 (stale comments, incomplete node pool validation, silent replay drop) are strong recommendations. Items 8–11 can follow as cleanup.

Critical fixes:
- Add pendingArrivals++ at 5 missing ClusterArrivalEvent push sites
  (session follow-ups in cluster.go, PD callbacks in detectDecodeCompletions,
  NodeReadyEvent callbacks in infra_lifecycle_event.go, drain-redirect
  re-injection) — prevents counter underflow and autoscaler infinite loop
  for session/PD/drain workloads
- Change scheduleNextTick guard from == 0 to <= 0 as safety net
- Add INV-1 (CompletedRequests) assertion to TestAutoscaler_RequestBoundedRun_Terminates
- Fix bundle cross-validation: substitute effective defaults before
  scale_down_boundary vs scale_up_threshold cross-check (R3 — one-sided
  override with default on the other side would panic at runtime)

Important fixes:
- Delegate node pool bundle validation to full inline check set mirroring
  NodePoolConfig.IsValid() (min_nodes, initial_nodes, NaN/Inf guards,
  cost_per_hour, min_nodes <= max_nodes)
- Warn in replay when bundle has autoscaler/node_pools sections (silent drop)
- Update stale cluster.go comment: default WVA pipeline wired in constructor
- Update stale newAutoscalerPipeline doc comment

Minor:
- Remove double blank line in cluster.go
- Add two one-sided-override test cases to TestPolicyBundle_Validate_AnalyzerThresholds

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@vishakha-ramani (Contributor Author):

@claude /pr-review-toolkit:review-pr Please perform a thorough review of this PR with respect to both the original issue and its tracking parent issue.

Evaluate whether the PR fully and correctly addresses all requirements and reviewer concerns. In particular, assess:

Correctness and preservation of invariants
Separation of concerns and overall design discipline
Modularity and clarity of API boundaries/contracts
Behavioral integrity, including both behavioral and non-structural tests
Test coverage and quality (not just structure, but meaningful validation of behavior)
Performance implications and potential regressions
Adherence to our @docs/contributing/standards,
Documentation quality, completeness, and accuracy (both user-facing and developer-facing)
All reviews and comments in this PR are addressed
Also identify any risks, edge cases, or missing considerations.

Finally, provide a clear verdict: Is this PR ready to merge? If not, what specific changes are required?


claude Bot commented Apr 12, 2026

Claude finished @vishakha-ramani's task in 6m 0s


PR Review: fix(autoscaler): termination guard for request-bounded runs (Bug 2)

Tasks

  • Initialize review comment
  • Check what changed since previous review (git log, diff)
  • Review the changed files for correctness
  • Verify CI status
  • Synthesize findings and post verdict

Context

This review covers the full PR diff against origin/main, with particular attention to the fix(autoscaler): address PR review comments commit (a3529ee) which addressed all findings from the prior review. CI is green (1 run, success).


Previous Critical Issues: All Resolved ✓

1. pendingArrivals underflow for session/PD/drain workloads

All 5 active push sites now have cs.pendingArrivals++:

  • cluster.go:443 — session follow-ups in OnRequestDone closure
  • cluster.go:895 — PD session follow-ups in detectDecodeCompletions
  • cluster.go:921 — PD decode-timeout follow-ups
  • infra_lifecycle_event.go:113 — NodeReadyEvent session re-injection
  • infra_lifecycle_event.go:334 — drain-redirect re-injection

The promoteDeferred() function at cluster.go:1071 still lacks the increment but is //nolint:unused with no call sites — dead code, not a concern.

The guard in scheduleNextTick now uses cs.pendingArrivals <= 0 (changed from == 0), providing an additional safety net for any future missed push site.

2. Bundle cross-validation gap (one-sided threshold override)

bundle.go:263–272 now substitutes effective defaults (0.8 for ScaleUpThreshold, 0.4 for ScaleDownBoundary) before the cross-check — matching effectiveAnalyzerConfig in cluster.go. The two previously-missing one-sided test cases are present:

{"scale_down_boundary above default scale_up_threshold (one-sided)", AnalyzerBundleConfig{ScaleDownBoundary: 0.9}, true},
{"scale_up_threshold below default scale_down_boundary (one-sided)", AnalyzerBundleConfig{ScaleUpThreshold: 0.3}, true},

3. INV-1 not verified in regression test

TestAutoscaler_RequestBoundedRun_Terminates now asserts agg.CompletedRequests == len(reqs) after Run() exits, preventing false-pass on premature termination.


Previous Important Issues: All Resolved ✓

| Issue | Status |
|---|---|
| Stale cluster.go comment (pipeline wired in constructor, not injected after) | Fixed |
| Stale newAutoscalerPipeline doc comment | Fixed |
| Node pool bundle validation incomplete (min_nodes, initial_nodes, NaN/Inf, cost) | Fixed — full inline check set mirrors NodePoolConfig.IsValid() |
| Replay silent drop of autoscaler/node_pools bundle sections | Fixed — logrus.Warnf for both cases in replay.go |
| Double blank line in cluster.go | Fixed |

Remaining Minor Observations

1. No regression test for autoscaler + session-mode + request-bounded horizon

TestAutoscaler_RequestBoundedRun_Terminates uses 10 plain (non-session) requests. The new pendingArrivals++ in the OnRequestDone closure (cluster.go:443) is not exercised. A future push site added without an increment would go undetected. The <= 0 guard is a safety net, not a test. Considered low risk for this PR but worth a follow-up test.

2. Missing test case for min_nodes > max_nodes in node pool validation

bundle.go:304 validates MinNodes > MaxNodes, but no test covers this path. TestPolicyBundle_Validate_NodePool_MaxLessThanInitial covers the sibling check (MaxNodes < InitialNodes) — a mirror test for the min/max relationship would close the gap cleanly.

3. Duplicate default constants between bundle.go and cluster.go

Both files hardcode 0.8/0.4 as the effective defaults for ScaleUpThreshold/ScaleDownBoundary. The PR adds comments marking these as "mirrors effectiveAnalyzerConfig," which is good. A named constant shared between the two packages would be cleaner but requires an API decision. Acceptable for now.

4. No log line when autoscaler is enabled via CLI flag without a bundle

When --model-autoscaler-interval-us > 0 is passed without --policy-config, cooldowns default to 0 with no diagnostic. The previous review noted this; it was not addressed. Very low severity since 0-cooldown is valid behavior (every tick fires), but could confuse users debugging oscillating scale. Nice-to-have for follow-up.

5. No unit test for effectiveAnalyzerConfig zero→default behavior

If the default constants in effectiveAnalyzerConfig are changed, NewV2SaturationAnalyzer could panic for zero-config callers while all tests pass. Low risk since the function is 8 lines, but TestEffectiveAnalyzerConfig_ZeroFields_GetDefaults would close this. Previously raised; still not added.


Positive Observations

  • The <= 0 guard in scheduleNextTick is good defensive engineering — it absorbs any missed push site rather than hanging indefinitely.
  • effectiveAnalyzerConfig cleanly separates zero→default filling from validation; the caller contract (zero = use default) is clearly documented.
  • Default magic numbers in bundle.go are commented with // WVA reference default (mirrors effectiveAnalyzerConfig) — this documents the synchronization requirement explicitly.
  • The TestNewClusterSimulator_AutoscalerWiredWhenEnabled test verifies all four pipeline components are non-nil after construction — good structural test to confirm the wiring refactor.
  • All 11 new bundle tests (autoscaler section, node pools, backward-compat, analyzer thresholds with one-sided cases) follow idiomatic table-driven Go patterns.
  • R6 compliant: no new logrus.Fatalf in sim/ library code.
  • resolvePolicies refactor cleanly eliminates the duplicate LoadPolicyBundle call without changing any observable behavior.

Verdict: Ready to Merge

All critical issues from the previous review are resolved. All important issues are resolved. CI is green. The remaining items (observations 1–5 above) are genuine improvements but none are blockers. They are appropriate candidates for follow-up issues or the Bug 3 PR if already in context.


sriumcp commented Apr 13, 2026

@claude What is this PR fixing? Will the fix work for both non-PD-disaggregated and PD-disaggregated scenarios? Is it respecting the module boundaries and separation of concerns in the project?


claude Bot commented Apr 13, 2026

Claude finished @sriumcp's task in 2m 19s


Analysis: PR #1024 — What It Fixes, PD Coverage, and Module Boundaries

Tasks

  • Initialize comment
  • Read key changed files (autoscaler.go, cluster.go, cluster_event.go, infra_lifecycle_event.go, bundle.go)
  • Analyze PD-disaggregated paths
  • Assess module boundary compliance
  • Post findings

1. What is this PR fixing?

Root cause: In sim/cluster/autoscaler.go, the scheduleNextTick method unconditionally pushed the next ScalingTickEvent onto the cluster event queue after every autoscaler tick — it had no exit condition. On a request-bounded run (no explicit --horizon, so Horizon == math.MaxInt64), the DES event loop terminates only when the event queue is empty. But the autoscaler's self-scheduling ScalingTickEvent kept the queue alive indefinitely after all requests finished, causing an infinite loop at 100%+ CPU.

The fix (autoscaler.go:290–298) adds a termination guard to scheduleNextTick:

if cs.config.Horizon == math.MaxInt64 && cs.pendingArrivals <= 0 {
    var inFlight int
    for _, v := range cs.inFlightRequests { inFlight += v }
    if inFlight == 0 {
        return // no more work; don't self-schedule
    }
}

To make this work, a new pendingArrivals int field tracks the number of ClusterArrivalEvents currently in the queue at any moment. It is:

  • Incremented in Run() for each initial arrival pushed (cluster.go:516)
  • Decremented in ClusterArrivalEvent.Execute() when each arrival is processed (cluster_event.go:108)

The PR also bundles prerequisite wiring (policy bundle + autoscaler fields, cluster constructor, CLI flag), bundle validation for autoscaler/node-pool config, and a resolvePolicies refactor.


2. Does the fix work for both non-PD and PD-disaggregated scenarios?

Short answer: Yes, correctly for both paths. Here is how each path is covered:

Non-PD (standard) path

Requests arrive via ClusterArrivalEvent → admission → routing → instance. Session follow-ups are pushed from the OnRequestDone closure in NewClusterSimulator. This push site has pendingArrivals++ (cluster.go:446), so the counter stays consistent.

PD-disaggregated path

When a request is disaggregated, the flow is:

  1. ClusterArrivalEvent.Execute() fires → pendingArrivals-- (one decrement for the arrival)
  2. Request flows to DisaggregationDecisionEvent → PrefillRoutingEvent (not a new ClusterArrivalEvent — no second decrement)
  3. Prefill completes → decode sub-request routed within the instance layer
  4. Decode completes → detectDecodeCompletions() may push follow-up ClusterArrivalEvents for session workloads

The two PD-specific push sites in detectDecodeCompletions both have matching increments:

  • cluster.go:898 — PD session follow-ups: cs.pendingArrivals++
  • cluster.go:924 — PD decode-timeout session follow-ups: cs.pendingArrivals++

For non-session PD workloads (most common), detectDecodeCompletions pushes no new ClusterArrivalEvents, so pendingArrivals correctly drains to ≤0 after all initial arrivals execute. The inFlight half of the guard (tracking both prefill and decode sub-requests routed through RoutingDecisionEvent) keeps the autoscaler ticking while PD work is actually in progress, then stops it cleanly when both conditions are met.

Infrastructure events (drain, NodeReadyEvent)

  • infra_lifecycle_event.go:116 — session follow-ups from newly-ready nodes: pendingArrivals++
  • infra_lifecycle_event.go:338 — drain-redirected requests: pendingArrivals++

Both covered.

Dead code note

promoteDeferred() at cluster.go:1071 pushes ClusterArrivalEvents without a pendingArrivals++, but it is marked //nolint:unused with zero call sites — it's dead code that carries no execution risk. If it is ever reactivated, the increment would need to be added.

Safety net

The guard uses <= 0 (not == 0), which absorbs any future push site that is accidentally added without an increment rather than looping forever.


3. Does the PR respect module boundaries and separation of concerns?

Yes, with one minor observation.

The project's module boundary rule is cmd/ → sim/cluster/ → sim/ (no reverse dependencies). Here is how each changed component aligns:

| Component | Location | Verdict |
|---|---|---|
| scheduleNextTick guard + pendingArrivals | sim/cluster/ | Correctly cluster-internal; not exposed to cmd/ |
| autoscalerPipeline struct + newAutoscalerPipeline | sim/cluster/autoscaler.go | Private, correct constructor pattern (R4) |
| Default WVA pipeline wiring | NewClusterSimulator in sim/cluster/cluster.go | Wired at construction, not injected post-hoc — correct (R4) |
| bundle.Validate() for autoscaler/node-pool thresholds | sim/bundle.go | Validates at the sim/ layer boundary (R3); available to both cmd/ and tests |
| resolvePolicies() refactor + autoscaler config translation | cmd/root.go | CLI-to-config translation lives entirely in cmd/ — correct |
| --model-autoscaler-interval-us flag | cmd/root.go | CLI entry point only; config passed down as ClusterConfig.ModelAutoscalerIntervalUs — no cmd/ dependency in sim/ |
| Replay warning for ignored autoscaler sections | cmd/replay.go | Diagnostic in cmd/; sim/cluster/ never knows replay dropped the sections — correct |

Minor observation: The pendingArrivals++ increments are scattered across 6 sites in two files (cluster.go lines 446, 898, 924, and infra_lifecycle_event.go lines 116, 338) plus Run() at line 516. This is a necessary consequence of the event-driven architecture — every push site must maintain the invariant locally. The pattern is consistent and each site has a comment explaining the co-change requirement (e.g., // mirror Run() — session follow-ups must be tracked). The <= 0 guard provides a backstop. This is a reasonable design tradeoff given the DES architecture.


Summary

This PR correctly fixes the hang for all request-bounded workloads, including PD-disaggregated and session workloads. The autoscaler terminates when there are no more pending arrivals and no in-flight requests on any instance — covering both simple and disaggregated request lifecycles. Module boundaries are respected: the termination logic is fully encapsulated in sim/cluster/, validation at the sim/ boundary, CLI wiring in cmd/.


sriumcp commented Apr 13, 2026

@claude you mentioned

Minor observation: The pendingArrivals++ increments are scattered across 6 sites in two files (cluster.go lines 446, 898, 924, and infra_lifecycle_event.go lines 116, 338) plus Run() at line 516. This is a necessary consequence of the event-driven architecture — every push site must maintain the invariant locally. The pattern is consistent and each site has a comment explaining the co-change requirement (e.g., // mirror Run() — session follow-ups must be tracked). The <= 0 guard provides a backstop. This is a reasonable design tradeoff given the DES architecture.

Is there a design fix for this, and also other things like this so that variables like this don't get scattered throughout the codebase but can be updated and maintained in a cleaner way.


claude Bot commented Apr 13, 2026

Claude finished @sriumcp's task in 2m 30s


Design Fix for Scattered Counter Sites

Tasks

  • Initialize comment
  • Read relevant files and trace all scatter sites
  • Analyze the scattered counter pattern and identify the canonical fix
  • Propose clean design alternatives with tradeoffs
  • Generalize to other BLIS patterns

The Core Problem

The pendingArrivals++ co-requirement is an example of a co-invariant operation: every time operation X happens (push a ClusterArrivalEvent), side-effect Y must also happen (increment the counter). The current code enforces this with comments at each call site rather than with structure. Comments don't compile.

There are currently 6 push sites across 2 files, each with a "// mirror Run() —" reminder. Every future push site (e.g., if promoteDeferred is reactivated, or a new session variant is added) requires the developer to know the co-requirement exists and remember to apply it.


The Fix: A Canonical Push Method

Add a single method to ClusterSimulator that owns the push-and-increment invariant:

// pushArrival enqueues a ClusterArrivalEvent and increments pendingArrivals atomically.
// All ClusterArrivalEvent pushes MUST go through this method — it is the single
// enforcement point for the pendingArrivals invariant used by scheduleNextTick.
func (cs *ClusterSimulator) pushArrival(req *sim.Request, timeUs int64) {
    heap.Push(&cs.clusterEvents, clusterEventEntry{
        event: &ClusterArrivalEvent{time: timeUs, request: req},
        seqID: cs.nextSeqID(),
    })
    cs.pendingArrivals++
}

Every push site in cluster.go, infra_lifecycle_event.go, and any future file collapses to:

// before (6 sites, each with a co-change comment):
heap.Push(&cs.clusterEvents, clusterEventEntry{
    event: &ClusterArrivalEvent{time: next.ArrivalTime, request: next},
    seqID: cs.nextSeqID(),
})
cs.pendingArrivals++ // mirror Run() — session follow-ups must be tracked

// after (1 method, no comment needed):
cs.pushArrival(next, next.ArrivalTime)

The decrement stays in ClusterArrivalEvent.Execute() — that's already a single, canonical location (every arrival executes exactly once).

This is the direct application of R4 (canonical constructor) extended to operations: just as R4 says "struct literals in exactly one place," the canonical operation pattern says "co-invariant mutations in exactly one method." Any future push site that bypasses pushArrival will be obviously wrong during code review, without relying on the reviewer knowing about pendingArrivals.


The General Principle

The pendingArrivals scatter is one instance of a recurring pattern in event-driven simulators. The general rule is:

When a state mutation must always accompany a data-structure operation, wrap both in a single method — never scatter the mutation across K call sites with co-change reminder comments.

Other places in BLIS where this already applies well or could be tightened:

| Pattern | Current state | Status |
| --- | --- | --- |
| pendingArrivals++ with heap.Push(ClusterArrivalEvent) | 6 scattered sites + comments | Fix with pushArrival |
| inFlightRequests[id]++ with routing dispatch | Single increment site (cluster_event.go:297), decrement in one place (cluster.go:578) | Already clean (only one increment site) |
| shedByTier[tier]++ on rejection | After this PR: correctly guarded in one block in AdmissionDecisionEvent.Execute | Already clean |
| cs.nextSeqID() required with every heap.Push | Repeated at every push site | pushArrival absorbs this too for arrival events |

Broader Applicability: When to Apply This Pattern

This is worth capturing as a standard. Any time you find yourself writing:

// (caller must also do X when doing Y)
// mirror <function> — <reason> must be tracked

...that is a signal to instead create a method that does both Y and X, and make the comment a doc comment on the method. The method becomes the contract; the comment becomes the documentation of the contract rather than the only enforcement of it.
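As a minimal illustration of this rule (toy types invented for this sketch, not BLIS code), wrapping the data-structure operation and its co-invariant mutation in one method turns the reminder comment into an enforced contract:

```go
package main

import "fmt"

// countedQueue is a toy queue whose pending counter must move in lockstep
// with every push and pop; the names here are illustrative only.
type countedQueue struct {
	events  []string
	pending int
}

// push is the single enforcement point: callers can no longer enqueue
// without the counter moving, so no "mirror push" comments are needed.
func (q *countedQueue) push(ev string) {
	q.events = append(q.events, ev)
	q.pending++
}

// pop is likewise the single decrement point.
func (q *countedQueue) pop() string {
	ev := q.events[0]
	q.events = q.events[1:]
	q.pending--
	return ev
}

func main() {
	q := &countedQueue{}
	q.push("arrival-1")
	q.push("arrival-2")
	q.pop()
	fmt.Println(q.pending) // 1
}
```

The doc comment on push becomes the contract's documentation; the method body is its enforcement.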

The <=0 guard in scheduleNextTick is a correct safety net against a missed call site, but it should be treated as a last-resort backstop, not the primary defense. The canonical method is the primary defense.


Verdict on This PR

This is a follow-up design improvement, not a blocker for this PR. The current implementation is correct (all sites are covered, <=0 guard provides a backstop). But pushArrival would be a clean incremental improvement — either as a follow-up issue or as an addition to the Bug 3 PR which is already stacked on this branch and touches promoteDeferred(), the one dead-code site that currently lacks the increment.

vishakha-ramani and others added 3 commits April 14, 2026 10:22
…rival

Introduce pushArrival(req, timeUs) as the single enforcement point for
the pendingArrivals invariant. Previously, pendingArrivals++ was scattered
across 6 call sites in two files with co-change reminder comments. Any
future push site that bypasses pushArrival is obviously wrong at review
time, without requiring the reviewer to know about pendingArrivals.

Also updates the dead-code promoteDeferred() path so it is correct if
ever reactivated.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@vishakha-ramani
Contributor Author

vishakha-ramani commented Apr 14, 2026

The pushArrival refactor suggested here will be handled in the Bug 3 fix (#1023), which already touches promoteDeferred() — the one dead-code site that currently lacks the increment. That PR will consolidate all ClusterArrivalEvent pushes into the single canonical method.

@sriumcp sriumcp merged commit d26cf40 into inference-sim:main Apr 14, 2026
1 of 2 checks passed


Development

Successfully merging this pull request may close these issues.

bug(autoscaler): scheduleNextTick had no termination guard — request-bounded runs hung indefinitely

2 participants