fix(autoscaler): termination guard for request-bounded runs (Bug 2) #1024
Conversation
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Add AutoscalerAnalyzerConfig to DeploymentConfig and effectiveAnalyzerConfig() helper for WVA reference defaults. NewClusterSimulator now wires DefaultCollector+V2SaturationAnalyzer+UnlimitedEngine+DirectActuator when ModelAutoscalerIntervalUs > 0, so cmd/ only sets config fields. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
… bundle conversion Reads node_pools and autoscaler sections from --policy-config YAML and converts them into DeploymentConfig. CLI flag --model-autoscaler-interval-us overrides the YAML value when explicitly set. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…improvements

Bug 2 — scheduleNextTick had no termination guard (autoscaler.go): ScalingTickEvent unconditionally rescheduled itself. On request-bounded runs (Horizon == MaxInt64), the cluster event queue never drained → infinite event loop at 100%+ CPU.

Added guard: stop self-scheduling when Horizon == MaxInt64 && pendingArrivals == 0 && inFlight == 0. Added pendingArrivals int field to ClusterSimulator, incremented in Run() per arrival push, decremented in ClusterArrivalEvent.Execute(). Invariant: pendingArrivals == count of ClusterArrivalEvents in the event queue.

Closes inference-sim#1022

Also includes:
- AnalyzerBundleConfig range validation in bundle.Validate() (R3: catch invalid thresholds at the CLI boundary before NewV2SaturationAnalyzer panics)
- resolvePolicies() refactored to return (*PolicyBundle, []ScorerConfig), eliminating the redundant second LoadPolicyBundle call in runCmd.RunE
- .gitignore: add docs/superpowers/ entry (agent-local artifacts)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Fix comment: assertINV1Conservation checks 5 of 8 INV-1 terms (not 9 — DeferredHorizonInterrupted is not a Metrics field). Keep autoscaler wiring tests and Bug 2 regression test added by this branch. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@claude /pr-review-toolkit:review-pr Please perform a thorough review of this PR with respect to both the original issue and its tracking parent issue. Evaluate whether the PR fully and correctly addresses all requirements and reviewer concerns. In particular, assess:

- Correctness and preservation of invariants

Finally, provide a clear verdict: Is this PR ready to merge? If not, what specific changes are required?
Claude finished @vishakha-ramani's task in 14m 13s

PR Review: fix(autoscaler): termination guard for request-bounded runs (Bug 2)
Overview

The core termination guard logic is sound for the stated scope (simple request-bounded, non-session, non-PD workloads), and the bundle validation and default pipeline wiring are reasonable.

Critical — Must Fix Before Merge

1. pendingArrivals is decremented at push sites that never incremented it

Every execution of these push sites decrements the counter without a matching increment, driving it negative. The PR description scopes this fix to "workloads without batch/background SLO-class requests" but does not mention this broader class of affected workloads. This is a silent regression for those combinations.

Minimal fix — change the guard from == 0 to <= 0 and warn on underflow:

```go
// cluster_event.go: Execute
cs.pendingArrivals--
if cs.pendingArrivals < 0 {
	logrus.Warnf("[cluster] pendingArrivals went negative (%d) — session/PD/drain re-injection without increment; autoscaler guard will use <= 0 semantics", cs.pendingArrivals)
}
```

Full fix — add pendingArrivals++ at every ClusterArrivalEvent push site.

2. Bundle cross-validation gap allows runtime panic from one-sided threshold override (R3 violation)
The current cross-check only fires when both fields are explicitly set:

```go
if a.ScaleUpThreshold != 0 && a.ScaleDownBoundary != 0 && a.ScaleDownBoundary >= a.ScaleUpThreshold {
```

A user who writes a one-sided override (e.g. only scale_down_boundary: 0.9) passes validation and then hits the NewV2SaturationAnalyzer panic at runtime. Fix: in Validate(), substitute effective defaults before the cross-check:

```go
effectiveUp := a.ScaleUpThreshold
if effectiveUp == 0 { effectiveUp = 0.8 }
effectiveDown := a.ScaleDownBoundary
if effectiveDown == 0 { effectiveDown = 0.4 }
if effectiveDown >= effectiveUp {
	return fmt.Errorf("autoscaler.analyzer.scale_down_boundary (%v) must be < effective scale_up_threshold (%v)", effectiveDown, effectiveUp)
}
```

Also add test cases to TestPolicyBundle_Validate_AnalyzerThresholds:

```go
{"scale_down_boundary above default scale_up_threshold", AnalyzerBundleConfig{ScaleDownBoundary: 0.9}, true},
{"scale_up_threshold below default scale_down_boundary", AnalyzerBundleConfig{ScaleUpThreshold: 0.3}, true},
```

Important — Should Fix

3. Regression test asserts termination but not INV-1 conservation

The test confirms Run() returns within the deadline but never verifies that all 10 requests actually completed. Add:

```go
agg := cs.AggregatedMetrics()
if agg.CompletedRequests != 10 {
	t.Errorf("INV-1: expected 10 completed, got %d (queued=%d, running=%d)",
		agg.CompletedRequests, agg.StillQueued, agg.StillRunning)
}
```

4. Stale comment in cluster.go — the default WVA pipeline is now wired in the constructor, but the comment still says otherwise.

5. Stale newAutoscalerPipeline doc comment.

6. Bundle node pool validation is incomplete (does not mirror NodePoolConfig.IsValid())
A user passing a node pool config that NodePoolConfig.IsValid() would reject (e.g. negative min_nodes or NaN cost_per_hour) sails past bundle validation.

7. No warning when replay silently drops autoscaler/node_pools bundle sections

```go
parsedScorerConfigs, _ := resolvePolicies(cmd)
```

If a user passes a bundle with autoscaler or node_pools sections to replay, those sections are silently ignored.

Minor — Suggestions

8. Double blank line (cluster.go)

9. Structural wiring test (…)

10. Zero cooldowns silently used when autoscaler enabled via CLI flag without bundle

11. Missing …

Positive Observations
Verdict: Not Ready to Merge

Two correctness bugs must be fixed: items 1 and 2 above.
Items 4–7 (stale comments, incomplete node pool validation, silent replay drop) are strong recommendations. Items 8–11 can follow as cleanup.
Critical fixes:
- Add pendingArrivals++ at 5 missing ClusterArrivalEvent push sites (session follow-ups in cluster.go, PD callbacks in detectDecodeCompletions, NodeReadyEvent callbacks in infra_lifecycle_event.go, drain-redirect re-injection) — prevents counter underflow and autoscaler infinite loop for session/PD/drain workloads
- Change scheduleNextTick guard from == 0 to <= 0 as a safety net
- Add INV-1 (CompletedRequests) assertion to TestAutoscaler_RequestBoundedRun_Terminates
- Fix bundle cross-validation: substitute effective defaults before the scale_down_boundary vs scale_up_threshold cross-check (R3 — a one-sided override with a default on the other side would panic at runtime)

Important fixes:
- Delegate node pool bundle validation to the full inline check set mirroring NodePoolConfig.IsValid() (min_nodes, initial_nodes, NaN/Inf guards, cost_per_hour, min_nodes <= max_nodes)
- Warn in replay when the bundle has autoscaler/node_pools sections (silent drop)
- Update stale cluster.go comment: default WVA pipeline wired in constructor
- Update stale newAutoscalerPipeline doc comment

Minor:
- Remove double blank line in cluster.go
- Add two one-sided-override test cases to TestPolicyBundle_Validate_AnalyzerThresholds

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@claude /pr-review-toolkit:review-pr Please perform a thorough review of this PR with respect to both the original issue and its tracking parent issue. Evaluate whether the PR fully and correctly addresses all requirements and reviewer concerns. In particular, assess:

- Correctness and preservation of invariants

Finally, provide a clear verdict: Is this PR ready to merge? If not, what specific changes are required?
Claude finished @vishakha-ramani's task in 6m 0s

PR Review: fix(autoscaler): termination guard for request-bounded runs (Bug 2)
Context

This review covers the full PR diff, including the fixes pushed since the previous review.

Previous Critical Issues: All Resolved ✓

1. All 5 active push sites now have pendingArrivals++

The guard in scheduleNextTick now uses <= 0 as a safety net.

2. Bundle cross-validation gap (one-sided threshold override)
3. INV-1 not verified in regression test
Previous Important Issues: All Resolved ✓
Remaining Minor Observations

1. No regression test for autoscaler + session-mode + request-bounded horizon
2. Missing test case for
3. Duplicate default constants

Both files hardcode the 0.8/0.4 threshold defaults.

4. No log line when autoscaler is enabled via CLI flag without a bundle

When the flag is set with no bundle, zero cooldowns are silently used (item 10 of the previous review).

5. No unit test for the default constants

If the default constants in the two files drift apart, no test catches it.

Positive Observations
Verdict: Ready to Merge

All critical issues from the previous review are resolved. All important issues are resolved. CI is green. The remaining items (observations 1–5 above) are genuine improvements, but none are blockers. They are appropriate candidates for follow-up issues or for the Bug 3 PR if already in context.
@claude What is this PR fixing? Will the fix work for both non-PD-disaggregated and PD-disaggregated scenarios? Is it respecting the module boundaries and separation of concerns in the project?
Claude finished @sriumcp's task in 2m 19s

Analysis: PR #1024 — What It Fixes, PD Coverage, and Module Boundaries
1. What is this PR fixing?

Root cause: In scheduleNextTick, the ScalingTickEvent unconditionally rescheduled itself, so on request-bounded runs (Horizon == math.MaxInt64) the cluster event queue never drained and the run spun forever.

The fix (sim/cluster/autoscaler.go):

```go
if cs.config.Horizon == math.MaxInt64 && cs.pendingArrivals <= 0 {
	var inFlight int
	for _, v := range cs.inFlightRequests {
		inFlight += v
	}
	if inFlight == 0 {
		return // no more work; don't self-schedule
	}
}
```

To make this work, a new pendingArrivals counter was added to ClusterSimulator.
The PR also bundles prerequisite wiring (policy bundle + autoscaler fields, cluster constructor, CLI flag), bundle validation for autoscaler/node-pool config, and a regression test.

2. Does the fix work for both non-PD and PD-disaggregated scenarios?

Short answer: Yes, correctly for both paths. Here is how each path is covered:

Non-PD (standard) path

Requests arrive via the pushes in Run(), and the counter is decremented in ClusterArrivalEvent.Execute().

PD-disaggregated path

When a request is disaggregated, the flow is:
The two PD-specific push sites in detectDecodeCompletions now increment the counter as well.
For non-session PD workloads (most common), the same accounting applies.
Both covered.

Dead code note

promoteDeferred() is inactive on this branch; the companion Bug 3 PR handles it.
Safety net

The guard uses <= 0 rather than == 0, so even an unexpected underflow cannot re-arm the infinite loop.

3. Does the PR respect module boundaries and separation of concerns?

Yes, with one minor observation. The project's module boundary rule is that cmd/ only sets config fields while simulation logic lives in sim/.
Minor observation: the policy-config translation in cmd/ necessarily knows about cluster config fields — a thin, acceptable coupling.

Summary

This PR correctly fixes the hang for all request-bounded workloads, including PD-disaggregated and session workloads. The autoscaler terminates when there are no more pending arrivals and no in-flight requests on any instance — covering both simple and disaggregated request lifecycles. Module boundaries are respected: the termination logic is fully encapsulated in sim/cluster.
@claude you mentioned that pendingArrivals++ is scattered across multiple push sites, each with a co-change reminder comment.

Is there a design fix for this, and also other things like this, so that variables like this don't get scattered throughout the codebase but can be updated and maintained in a cleaner way?
Claude finished @sriumcp's task in 2m 30s

Design Fix for Scattered Counter Sites
The Core Problem

The pendingArrivals invariant is currently enforced only by convention. There are currently 6 push sites across 2 files, each with a "// mirror Run() —" reminder. Every future push site (e.g., if promoteDeferred() is ever reactivated) must remember to mirror the increment, or the autoscaler guard silently breaks.

The Fix: A Canonical Push Method

Add a single method to ClusterSimulator:

```go
// pushArrival enqueues a ClusterArrivalEvent and increments pendingArrivals atomically.
// All ClusterArrivalEvent pushes MUST go through this method — it is the single
// enforcement point for the pendingArrivals invariant used by scheduleNextTick.
func (cs *ClusterSimulator) pushArrival(req *sim.Request, timeUs int64) {
	heap.Push(&cs.clusterEvents, clusterEventEntry{
		event: &ClusterArrivalEvent{time: timeUs, request: req},
		seqID: cs.nextSeqID(),
	})
	cs.pendingArrivals++
}
```

Every push site in cluster.go and infra_lifecycle_event.go then collapses to a one-line call:

```go
// before (6 sites, each with a co-change comment):
heap.Push(&cs.clusterEvents, clusterEventEntry{
	event: &ClusterArrivalEvent{time: next.ArrivalTime, request: next},
	seqID: cs.nextSeqID(),
})
cs.pendingArrivals++ // mirror Run() — session follow-ups must be tracked

// after (1 method, no comment needed):
cs.pushArrival(next, next.ArrivalTime)
```

The decrement stays in ClusterArrivalEvent.Execute(). This is the direct application of R4 (canonical constructor) extended to operations: just as R4 says "struct literals in exactly one place," the canonical operation pattern says "co-invariant mutations in exactly one method." Any future push site that bypasses pushArrival is obviously wrong at review time, without requiring the reviewer to know about pendingArrivals.

The General Principle
Other places in BLIS where this already applies well or could be tightened:
Broader Applicability: When to Apply This Pattern

This is worth capturing as a standard. Any time you find yourself writing a comment of the form "when you do X here, remember to also do Y there," that is a signal to instead create a method that does both Y and X, and make the comment a doc comment on the method. The method becomes the contract; the comment becomes the documentation of the contract rather than the only enforcement of it.

The Verdict on This PR

This is a follow-up design improvement, not a blocker for this PR. The current implementation is correct (all sites are covered, and the <= 0 guard backstops any future miss).
…rival Introduce pushArrival(req, timeUs) as the single enforcement point for the pendingArrivals invariant. Previously, pendingArrivals++ was scattered across 6 call sites in two files with co-change reminder comments. Any future push site that bypasses pushArrival is obviously wrong at review time, without requiring the reviewer to know about pendingArrivals. Also updates the dead-code promoteDeferred() path so it is correct if ever reactivated. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…o pushArrival" This reverts commit 4d76564.
Summary
Fixes #1022 — autoscaler-enabled blis run with any request-bounded workload hung indefinitely at 100%+ CPU.

Part of tracking issue #994.
Merge order: This PR should land first. PR for Bug 3 (#1023) is stacked on top of this branch.
What this PR contains
Wiring (prerequisite, shared with other bug PRs)
Three commits that wire the autoscaler pipeline into the blis run CLI:

- feat(bundle): add node_pools and autoscaler sections to PolicyBundle
- feat(cluster): wire default autoscaler pipeline in NewClusterSimulator
- feat(cmd): wire --model-autoscaler-interval-us flag and policy-config → cluster config translation

Bug 2 fix
scheduleNextTick in sim/cluster/autoscaler.go unconditionally pushed the next ScalingTickEvent at nowUs + intervalUs. On request-bounded runs (Horizon == math.MaxInt64), the DES loop terminates only when the event queue is empty — but ScalingTickEvent kept the queue alive forever.
Horizon == math.MaxInt64 && pendingArrivals == 0 && inFlight == 0.New field
pendingArrivals intonClusterSimulatortracks the count ofClusterArrivalEvents in the queue at any moment:Run()per initial arrival pushClusterArrivalEvent.Execute()Note: This fix alone is sufficient for workloads without
batch/backgroundSLO-class requests. The companion Bug 3 PR handles the case wherepromoteDeferred()corrupts the counter for batch workloads.Wiring improvements (reviewed once here, shared basis for Bug 3 and Bug 1 PRs)
bundle.Validate(): add range validation forAnalyzerBundleConfigthresholds — catches invalidscale_up_threshold,scale_down_boundary,avg_input_tokens,kv_cache_thresholdat the CLI boundary (R3) before they reachNewV2SaturationAnalyzerpanicsresolvePolicies()refactored to return(*PolicyBundle, []ScorerConfig)— eliminates redundant secondLoadPolicyBundlecall inrunCmd.RunE.gitignore: adddocs/superpowers/entryTest
- TestAutoscaler_RequestBoundedRun_Terminates — creates a cluster with ModelAutoscalerIntervalUs = 100_000, runs 10 requests, asserts Run() completes within 2s (goroutine + time.After guard).
- TestPolicyBundle_Validate_AnalyzerThresholds — 9-case table test for analyzer config validation.

Test plan
- go test ./sim/... ./cmd/... passes
- go build ./... succeeds
- TestAutoscaler_RequestBoundedRun_Terminates passes and completes in < 2s
- TestPolicyBundle_Validate_AnalyzerThresholds — all 9 cases pass

🤖 Generated with Claude Code