
feat(autoscaler): Phase 1C CLI wiring + three end-to-end bug fixes + demo#995

Draft
vishakha-ramani wants to merge 8 commits into inference-sim:main from vishakha-ramani:feat/autoscaler-cli-wiring

Conversation

@vishakha-ramani
Contributor

Summary

  • Wires the Phase 1C autoscaler pipeline (DefaultCollector → V2SaturationAnalyzer → UnlimitedEngine → DirectActuator) into blis run via --policy-config; the data flow is sketched below this list.
  • Fixes three bugs that prevented the pipeline from running at all.
  • Adds examples/autoscaler-demo.yaml for a self-contained demo against the spike workload.
  • Adds docs/superpowers/ to .gitignore (agent-local artifacts, not committed).
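
For orientation, here is a minimal sketch of the data flow through the four pipeline stages named above. The interfaces and types are simplified stand-ins, not the project's real contracts:

```go
package sketch

// Simplified stand-ins for the four pipeline stages; the shapes are
// assumptions that sketch only the data flow, not the real API.
type metrics struct{ kvUtilization float64 }
type decision struct{ deltaInstances int }

type collector interface{ collect() metrics }        // DefaultCollector
type analyzer interface{ analyze(metrics) decision } // V2SaturationAnalyzer
type engine interface{ plan(decision) decision }     // UnlimitedEngine
type actuator interface{ apply(decision) }           // DirectActuator

// tick runs one autoscaler interval end to end.
func tick(c collector, a analyzer, e engine, act actuator) {
	act.apply(e.plan(a.analyze(c.collect())))
}
```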

Closes #994. Part of #696.


Commits

| Commit | Description |
| --- | --- |
| 534a41c | docs: autoscaler CLI wiring design spec |
| 3cfb9c9 | docs: autoscaler CLI wiring implementation plan |
| 9d20da9 | feat(bundle): add node_pools and autoscaler sections to PolicyBundle |
| b6dbe93 | feat(cluster): wire default autoscaler pipeline in NewClusterSimulator |
| 8233f29 | feat(cmd): wire --model-autoscaler-interval-us flag and policy-config bundle conversion |
| 3375856 | fix(autoscaler): three end-to-end bugs + demo + gitignore for superpowers |

Bugs Fixed (commit 3375856)

Bug 1 — DirectActuator.scaleUp was a stub

scaleUp called PlaceInstance() (GPU reservation) but never constructed InstanceSimulator objects. Scale-up was logged as successful but produced only phantom GPU reservations — no instance ever entered the routing pool, loaded, or served requests.

Fix: Rewrote scaleUp to mirror NodeReadyEvent.Execute() exactly: construct instance,
transition to Loading, register with CachedSnapshotProvider, schedule loaded event, append to
cs.instances, init inFlightRequests, wire OnRequestDone callback.
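
A minimal sketch of the shape of the fix, using stand-in types (the real signatures in direct_actuator.go differ; the step comments follow the list above):

```go
package sketch

// Stand-ins for the real simulator types; names follow the PR text,
// but the shapes are assumptions, not the project's actual API.
type InstanceSimulator struct {
	ID            string
	Model         string
	Loading       bool
	OnRequestDone func(requestID string)
}

type ClusterSimulator struct {
	instances        []*InstanceSimulator
	inFlightRequests map[string]int
}

// scaleUpSketch mirrors the node-ready creation path step by step.
func (cs *ClusterSimulator) scaleUpSketch(
	model, id string, now int64,
	registerSnapshot func(*InstanceSimulator),      // CachedSnapshotProvider.AddInstance
	scheduleLoaded func(*InstanceSimulator, int64), // loaded-event scheduling
	onDone func(requestID string),                  // completion callback
) {
	inst := &InstanceSimulator{ID: id, Model: model} // 1. construct instance
	inst.Loading = true                              // 2. transition to Loading
	registerSnapshot(inst)                           // 3. register with snapshot provider
	scheduleLoaded(inst, now)                        // 4. schedule loaded event
	cs.instances = append(cs.instances, inst)        // 5. enter the routing pool
	if cs.inFlightRequests == nil {
		cs.inFlightRequests = make(map[string]int)
	}
	cs.inFlightRequests[inst.ID] = 0 // 6. init in-flight accounting
	inst.OnRequestDone = onDone      // 7. wire OnRequestDone
}
```

Extracting this sequence into a helper shared with NodeReadyEvent.Execute() would remove the dual-maintenance risk flagged in the note below.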

Maintenance note: DirectActuator.scaleUp and NodeReadyEvent.Execute() are now parallel
paths. Any new field added to the node-ready creation path must be mirrored here.

Bug 2 — scheduleNextTick had no termination guard

ScalingTickEvent unconditionally rescheduled itself. On request-bounded runs (Horizon == math.MaxInt64), the cluster event queue never drains → infinite loop at 100%+ CPU.

Fix: Stop self-scheduling when Horizon == math.MaxInt64 && pendingArrivals == 0 && inFlight == 0. Added a pendingArrivals int field to ClusterSimulator, incremented in Run() per arrival push and decremented in ClusterArrivalEvent.Execute().
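
The guard's shape as a sketch, with assumed field names:

```go
package sketch

import "math"

// Stand-in for the termination-relevant simulator state; the field
// names follow the fix description, the struct itself is an assumption.
type tickState struct {
	horizon         int64 // math.MaxInt64 on request-bounded runs
	pendingArrivals int   // ClusterArrivalEvents still in the queue
	inFlight        int   // requests currently being served
}

// stopTicking reports whether ScalingTickEvent should stop
// rescheduling itself: only on request-bounded runs, and only once
// no arrivals remain pending and nothing is in flight.
func (s tickState) stopTicking() bool {
	return s.horizon == math.MaxInt64 &&
		s.pendingArrivals == 0 &&
		s.inFlight == 0
}
```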

Bug 3 — promoteDeferred corrupted pendingArrivals

Deferred batch/background requests were re-injected via promoteDeferred() as new
ClusterArrivalEvents without incrementing pendingArrivals. Each re-injected event decremented
the counter on execution, driving it to -N for N deferred requests. The Bug 2 guard checks
== 0 exactly, which is never true once the counter goes negative, so the hang persisted on any
workload with a batch/background cohort.

Fix: Increment pendingArrivals in promoteDeferred() per re-injected arrival.

Invariant: pendingArrivals == count of ClusterArrivalEvents currently in cs.clusterEvents.
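
One way to make this invariant structural rather than by-convention is a single push helper that every producer calls; a sketch with assumed types:

```go
package sketch

// Stand-ins for the real event types in cluster.go.
type clusterArrivalEvent struct{ atUs int64 }

type simState struct {
	pendingArrivals int
	clusterEvents   []clusterArrivalEvent
}

// pushArrival is the single choke point for injecting an arrival.
// If both Run() and promoteDeferred() call it, the invariant
// pendingArrivals == len(arrival events in queue) holds by construction.
func (s *simState) pushArrival(atUs int64) {
	s.clusterEvents = append(s.clusterEvents, clusterArrivalEvent{atUs: atUs})
	s.pendingArrivals++
}

// executeArrival models ClusterArrivalEvent.Execute() consuming one
// event and decrementing the counter.
func (s *simState) executeArrival() {
	s.clusterEvents = s.clusterEvents[1:]
	s.pendingArrivals--
}
```

Funneling both producers through one such helper would make the Bug 3 class of drift impossible to reintroduce.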


Demo

```bash
go build -o blis main.go

# Baseline: 1 fixed instance, no autoscaler
./blis run \
  --model meta-llama/Llama-2-7b-hf \
  --latency-model roofline --hardware A100-80 --tp 1 \
  --num-instances 1 \
  --workload-spec examples/regression_workload_load_spikes.yaml \
  --rate 30 --num-requests 3000

# With autoscaler (--log info shows scale events)
./blis run \
  --model meta-llama/Llama-2-7b-hf \
  --latency-model roofline --hardware A100-80 --tp 1 \
  --num-instances 1 \
  --policy-config examples/autoscaler-demo.yaml \
  --workload-spec examples/regression_workload_load_spikes.yaml \
  --rate 30 --num-requests 3000 --log info
```

Scale events observed:

```
[actuator] scale-up:   placed autoscale-...-000001 → demo-pool-1 (A100-80)
[actuator] scale-up:   placed autoscale-...-000002 → demo-pool-2 (A100-80)
[actuator] scale-up:   placed autoscale-...-000003 → demo-pool-3 (A100-80)
[actuator] scale-up:   placed autoscale-...-000004 → demo-pool-4 (A100-80)
[actuator] scale-down: draining autoscale-...-000001
[actuator] scale-up:   placed autoscale-...-000005 → demo-pool-5 (A100-80)
[actuator] scale-up:   placed autoscale-...-000006 → demo-pool-6 (A100-80)
[actuator] scale-up:   placed autoscale-...-000007 → demo-pool-7 (A100-80)
```

Results (3000 req, 30 req/s, Llama-2-7b, A100-80):

| Metric | Baseline | Autoscaler | Δ |
| --- | --- | --- | --- |
| Throughput | 23.1 req/s | 48.5 req/s | +110% |
| E2E mean | 49,474 ms | 32,008 ms | -35% |
| E2E p99 | 122,381 ms | 57,552 ms | -53% |
| TTFT mean | 41,825 ms | 25,029 ms | -40% |
| TTFT p99 | 115,336 ms | 45,891 ms | -60% |
| Preemptions | 84 | 12 | -86% |

Test plan

  • go test ./sim/cluster/... passes (43s, all tests green)
  • go build ./... passes
  • Baseline run completes without hanging
  • Autoscaler run completes without hanging (previously hung indefinitely)
  • Scale-up and scale-down events appear in --log info output
  • docs/superpowers/ does not appear in git status

🤖 Generated with Claude Code

vishakha-ramani and others added 7 commits April 9, 2026 10:40
Design document for wiring Phase 1A NodePools and Phase 1C-1b
autoscaler pipeline to the CLI via --policy-config YAML extension.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Add AutoscalerAnalyzerConfig to DeploymentConfig and effectiveAnalyzerConfig()
helper for WVA reference defaults. NewClusterSimulator now wires
DefaultCollector+V2SaturationAnalyzer+UnlimitedEngine+DirectActuator when
ModelAutoscalerIntervalUs > 0, so cmd/ only sets config fields.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
feat(cmd): wire --model-autoscaler-interval-us flag and policy-config bundle conversion

Reads node_pools and autoscaler sections from --policy-config YAML and
converts them into DeploymentConfig. CLI flag --model-autoscaler-interval-us
overrides the YAML value when explicitly set.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
fix(autoscaler): three end-to-end bugs + demo + gitignore for superpowers

Bug 1 — DirectActuator.scaleUp was a stub (direct_actuator.go):
Only called PlaceInstance() (GPU reservation) but never constructed
InstanceSimulator objects. Scale-up appeared to succeed but produced
phantom GPU reservations — no instance ever entered the routing pool,
loaded, or served requests. Rewrote scaleUp to mirror
NodeReadyEvent.Execute(): construct instance, transition to Loading,
register with CachedSnapshotProvider, schedule loaded event, append to
cs.instances, init inFlightRequests, wire OnRequestDone callback.

Bug 2 — scheduleNextTick had no termination guard (autoscaler.go):
ScalingTickEvent always rescheduled itself unconditionally. On
request-bounded runs (Horizon == MaxInt64), the cluster event queue
never empties → infinite event loop at 100%+ CPU. Added guard: stop
self-scheduling when Horizon == MaxInt64 && pendingArrivals == 0 &&
inFlight == 0. Added pendingArrivals int field to ClusterSimulator,
incremented in Run() per arrival push, decremented in
ClusterArrivalEvent.Execute().

Bug 3 — promoteDeferred corrupted pendingArrivals (cluster.go):
Deferred batch/background requests re-injected via promoteDeferred()
as new ClusterArrivalEvents without incrementing pendingArrivals. Each
re-injected event decremented the counter on execution, driving it to
-N for N deferred requests. The Bug 2 guard checks == 0 exactly —
never true when negative → hang persisted on workloads with batch
cohorts. Fixed by incrementing pendingArrivals per re-injected arrival
in promoteDeferred(), mirroring Run(). Invariant: pendingArrivals ==
count of ClusterArrivalEvents in cs.clusterEvents at all times.

Also adds:
- examples/autoscaler-demo.yaml: self-contained demo config with
  node_pools + autoscaler sections, tuned for regression_workload_load_spikes
- .gitignore: docs/superpowers/ (agent-local, not committed)
- cluster_test.go: remove orphaned merge conflict marker

Closes inference-sim#994

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@vishakha-ramani
Contributor Author

@claude /pr-review-toolkit:review-pr Please perform a thorough review of this PR with respect to the original issue.

Evaluate whether the PR fully and correctly addresses all requirements and reviewer concerns. In particular, assess:

  • Correctness and preservation of invariants
  • Separation of concerns and overall design discipline
  • Modularity and clarity of API boundaries/contracts
  • Behavioral integrity, including both behavioral and non-structural tests
  • Test coverage and quality (not just structure, but meaningful validation of behavior)
  • Performance implications and potential regressions
  • Adherence to our @docs/contributing/standards
  • Documentation quality, completeness, and accuracy (both user-facing and developer-facing)
  • All reviews and comments in this PR are addressed
Also identify any risks, edge cases, or missing considerations.

Finally, provide a clear verdict: Is this PR ready to merge? If not, what specific changes are required?

@claude

claude Bot commented Apr 9, 2026

Claude finished @vishakha-ramani's task in 8m 44s


PR Review: feat(autoscaler): Phase 1C CLI wiring + three end-to-end bug fixes + demo

Progress:

  • Initialize review
  • Gather context (read changed files, standards, invariants)
  • Analyze correctness, invariants, test coverage, design discipline
  • Post final verdict

Summary

This PR wires the Phase 1C autoscaler pipeline into blis run, fixes three real bugs that caused infinite loops, and adds a demo. CI passes (build + all tests green). The implementation is mostly sound, but there are two issues that should be addressed before merging.


Critical Issues

1. docs/superpowers/ files are committed — contradicts the gitignore entry

The PR adds docs/superpowers/ to .gitignore with the comment "agent-local, not committed." But the same PR commits two files under that path:

  • docs/superpowers/plans/2026-04-08-autoscaler-cli-wiring.md (744 lines)
  • docs/superpowers/specs/2026-04-08-autoscaler-cli-wiring-design.md (136 lines)

Once tracked, .gitignore has no effect on them. These files are now permanently in the repo's git history. The stated intent (agent-local artifacts should not be committed) contradicts the actual diff.

Required fix: Either git rm --cached docs/superpowers/plans/... docs/superpowers/specs/... to untrack them, or change the gitignore comment to reflect that these are intentional design docs. The PR description says "Adds docs/superpowers/ to .gitignore (agent-local artifacts, not committed)" — but they are committed.


2. No regression tests for Bug 2 and Bug 3

The PR description identifies these as critical correctness bugs (infinite loops), but no automated regression test verifies the fixes. Per CLAUDE.md BDD/TDD standards, critical bugs must have companion invariant tests.

Missing tests:

  • Bug 2: ModelAutoscalerIntervalUs > 0 on a request-bounded run (Horizon == math.MaxInt64) must terminate.
  • Bug 3 (combined): ModelAutoscalerIntervalUs > 0 + batch/background requests in the workload → pendingArrivals must not go negative, and the run must terminate. Without this, a future modification to promoteDeferred could silently re-introduce the hang.

The two existing tests (TestNewClusterSimulator_AutoscalerWiredWhenEnabled, TestNewClusterSimulator_AutoscalerNilWhenDisabled) only check that the pipeline is wired; they don't validate termination behavior.

Required fix: Add at minimum one test that runs NewClusterSimulator with ModelAutoscalerIntervalUs > 0 and a batch/background request, calls Run(), and asserts it completes without hanging (with a timeout guard in the test, e.g., via a goroutine + time.After).
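
A minimal sketch of that timeout-guard pattern; runWithTimeout is a hypothetical helper, and the real tests would pass a closure that builds the simulator with ModelAutoscalerIntervalUs > 0 and calls Run():

```go
package sketch

import (
	"testing"
	"time"
)

// runWithTimeout fails the test if fn does not return within d.
// Hypothetical helper illustrating the goroutine + time.After guard;
// note that a hanging fn leaks its goroutine, which is acceptable in
// a regression test that is about to fail anyway.
func runWithTimeout(t *testing.T, d time.Duration, fn func()) {
	t.Helper()
	done := make(chan struct{})
	go func() {
		defer close(done)
		fn()
	}()
	select {
	case <-done:
		// fn returned: the termination guard held
	case <-time.After(d):
		t.Fatal("simulation did not terminate within timeout")
	}
}
```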



Important Issues

3. Bundle loaded twice in cmd/root.go

LoadPolicyBundle is called in resolvePolicies() (line 804) and again in runCmd.RunE (line 1398). The comment acknowledges this: "Bundle was already loaded and validated in resolvePolicies(); safe to reload here." This works, but it's fragile — if the file is replaced between the two reads (unlikely but possible), the two loads see different data. A cleaner approach would be to have resolvePolicies() return the bundle as a second return value, or cache the bundle in a package-level variable after validation.

This is not blocking for merge, but the duplicate load should be tracked.
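
A sketch of the first option, with stub types standing in for the real ones (the actual signatures in cmd/root.go differ):

```go
package sketch

// Stub types standing in for sim.PolicyBundle and sim.ScorerConfig.
type policyBundle struct{ /* fields elided */ }
type scorerConfig struct{}

// loadPolicyBundle stands in for the real loader.
func loadPolicyBundle(path string) (*policyBundle, error) { return &policyBundle{}, nil }

func (b *policyBundle) validate() error          { return nil }
func (b *policyBundle) scorers() []scorerConfig  { return nil }

// resolvePolicies loads and validates exactly once, then hands the
// bundle to the caller so runCmd.RunE never reloads the file.
func resolvePolicies(path string) (*policyBundle, []scorerConfig, error) {
	b, err := loadPolicyBundle(path)
	if err != nil {
		return nil, nil, err
	}
	if err := b.validate(); err != nil {
		return nil, nil, err
	}
	return b, b.scorers(), nil
}
```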

4. AnalyzerBundleConfig threshold values not validated in bundle.Validate()

Validate() checks interval_us, cooldowns, and actuation delay, but does not validate the analyzer thresholds (kv_cache_threshold, scale_up_threshold, scale_down_boundary, avg_input_tokens). Invalid values (e.g., kv_cache_threshold: 2.0) pass bundle validation silently, then cause a panic in NewV2SaturationAnalyzer() at runtime (after the simulation has started). This violates R3: validate at CLI boundary.

The effectiveAnalyzerConfig() defaults protect zero-value fields, so only explicitly-set invalid values reach the panic path. But per R3, the CLI boundary should catch this, not a library panic.
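
A sketch of what such validation could look like; the field names follow the review text, and the legal ranges shown are assumptions:

```go
package sketch

import "fmt"

// Stand-in for the bundle's analyzer section; the struct shape and
// the exact legal ranges are assumptions, not the real definitions.
type analyzerBundleConfig struct {
	KVCacheThreshold float64
	ScaleUpThreshold float64
}

// validate applies the zero-means-default sentinel: zero fields are
// filled later by effectiveAnalyzerConfig, so only explicitly set
// values are range-checked here, at the CLI boundary (R3), instead of
// panicking later in NewV2SaturationAnalyzer.
func (a analyzerBundleConfig) validate() error {
	if a.KVCacheThreshold != 0 && (a.KVCacheThreshold < 0 || a.KVCacheThreshold > 1) {
		return fmt.Errorf("autoscaler.analyzer.kv_cache_threshold must be in (0, 1], got %v", a.KVCacheThreshold)
	}
	if a.ScaleUpThreshold != 0 && a.ScaleUpThreshold < 0 {
		return fmt.Errorf("autoscaler.analyzer.scale_up_threshold must be positive, got %v", a.ScaleUpThreshold)
	}
	return nil
}
```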

5. DirectActuator.scaleUp and NodeReadyEvent.Execute() are dual maintenance paths

The PR description explicitly flags this: "Any new field added to the node-ready creation path must be mirrored here." This is a latent R4 (construction site audit) violation waiting to happen. The two paths are already subtly different in one place: NodeReadyEvent uses cs.config.Model for inst.Model, while scaleUp uses d.ModelID. This is intentional (multi-model support), but it means the two paths cannot be mechanically kept in sync.

A comment co-locating them for easier auditability would help, or a shared helper function. Not blocking.


Positive Observations

  • Bug fixes are correct and precise: pendingArrivals is incremented in the right places (Run() on push, promoteDeferred() on re-injection, decremented in ClusterArrivalEvent.Execute()). The termination guard in scheduleNextTick correctly requires both pendingArrivals == 0 AND inFlight == 0.
  • scaleUp mirrors the NodeReadyEvent path faithfully: CacheQueryFn registration, OnRequestDone callback wiring, CachedSnapshotProvider.AddInstance, cs.instances append, inFlightRequests init — all present.
  • Bundle validation is comprehensive: Interval, cooldowns, actuation delay, node pool constraints (name, gpu_type, gpus_per_node, gpu_memory_gib, max/initial_nodes relationship) all validated.
  • Backward compatibility preserved: ModelAutoscalerIntervalUs == 0 disables autoscaler (INV-6). Existing policy YAML without autoscaler section parses cleanly (TestLoadPolicyBundle_AutoscalerAbsent_IsZero).
  • DelaySpec units are consistent: Bundle actuation_delay.mean is in seconds, and DelaySpec.Sample() converts to µs (see the sketch after this list). The autoscaler-demo.yaml mean: 2.0 = 2 s is correct.
  • Demo YAML is well-documented with a tuning guide and clear expected scale events.
  • CI passes (build + all tests green including 43s cluster test suite).
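
A sketch of that unit convention (the function name is hypothetical):

```go
package sketch

// delayMeanUs converts the bundle's actuation_delay.mean, given in
// seconds, to the simulator's microsecond clock.
func delayMeanUs(meanSeconds float64) int64 {
	return int64(meanSeconds * 1e6) // e.g. mean: 2.0 → 2,000,000 µs
}
```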

Verdict: Not Ready to Merge

Blocking:

  1. Remove docs/superpowers/ files from git tracking (contradicts gitignore intent and CLAUDE.md agent trust boundaries).
  2. Add termination regression tests for Bug 2 + Bug 3 (CLAUDE.md BDD/TDD requirement; critical bugs need companion invariant tests).

Non-blocking (should be tracked):
3. Bundle loaded twice in cmd/root.go — clean up or document.
4. Add analyzer config range validation to bundle.Validate() (R3).
5. Consider extracting shared instance-construction helper to reduce scaleUp/NodeReadyEvent drift.

…gression tests, validate analyzer config, eliminate double bundle load

- `git rm --cached` the two docs/superpowers/ files that were inadvertently
  committed; .gitignore already lists the directory as agent-local
- Add TestAutoscaler_RequestBoundedRun_Terminates and
  TestAutoscaler_WithBatchRequests_Terminates as regression tests for
  Bug 2 (infinite tick loop) and Bug 3 (pendingArrivals going negative
  from deferred batch re-injection); both must complete in ≤2s
- Add range validation for AnalyzerBundleConfig thresholds in
  bundle.Validate() — catches invalid scale_up_threshold,
  scale_down_boundary, avg_input_tokens, and kv_cache_threshold at the
  CLI boundary (R3) before they reach NewV2SaturationAnalyzer panics;
  add TestPolicyBundle_Validate_AnalyzerThresholds table test
- Refactor resolvePolicies() to return (*sim.PolicyBundle, []sim.ScorerConfig)
  so runCmd.RunE can reuse the loaded+validated bundle instead of calling
  LoadPolicyBundle a second time; replay.go discards the bundle (no autoscaler)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@vishakha-ramani
Contributor Author

@claude /pr-review-toolkit:review-pr Please perform a thorough review of this PR with respect to the original issue.

Evaluate whether the PR fully and correctly addresses all requirements and reviewer concerns. In particular, assess:

  • Correctness and preservation of invariants
  • Separation of concerns and overall design discipline
  • Modularity and clarity of API boundaries/contracts
  • Behavioral integrity, including both behavioral and non-structural tests
  • Test coverage and quality (not just structure, but meaningful validation of behavior)
  • Performance implications and potential regressions
  • Adherence to our @docs/contributing/standards
  • Documentation quality, completeness, and accuracy (both user-facing and developer-facing)
  • All reviews and comments in this PR are addressed
Also identify any risks, edge cases, or missing considerations.

Finally, provide a clear verdict: Is this PR ready to merge? If not, what specific changes are required?

@claude

claude Bot commented Apr 9, 2026

Claude finished @vishakha-ramani's task in 8m 56s


PR Review: feat(autoscaler): Phase 1C CLI wiring + three end-to-end bug fixes + demo

Progress:

  • Initialize review
  • Gather context (read changed files, check latest commits, verify previous review fixes)
  • Analyze correctness, invariants, test coverage, design discipline
  • Post final verdict

Status of Previous Review's Blocking Issues

The latest commit (97a1640) addresses all four issues raised in the previous review:

Previously Blocking — now resolved:

| # | Issue | Fix |
| --- | --- | --- |
| 1 | docs/superpowers/ files committed (contradicted gitignore) | git rm --cached removed both files; the directory no longer exists on disk |
| 2 | No regression tests for Bug 2 and Bug 3 | TestAutoscaler_RequestBoundedRun_Terminates + TestAutoscaler_WithBatchRequests_Terminates added with 2s timeout guards |
| 3 | Bundle loaded twice in cmd/root.go | resolvePolicies() now returns (*sim.PolicyBundle, []sim.ScorerConfig); runCmd.RunE reuses the loaded bundle; replay.go discards it with _ |
| 4 | AnalyzerBundleConfig thresholds not validated (R3 violation) | Comprehensive range validation added in bundle.Validate() with TestPolicyBundle_Validate_AnalyzerThresholds table test |

Fresh Review of All Changes

Correctness

Bug fixes verified:

  • Bug 2 (infinite tick): scheduleNextTick guard at autoscaler.go:284 correctly stops self-scheduling when Horizon == math.MaxInt64 && pendingArrivals == 0 && inFlight == 0. Logic is sound.
  • Bug 3 (negative pendingArrivals): pendingArrivals is now incremented in exactly two places: Run() (line 513) per initial arrival push, and promoteDeferred() (line 1083) per re-injected arrival. Decremented in ClusterArrivalEvent.Execute() (line 106). The counter correctly tracks the invariant: pendingArrivals == count of ClusterArrivalEvents currently in cs.clusterEvents.
  • Bug 1 (scaleUp stub): DirectActuator.scaleUp now mirrors NodeReadyEvent.Execute() faithfully: construct instance → TransitionTo(Loading) → csp.AddInstance → scheduleInstanceLoadedEvent → cs.instances append → cs.inFlightRequests init → cacheQueryFn registration → OnRequestDone wiring.

Analyzer threshold validation is consistent with panics: bundle.Validate() uses the zero-means-default sentinel (a.KVCacheThreshold != 0 && invalid) matching effectiveAnalyzerConfig's zero-fill logic. Explicitly-set invalid values are caught at the CLI boundary (R3) before reaching NewV2SaturationAnalyzer's panics.

INV-1 preservation: TestAutoscaler_WithBatchRequests_Terminates includes a partial conservation check (completed + queued + running + dropped + timedout == len(base)). The missing fields (RoutingRejections, GatewayQueueDepth, GatewayQueueShed, DeferredHorizonInterrupted) are all zero in the test setup (no gateway queue, request-bounded run, deferred requests fully promoted). The partial check is appropriate for this regression guard.

INV-6 preserved: ModelAutoscalerIntervalUs == 0 leaves cs.autoscaler = nil; existing tests without autoscaler config are unaffected.

Minor Findings (non-blocking)

Variable shadowing in direct_actuator.go:90:

```go
for i := 0; i < d.Delta; i++ {          // outer i
    ...
    for i := range cs.config.NodePools { // inner i shadows outer
```

Session follow-ups not tracked in pendingArrivals:
OnRequestDone closures in both NewClusterSimulator and scaleUp push ClusterArrivalEvents without incrementing pendingArrivals. This matches the pre-existing behavior in NewClusterSimulator (line 441) and the PD path (lines 899, 924). If autoscaler + session mode are combined, the autoscaler could stop ticking slightly early (when the last initial request completes, before follow-ups start). This doesn't cause hangs or incorrect results — it's an edge case for an unsupported combination. Consistent behavior across all three construction paths.

Design Discipline

  • resolvePolicies() signature change is clean — no data duplication, replay correctly discards bundle.
  • effectiveAnalyzerConfig() is correctly placed as a pure function in cluster.go (not in cmd/), accessible to tests.
  • The NodePoolBundleConfig/DelayBundleSpec mirror types in sim/bundle.go are the right pattern to avoid sim → sim/cluster circular import. Commented accordingly.
  • --model-autoscaler-interval-us flag override takes precedence over bundle value when cmd.Flags().Changed(...) is set — correct R18 (CLI flags override YAML) behavior.
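
A sketch of that precedence check; pflag's Changed reports whether the user set the flag explicitly, while the surrounding function and its names are hypothetical:

```go
package sketch

import "github.com/spf13/cobra"

// effectiveIntervalUs resolves the autoscaler interval: an explicitly
// set CLI flag wins over the bundle value (R18); otherwise a positive
// bundle value applies; otherwise the flag default (0, which disables
// the autoscaler). Sketch only, not the project's actual code.
func effectiveIntervalUs(cmd *cobra.Command, flagVal, bundleVal int64) int64 {
	if cmd.Flags().Changed("model-autoscaler-interval-us") {
		return flagVal // user set the flag explicitly: CLI over YAML
	}
	if bundleVal > 0 {
		return bundleVal
	}
	return flagVal
}
```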

Test Coverage

New tests verify:

  • TestLoadPolicyBundle_AutoscalerSection — round-trip parse of autoscaler YAML fields
  • TestLoadPolicyBundle_NodePoolsSection — round-trip parse of node pool YAML fields
  • TestLoadPolicyBundle_AutoscalerAbsent_IsZero — backward compat: bundle without autoscaler section parses cleanly
  • TestPolicyBundle_Validate_AutoscalerNegativeInterval — negative interval rejected
  • TestPolicyBundle_Validate_AnalyzerThresholds (table, 9 cases) — CLI boundary validation
  • TestPolicyBundle_Validate_NodePool_MissingName / _MaxLessThanInitial — node pool validation
  • TestNewClusterSimulator_AutoscalerWiredWhenEnabled / NilWhenDisabled — pipeline wiring
  • TestAutoscaler_RequestBoundedRun_Terminates — Bug 2 regression (2s timeout guard)
  • TestAutoscaler_WithBatchRequests_Terminates + INV-1 check — Bug 3 regression (2s timeout guard)
  • TestRunCmd_ModelAutoscalerIntervalUs_FlagRegistered — flag registered with correct default

Coverage is thorough for the new surfaces. All tests in sim/cluster/... pass (CI green).

Documentation

examples/autoscaler-demo.yaml is well-documented with a tuning guide, expected log output markers, and clear comment about UnlimitedEngine ignoring inventory (honest about the limitation). Design spec and implementation plan were correctly removed from the tracked tree.


Verdict: Ready to Merge ✅

All four blocking issues from the prior review are resolved. CI passes (build + lint + all tests). The two minor new findings (variable shadowing, session+autoscaler edge case) are non-blocking — the logic is correct in both cases and the behavior is consistent with the rest of the codebase.


@vishakha-ramani
Contributor Author

Per the review request, the three bug fixes have been broken out into sub-issues:

| # | Bug | Sub-issue |
| --- | --- | --- |
| 1 | DirectActuator.scaleUp only reserved GPU slots, never created live instances | #1021 |
| 2 | scheduleNextTick had no termination guard — request-bounded runs hung indefinitely | #1022 |
| 3 | promoteDeferred did not increment pendingArrivals — termination guard silently broken for batch workloads | #1023 |

Each sub-issue contains a full description of the symptom, root cause, proposed fix (with before/after code), affected files, and the regression test that guards against recurrence.

The intent is to restructure this PR into three smaller PRs — one per bug fix — to make each change individually reviewable.
