
feat(autoscaler): Phase 1C CLI wiring + three end-to-end bug fixes + demo#995

Draft
vishakha-ramani wants to merge 8 commits into inference-sim:main from vishakha-ramani:feat/autoscaler-cli-wiring

Conversation

@vishakha-ramani
Contributor

Summary

  • Wires the Phase 1C autoscaler pipeline (DefaultCollector → V2SaturationAnalyzer → UnlimitedEngine → DirectActuator) into blis run via --policy-config; the data flow is sketched below this list.
  • Fixes three bugs that prevented the pipeline from running at all.
  • Adds examples/autoscaler-demo.yaml for a self-contained demo against the spike workload.
  • Adds docs/superpowers/ to .gitignore (agent-local artifacts, not committed).
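
For orientation, here is a minimal sketch of the data flow through the four pipeline stages named above. The interfaces and types are simplified stand-ins, not the project's real contracts:

```go
package sketch

// Simplified stand-ins for the four pipeline stages; the shapes are
// assumptions that sketch only the data flow, not the real API.
type metrics struct{ kvUtilization float64 }
type decision struct{ deltaInstances int }

type collector interface{ collect() metrics }        // DefaultCollector
type analyzer interface{ analyze(metrics) decision } // V2SaturationAnalyzer
type engine interface{ plan(decision) decision }     // UnlimitedEngine
type actuator interface{ apply(decision) }           // DirectActuator

// tick runs one autoscaler interval end to end.
func tick(c collector, a analyzer, e engine, act actuator) {
	act.apply(e.plan(a.analyze(c.collect())))
}
```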

Closes #994. Part of #696.


Commits

| Commit | Description |
| --- | --- |
| 534a41c | docs: autoscaler CLI wiring design spec |
| 3cfb9c9 | docs: autoscaler CLI wiring implementation plan |
| 9d20da9 | feat(bundle): add node_pools and autoscaler sections to PolicyBundle |
| b6dbe93 | feat(cluster): wire default autoscaler pipeline in NewClusterSimulator |
| 8233f29 | feat(cmd): wire --model-autoscaler-interval-us flag and policy-config bundle conversion |
| 3375856 | fix(autoscaler): three end-to-end bugs + demo + gitignore for superpowers |

Bugs Fixed (commit 3375856)

Bug 1 — DirectActuator.scaleUp was a stub

scaleUp called PlaceInstance() (GPU reservation) but never constructed InstanceSimulator objects. Scale-up was logged as successful but produced only phantom GPU reservations — no instance ever entered the routing pool, loaded, or served requests.

Fix: Rewrote scaleUp to mirror NodeReadyEvent.Execute() exactly: construct instance,
transition to Loading, register with CachedSnapshotProvider, schedule loaded event, append to
cs.instances, init inFlightRequests, wire OnRequestDone callback.
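
A minimal sketch of the shape of the fix, using stand-in types (the real signatures in direct_actuator.go differ; the step comments follow the list above):

```go
package sketch

// Stand-ins for the real simulator types; names follow the PR text,
// but the shapes are assumptions, not the project's actual API.
type InstanceSimulator struct {
	ID            string
	Model         string
	Loading       bool
	OnRequestDone func(requestID string)
}

type ClusterSimulator struct {
	instances        []*InstanceSimulator
	inFlightRequests map[string]int
}

// scaleUpSketch mirrors the node-ready creation path step by step.
func (cs *ClusterSimulator) scaleUpSketch(
	model, id string, now int64,
	registerSnapshot func(*InstanceSimulator),      // CachedSnapshotProvider.AddInstance
	scheduleLoaded func(*InstanceSimulator, int64), // loaded-event scheduling
	onDone func(requestID string),                  // completion callback
) {
	inst := &InstanceSimulator{ID: id, Model: model} // 1. construct instance
	inst.Loading = true                              // 2. transition to Loading
	registerSnapshot(inst)                           // 3. register with snapshot provider
	scheduleLoaded(inst, now)                        // 4. schedule loaded event
	cs.instances = append(cs.instances, inst)        // 5. enter the routing pool
	if cs.inFlightRequests == nil {
		cs.inFlightRequests = make(map[string]int)
	}
	cs.inFlightRequests[inst.ID] = 0 // 6. init in-flight accounting
	inst.OnRequestDone = onDone      // 7. wire OnRequestDone
}
```

Extracting this sequence into a helper shared with NodeReadyEvent.Execute() would remove the dual-maintenance risk flagged in the note below.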

Maintenance note: DirectActuator.scaleUp and NodeReadyEvent.Execute() are now parallel
paths. Any new field added to the node-ready creation path must be mirrored here.

Bug 2 — scheduleNextTick had no termination guard

ScalingTickEvent unconditionally rescheduled itself. On request-bounded runs (Horizon == math.MaxInt64), the cluster event queue never drains → infinite loop at 100%+ CPU.

Fix: Stop self-scheduling when Horizon == math.MaxInt64 && pendingArrivals == 0 && inFlight == 0. Added a pendingArrivals int field to ClusterSimulator, incremented in Run() per arrival push and decremented in ClusterArrivalEvent.Execute().
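
The guard's shape as a sketch, with assumed field names:

```go
package sketch

import "math"

// Stand-in for the termination-relevant simulator state; the field
// names follow the fix description, the struct itself is an assumption.
type tickState struct {
	horizon         int64 // math.MaxInt64 on request-bounded runs
	pendingArrivals int   // ClusterArrivalEvents still in the queue
	inFlight        int   // requests currently being served
}

// stopTicking reports whether ScalingTickEvent should stop
// rescheduling itself: only on request-bounded runs, and only once
// no arrivals remain pending and nothing is in flight.
func (s tickState) stopTicking() bool {
	return s.horizon == math.MaxInt64 &&
		s.pendingArrivals == 0 &&
		s.inFlight == 0
}
```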

Bug 3 — promoteDeferred corrupted pendingArrivals

Deferred batch/background requests were re-injected via promoteDeferred() as new
ClusterArrivalEvents without incrementing pendingArrivals. Each re-injected event decremented
the counter on execution, driving it to -N for N deferred requests. The Bug 2 guard checks
== 0 exactly, which is never true once the counter goes negative, so the hang persisted on any
workload with a batch/background cohort.

Fix: Increment pendingArrivals in promoteDeferred() per re-injected arrival.

Invariant: pendingArrivals == count of ClusterArrivalEvents currently in cs.clusterEvents.
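
One way to make this invariant structural rather than by-convention is a single push helper that every producer calls; a sketch with assumed types:

```go
package sketch

// Stand-ins for the real event types in cluster.go.
type clusterArrivalEvent struct{ atUs int64 }

type simState struct {
	pendingArrivals int
	clusterEvents   []clusterArrivalEvent
}

// pushArrival is the single choke point for injecting an arrival.
// If both Run() and promoteDeferred() call it, the invariant
// pendingArrivals == len(arrival events in queue) holds by construction.
func (s *simState) pushArrival(atUs int64) {
	s.clusterEvents = append(s.clusterEvents, clusterArrivalEvent{atUs: atUs})
	s.pendingArrivals++
}

// executeArrival models ClusterArrivalEvent.Execute() consuming one
// event and decrementing the counter.
func (s *simState) executeArrival() {
	s.clusterEvents = s.clusterEvents[1:]
	s.pendingArrivals--
}
```

Funneling both producers through one such helper would make the Bug 3 class of drift impossible to reintroduce.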


Demo

```bash
go build -o blis main.go

# Baseline: 1 fixed instance, no autoscaler
./blis run \
  --model meta-llama/Llama-2-7b-hf \
  --latency-model roofline --hardware A100-80 --tp 1 \
  --num-instances 1 \
  --workload-spec examples/regression_workload_load_spikes.yaml \
  --rate 30 --num-requests 3000

# With autoscaler (--log info shows scale events)
./blis run \
  --model meta-llama/Llama-2-7b-hf \
  --latency-model roofline --hardware A100-80 --tp 1 \
  --num-instances 1 \
  --policy-config examples/autoscaler-demo.yaml \
  --workload-spec examples/regression_workload_load_spikes.yaml \
  --rate 30 --num-requests 3000 --log info
```

Scale events observed:

```
[actuator] scale-up:   placed autoscale-...-000001 → demo-pool-1 (A100-80)
[actuator] scale-up:   placed autoscale-...-000002 → demo-pool-2 (A100-80)
[actuator] scale-up:   placed autoscale-...-000003 → demo-pool-3 (A100-80)
[actuator] scale-up:   placed autoscale-...-000004 → demo-pool-4 (A100-80)
[actuator] scale-down: draining autoscale-...-000001
[actuator] scale-up:   placed autoscale-...-000005 → demo-pool-5 (A100-80)
[actuator] scale-up:   placed autoscale-...-000006 → demo-pool-6 (A100-80)
[actuator] scale-up:   placed autoscale-...-000007 → demo-pool-7 (A100-80)
```

Results (3000 req, 30 req/s, Llama-2-7b, A100-80):

| Metric | Baseline | Autoscaler | Δ |
| --- | --- | --- | --- |
| Throughput | 23.1 req/s | 48.5 req/s | +110% |
| E2E mean | 49,474 ms | 32,008 ms | -35% |
| E2E p99 | 122,381 ms | 57,552 ms | -53% |
| TTFT mean | 41,825 ms | 25,029 ms | -40% |
| TTFT p99 | 115,336 ms | 45,891 ms | -60% |
| Preemptions | 84 | 12 | -86% |

Test plan

  • go test ./sim/cluster/... passes (43s, all tests green)
  • go build ./... passes
  • Baseline run completes without hanging
  • Autoscaler run completes without hanging (previously hung indefinitely)
  • Scale-up and scale-down events appear in --log info output
  • docs/superpowers/ does not appear in git status

🤖 Generated with Claude Code

vishakha-ramani and others added 7 commits April 9, 2026 10:40
Design document for wiring Phase 1A NodePools and Phase 1C-1b
autoscaler pipeline to the CLI via --policy-config YAML extension.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Add AutoscalerAnalyzerConfig to DeploymentConfig and effectiveAnalyzerConfig()
helper for WVA reference defaults. NewClusterSimulator now wires
DefaultCollector+V2SaturationAnalyzer+UnlimitedEngine+DirectActuator when
ModelAutoscalerIntervalUs > 0, so cmd/ only sets config fields.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
feat(cmd): wire --model-autoscaler-interval-us flag and policy-config bundle conversion

Reads node_pools and autoscaler sections from --policy-config YAML and
converts them into DeploymentConfig. CLI flag --model-autoscaler-interval-us
overrides the YAML value when explicitly set.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
fix(autoscaler): three end-to-end bugs + demo + gitignore for superpowers

Bug 1 — DirectActuator.scaleUp was a stub (direct_actuator.go):
Only called PlaceInstance() (GPU reservation) but never constructed
InstanceSimulator objects. Scale-up appeared to succeed but produced
phantom GPU reservations — no instance ever entered the routing pool,
loaded, or served requests. Rewrote scaleUp to mirror
NodeReadyEvent.Execute(): construct instance, transition to Loading,
register with CachedSnapshotProvider, schedule loaded event, append to
cs.instances, init inFlightRequests, wire OnRequestDone callback.

Bug 2 — scheduleNextTick had no termination guard (autoscaler.go):
ScalingTickEvent always rescheduled itself unconditionally. On
request-bounded runs (Horizon == MaxInt64), the cluster event queue
never empties → infinite event loop at 100%+ CPU. Added guard: stop
self-scheduling when Horizon == MaxInt64 && pendingArrivals == 0 &&
inFlight == 0. Added pendingArrivals int field to ClusterSimulator,
incremented in Run() per arrival push, decremented in
ClusterArrivalEvent.Execute().

Bug 3 — promoteDeferred corrupted pendingArrivals (cluster.go):
Deferred batch/background requests re-injected via promoteDeferred()
as new ClusterArrivalEvents without incrementing pendingArrivals. Each
re-injected event decremented the counter on execution, driving it to
-N for N deferred requests. The Bug 2 guard checks == 0 exactly —
never true when negative → hang persisted on workloads with batch
cohorts. Fixed by incrementing pendingArrivals per re-injected arrival
in promoteDeferred(), mirroring Run(). Invariant: pendingArrivals ==
count of ClusterArrivalEvents in cs.clusterEvents at all times.

Also adds:
- examples/autoscaler-demo.yaml: self-contained demo config with
  node_pools + autoscaler sections, tuned for regression_workload_load_spikes
- .gitignore: docs/superpowers/ (agent-local, not committed)
- cluster_test.go: remove orphaned merge conflict marker

Closes inference-sim#994

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@vishakha-ramani
Contributor Author

@claude /pr-review-toolkit:review-pr Please perform a thorough review of this PR with respect to the original issue.

Evaluate whether the PR fully and correctly addresses all requirements and reviewer concerns. In particular, assess:

  • Correctness and preservation of invariants
  • Separation of concerns and overall design discipline
  • Modularity and clarity of API boundaries/contracts
  • Behavioral integrity, including both behavioral and non-structural tests
  • Test coverage and quality (not just structure, but meaningful validation of behavior)
  • Performance implications and potential regressions
  • Adherence to our @docs/contributing/standards
  • Documentation quality, completeness, and accuracy (both user-facing and developer-facing)
  • All reviews and comments in this PR are addressed
Also identify any risks, edge cases, or missing considerations.

Finally, provide a clear verdict: Is this PR ready to merge? If not, what specific changes are required?

@claude

claude Bot commented Apr 9, 2026

Claude finished @vishakha-ramani's task in 8m 44s


PR Review: feat(autoscaler): Phase 1C CLI wiring + three end-to-end bug fixes + demo

Progress:

  • Initialize review
  • Gather context (read changed files, standards, invariants)
  • Analyze correctness, invariants, test coverage, design discipline
  • Post final verdict

Summary

This PR wires the Phase 1C autoscaler pipeline into blis run, fixes three real bugs that caused infinite loops, and adds a demo. CI passes (build + all tests green). The implementation is mostly sound, but there are two issues that should be addressed before merging.


Critical Issues

1. docs/superpowers/ files are committed — contradicts the gitignore entry

The PR adds docs/superpowers/ to .gitignore with the comment "agent-local, not committed." But the same PR commits two files under that path:

  • docs/superpowers/plans/2026-04-08-autoscaler-cli-wiring.md (744 lines)
  • docs/superpowers/specs/2026-04-08-autoscaler-cli-wiring-design.md (136 lines)

Once tracked, .gitignore has no effect on them. These files are now permanently in the repo's git history. The stated intent (agent-local artifacts should not be committed) contradicts the actual diff.

Required fix: Either git rm --cached docs/superpowers/plans/... docs/superpowers/specs/... to untrack them, or change the gitignore comment to reflect that these are intentional design docs. The PR description says "Adds docs/superpowers/ to .gitignore (agent-local artifacts, not committed)" — but they are committed.


2. No regression tests for Bug 2 and Bug 3

The PR description identifies these as critical correctness bugs (infinite loops), but no automated regression test verifies the fixes. Per CLAUDE.md BDD/TDD standards, critical bugs must have companion invariant tests.

Missing tests:

  • Bug 2: ModelAutoscalerIntervalUs > 0 on a request-bounded run (Horizon == math.MaxInt64) must terminate.
  • Bug 3 (combined): ModelAutoscalerIntervalUs > 0 + batch/background requests in the workload → pendingArrivals must not go negative, and the run must terminate. Without this, a future modification to promoteDeferred could silently re-introduce the hang.

The two existing tests (TestNewClusterSimulator_AutoscalerWiredWhenEnabled, TestNewClusterSimulator_AutoscalerNilWhenDisabled) only check that the pipeline is wired; they don't validate termination behavior.

Required fix: Add at minimum one test that runs NewClusterSimulator with ModelAutoscalerIntervalUs > 0 and a batch/background request, calls Run(), and asserts it completes without hanging (with a timeout guard in the test, e.g., via a goroutine + time.After).
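
A minimal sketch of that timeout-guard pattern; runWithTimeout is a hypothetical helper, and the real tests would pass a closure that builds the simulator with ModelAutoscalerIntervalUs > 0 and calls Run():

```go
package sketch

import (
	"testing"
	"time"
)

// runWithTimeout fails the test if fn does not return within d.
// Hypothetical helper illustrating the goroutine + time.After guard;
// note that a hanging fn leaks its goroutine, which is acceptable in
// a regression test that is about to fail anyway.
func runWithTimeout(t *testing.T, d time.Duration, fn func()) {
	t.Helper()
	done := make(chan struct{})
	go func() {
		defer close(done)
		fn()
	}()
	select {
	case <-done:
		// fn returned: the termination guard held
	case <-time.After(d):
		t.Fatal("simulation did not terminate within timeout")
	}
}
```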



Important Issues

3. Bundle loaded twice in cmd/root.go

LoadPolicyBundle is called in resolvePolicies() (line 804) and again in runCmd.RunE (line 1398). The comment acknowledges this: "Bundle was already loaded and validated in resolvePolicies(); safe to reload here." This works, but it's fragile — if the file is replaced between the two reads (unlikely but possible), the two loads see different data. A cleaner approach would be to have resolvePolicies() return the bundle as a second return value, or cache the bundle in a package-level variable after validation.

This is not blocking for merge, but the duplicate load should be tracked.
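
A sketch of the first option, with stub types standing in for the real ones (the actual signatures in cmd/root.go differ):

```go
package sketch

// Stub types standing in for sim.PolicyBundle and sim.ScorerConfig.
type policyBundle struct{ /* fields elided */ }
type scorerConfig struct{}

// loadPolicyBundle stands in for the real loader.
func loadPolicyBundle(path string) (*policyBundle, error) { return &policyBundle{}, nil }

func (b *policyBundle) validate() error          { return nil }
func (b *policyBundle) scorers() []scorerConfig  { return nil }

// resolvePolicies loads and validates exactly once, then hands the
// bundle to the caller so runCmd.RunE never reloads the file.
func resolvePolicies(path string) (*policyBundle, []scorerConfig, error) {
	b, err := loadPolicyBundle(path)
	if err != nil {
		return nil, nil, err
	}
	if err := b.validate(); err != nil {
		return nil, nil, err
	}
	return b, b.scorers(), nil
}
```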

4. AnalyzerBundleConfig threshold values not validated in bundle.Validate()

Validate() checks interval_us, cooldowns, and actuation delay, but does not validate the analyzer thresholds (kv_cache_threshold, scale_up_threshold, scale_down_boundary, avg_input_tokens). Invalid values (e.g., kv_cache_threshold: 2.0) pass bundle validation silently, then cause a panic in NewV2SaturationAnalyzer() at runtime (after the simulation has started). This violates R3: validate at CLI boundary.

The effectiveAnalyzerConfig() defaults protect zero-value fields, so only explicitly-set invalid values reach the panic path. But per R3, the CLI boundary should catch this, not a library panic.
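
A sketch of what such validation could look like; the field names follow the review text, and the legal ranges shown are assumptions:

```go
package sketch

import "fmt"

// Stand-in for the bundle's analyzer section; the struct shape and
// the exact legal ranges are assumptions, not the real definitions.
type analyzerBundleConfig struct {
	KVCacheThreshold float64
	ScaleUpThreshold float64
}

// validate applies the zero-means-default sentinel: zero fields are
// filled later by effectiveAnalyzerConfig, so only explicitly set
// values are range-checked here, at the CLI boundary (R3), instead of
// panicking later in NewV2SaturationAnalyzer.
func (a analyzerBundleConfig) validate() error {
	if a.KVCacheThreshold != 0 && (a.KVCacheThreshold < 0 || a.KVCacheThreshold > 1) {
		return fmt.Errorf("autoscaler.analyzer.kv_cache_threshold must be in (0, 1], got %v", a.KVCacheThreshold)
	}
	if a.ScaleUpThreshold != 0 && a.ScaleUpThreshold < 0 {
		return fmt.Errorf("autoscaler.analyzer.scale_up_threshold must be positive, got %v", a.ScaleUpThreshold)
	}
	return nil
}
```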

5. DirectActuator.scaleUp and NodeReadyEvent.Execute() are dual maintenance paths

The PR description explicitly flags this: "Any new field added to the node-ready creation path must be mirrored here." This is a latent R4 (construction site audit) violation waiting to happen. The two paths are already subtly different in one place: NodeReadyEvent uses cs.config.Model for inst.Model, while scaleUp uses d.ModelID. This is intentional (multi-model support), but it means the two paths cannot be mechanically kept in sync.

A comment co-locating them for easier auditability would help, or a shared helper function. Not blocking.


Positive Observations

  • Bug fixes are correct and precise: pendingArrivals is incremented in the right places (Run() on push, promoteDeferred() on re-injection, decremented in ClusterArrivalEvent.Execute()). The termination guard in scheduleNextTick correctly requires both pendingArrivals == 0 AND inFlight == 0.
  • scaleUp mirrors the NodeReadyEvent path faithfully: CacheQueryFn registration, OnRequestDone callback wiring, CachedSnapshotProvider.AddInstance, cs.instances append, inFlightRequests init — all present.
  • Bundle validation is comprehensive: Interval, cooldowns, actuation delay, node pool constraints (name, gpu_type, gpus_per_node, gpu_memory_gib, max/initial_nodes relationship) all validated.
  • Backward compatibility preserved: ModelAutoscalerIntervalUs == 0 disables autoscaler (INV-6). Existing policy YAML without autoscaler section parses cleanly (TestLoadPolicyBundle_AutoscalerAbsent_IsZero).
  • DelaySpec units are consistent: Bundle actuation_delay.mean is in seconds, and DelaySpec.Sample() converts to µs (see the sketch after this list). The autoscaler-demo.yaml mean: 2.0 = 2 s is correct.
  • Demo YAML is well-documented with a tuning guide and clear expected scale events.
  • CI passes (build + all tests green including 43s cluster test suite).
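
A sketch of that unit convention (the function name is hypothetical):

```go
package sketch

// delayMeanUs converts the bundle's actuation_delay.mean, given in
// seconds, to the simulator's microsecond clock.
func delayMeanUs(meanSeconds float64) int64 {
	return int64(meanSeconds * 1e6) // e.g. mean: 2.0 → 2,000,000 µs
}
```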

Verdict: Not Ready to Merge

Blocking:

  1. Remove docs/superpowers/ files from git tracking (contradicts gitignore intent and CLAUDE.md agent trust boundaries).
  2. Add termination regression tests for Bug 2 + Bug 3 (CLAUDE.md BDD/TDD requirement; critical bugs need companion invariant tests).

Non-blocking (should be tracked):
3. Bundle loaded twice in cmd/root.go — clean up or document.
4. Add analyzer config range validation to bundle.Validate() (R3).
5. Consider extracting shared instance-construction helper to reduce scaleUp/NodeReadyEvent drift.

…gression tests, validate analyzer config, eliminate double bundle load

- `git rm --cached` the two docs/superpowers/ files that were inadvertently
  committed; .gitignore already lists the directory as agent-local
- Add TestAutoscaler_RequestBoundedRun_Terminates and
  TestAutoscaler_WithBatchRequests_Terminates as regression tests for
  Bug 2 (infinite tick loop) and Bug 3 (pendingArrivals going negative
  from deferred batch re-injection); both must complete in ≤2s
- Add range validation for AnalyzerBundleConfig thresholds in
  bundle.Validate() — catches invalid scale_up_threshold,
  scale_down_boundary, avg_input_tokens, and kv_cache_threshold at the
  CLI boundary (R3) before they reach NewV2SaturationAnalyzer panics;
  add TestPolicyBundle_Validate_AnalyzerThresholds table test
- Refactor resolvePolicies() to return (*sim.PolicyBundle, []sim.ScorerConfig)
  so runCmd.RunE can reuse the loaded+validated bundle instead of calling
  LoadPolicyBundle a second time; replay.go discards the bundle (no autoscaler)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@vishakha-ramani
Contributor Author

@claude /pr-review-toolkit:review-pr Please perform a thorough review of this PR with respect to the original issue.

Evaluate whether the PR fully and correctly addresses all requirements and reviewer concerns. In particular, assess:

  • Correctness and preservation of invariants
  • Separation of concerns and overall design discipline
  • Modularity and clarity of API boundaries/contracts
  • Behavioral integrity, including both behavioral and non-structural tests
  • Test coverage and quality (not just structure, but meaningful validation of behavior)
  • Performance implications and potential regressions
  • Adherence to our @docs/contributing/standards
  • Documentation quality, completeness, and accuracy (both user-facing and developer-facing)
  • All reviews and comments in this PR are addressed
Also identify any risks, edge cases, or missing considerations.

Finally, provide a clear verdict: Is this PR ready to merge? If not, what specific changes are required?

@claude

claude Bot commented Apr 9, 2026

Claude finished @vishakha-ramani's task in 8m 56s


PR Review: feat(autoscaler): Phase 1C CLI wiring + three end-to-end bug fixes + demo

Progress:

  • Initialize review
  • Gather context (read changed files, check latest commits, verify previous review fixes)
  • Analyze correctness, invariants, test coverage, design discipline
  • Post final verdict

Status of Previous Review's Blocking Issues

The latest commit (97a1640) addresses all four issues raised in the previous review:

Previously Blocking — now resolved:

| # | Issue | Fix |
| --- | --- | --- |
| 1 | docs/superpowers/ files committed (contradicted gitignore) | git rm --cached removed both files; the directory no longer exists on disk |
| 2 | No regression tests for Bug 2 and Bug 3 | TestAutoscaler_RequestBoundedRun_Terminates + TestAutoscaler_WithBatchRequests_Terminates added with 2s timeout guards |
| 3 | Bundle loaded twice in cmd/root.go | resolvePolicies() now returns (*sim.PolicyBundle, []sim.ScorerConfig); runCmd.RunE reuses the loaded bundle; replay.go discards it with _ |
| 4 | AnalyzerBundleConfig thresholds not validated (R3 violation) | Comprehensive range validation added in bundle.Validate() with TestPolicyBundle_Validate_AnalyzerThresholds table test |

Fresh Review of All Changes

Correctness

Bug fixes verified:

  • Bug 2 (infinite tick): scheduleNextTick guard at autoscaler.go:284 correctly stops self-scheduling when Horizon == math.MaxInt64 && pendingArrivals == 0 && inFlight == 0. Logic is sound.
  • Bug 3 (negative pendingArrivals): pendingArrivals is now incremented in exactly two places: Run() (line 513) per initial arrival push, and promoteDeferred() (line 1083) per re-injected arrival. Decremented in ClusterArrivalEvent.Execute() (line 106). The counter correctly tracks the invariant: pendingArrivals == count of ClusterArrivalEvents currently in cs.clusterEvents.
  • Bug 1 (scaleUp stub): DirectActuator.scaleUp now mirrors NodeReadyEvent.Execute() faithfully: construct instance → TransitionTo(Loading) → csp.AddInstance → scheduleInstanceLoadedEvent → cs.instances append → cs.inFlightRequests init → cacheQueryFn registration → OnRequestDone wiring.

Analyzer threshold validation is consistent with panics: bundle.Validate() uses the zero-means-default sentinel (a.KVCacheThreshold != 0 && invalid) matching effectiveAnalyzerConfig's zero-fill logic. Explicitly-set invalid values are caught at the CLI boundary (R3) before reaching NewV2SaturationAnalyzer's panics.

INV-1 preservation: TestAutoscaler_WithBatchRequests_Terminates includes a partial conservation check (completed + queued + running + dropped + timedout == len(base)). The missing fields (RoutingRejections, GatewayQueueDepth, GatewayQueueShed, DeferredHorizonInterrupted) are all zero in the test setup (no gateway queue, request-bounded run, deferred requests fully promoted). The partial check is appropriate for this regression guard.

INV-6 preserved: ModelAutoscalerIntervalUs == 0 leaves cs.autoscaler = nil; existing tests without autoscaler config are unaffected.

Minor Findings (non-blocking)

Variable shadowing in direct_actuator.go:90:

```go
for i := 0; i < d.Delta; i++ {          // outer i
    ...
    for i := range cs.config.NodePools { // inner i shadows outer
```

Session follow-ups not tracked in pendingArrivals:
OnRequestDone closures in both NewClusterSimulator and scaleUp push ClusterArrivalEvents without incrementing pendingArrivals. This matches the pre-existing behavior in NewClusterSimulator (line 441) and the PD path (lines 899, 924). If autoscaler + session mode are combined, the autoscaler could stop ticking slightly early (when the last initial request completes, before follow-ups start). This doesn't cause hangs or incorrect results — it's an edge case for an unsupported combination. Consistent behavior across all three construction paths.

Design Discipline

  • resolvePolicies() signature change is clean — no data duplication, replay correctly discards bundle.
  • effectiveAnalyzerConfig() is correctly placed as a pure function in cluster.go (not in cmd/), accessible to tests.
  • The NodePoolBundleConfig/DelayBundleSpec mirror types in sim/bundle.go are the right pattern to avoid sim → sim/cluster circular import. Commented accordingly.
  • --model-autoscaler-interval-us flag override takes precedence over bundle value when cmd.Flags().Changed(...) is set — correct R18 (CLI flags override YAML) behavior.
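
A sketch of that precedence check; pflag's Changed reports whether the user set the flag explicitly, while the surrounding function and its names are hypothetical:

```go
package sketch

import "github.com/spf13/cobra"

// effectiveIntervalUs resolves the autoscaler interval: an explicitly
// set CLI flag wins over the bundle value (R18); otherwise a positive
// bundle value applies; otherwise the flag default (0, which disables
// the autoscaler). Sketch only, not the project's actual code.
func effectiveIntervalUs(cmd *cobra.Command, flagVal, bundleVal int64) int64 {
	if cmd.Flags().Changed("model-autoscaler-interval-us") {
		return flagVal // user set the flag explicitly: CLI over YAML
	}
	if bundleVal > 0 {
		return bundleVal
	}
	return flagVal
}
```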

Test Coverage

New tests verify:

  • TestLoadPolicyBundle_AutoscalerSection — round-trip parse of autoscaler YAML fields
  • TestLoadPolicyBundle_NodePoolsSection — round-trip parse of node pool YAML fields
  • TestLoadPolicyBundle_AutoscalerAbsent_IsZero — backward compat: bundle without autoscaler section parses cleanly
  • TestPolicyBundle_Validate_AutoscalerNegativeInterval — negative interval rejected
  • TestPolicyBundle_Validate_AnalyzerThresholds (table, 9 cases) — CLI boundary validation
  • TestPolicyBundle_Validate_NodePool_MissingName / _MaxLessThanInitial — node pool validation
  • TestNewClusterSimulator_AutoscalerWiredWhenEnabled / NilWhenDisabled — pipeline wiring
  • TestAutoscaler_RequestBoundedRun_Terminates — Bug 2 regression (2s timeout guard)
  • TestAutoscaler_WithBatchRequests_Terminates + INV-1 check — Bug 3 regression (2s timeout guard)
  • TestRunCmd_ModelAutoscalerIntervalUs_FlagRegistered — flag registered with correct default

Coverage is thorough for the new surfaces. All tests in sim/cluster/... pass (CI green).

Documentation

examples/autoscaler-demo.yaml is well-documented with a tuning guide, expected log output markers, and clear comment about UnlimitedEngine ignoring inventory (honest about the limitation). Design spec and implementation plan were correctly removed from the tracked tree.


Verdict: Ready to Merge ✅

All four blocking issues from the prior review are resolved. CI passes (build + lint + all tests). The two minor new findings (variable shadowing, session+autoscaler edge case) are non-blocking — the logic is correct in both cases and the behavior is consistent with the rest of the codebase.


@vishakha-ramani
Contributor Author

Per the review request, the three bug fixes have been broken out into sub-issues:

| # | Bug | Sub-issue |
| --- | --- | --- |
| 1 | DirectActuator.scaleUp only reserved GPU slots, never created live instances | #1021 |
| 2 | scheduleNextTick had no termination guard — request-bounded runs hung indefinitely | #1022 |
| 3 | promoteDeferred did not increment pendingArrivals — termination guard silently broken for batch workloads | #1023 |

Each sub-issue contains a full description of the symptom, root cause, proposed fix (with before/after code), affected files, and the regression test that guards against recurrence.

The intent is to restructure this PR into three smaller PRs — one per bug fix — to make each change individually reviewable.
