[Data][Autoscaler][3/N] Ensure cluster autoscaler V2 scales nodes with GPUs#59366

Merged
bveeramani merged 21 commits into master from refactor-autoscaler-util
Dec 11, 2025

Conversation

@bveeramani
Member

@bveeramani bveeramani commented Dec 10, 2025

This PR enables the utilization-based autoscaler to work with GPUs.

Motivation

The current implementation considers only CPU and memory utilization when making scaling decisions. This rests on an outdated assumption that only actor pools use GPUs, so actor pool autoscaling could be relied on to trigger node scale-ups.

This assumption doesn't hold anymore. To fix deadlocks, #54902 made actor pool autoscaling respect resource budgets. As a result, the actor pool autoscaler can't implicitly trigger node autoscaling anymore.

Changes

This PR extends the autoscaler to track GPU utilization alongside CPU and memory:

  1. Extract resource utilization calculation into a dedicated abstraction (ResourceUtilizationGauge): This separates the concern of how utilization is measured from how it's used for scaling decisions. The abstraction makes the autoscaler more testable and opens the door for alternative utilization strategies (e.g., physical vs logical utilization, different averaging windows).
  2. Include GPU nodes in scaling decisions: Previously, GPU nodes were explicitly filtered out when determining what node types exist in the cluster. Now all worker node types are considered, allowing the autoscaler to request additional GPU nodes when needed.
  3. Add GPU utilization to the scaling threshold check: The autoscaler now triggers scale-up when any of CPU, GPU, or memory utilization exceeds the threshold, rather than only CPU or memory.
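The three changes above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `Utilization`, `UtilizationGauge`, `FixedGauge`, `should_scale_up`, `NodeSpec`, and `candidate_node_specs` are hypothetical names, the field layout is an assumption, and the threshold is left to the caller because the PR description does not state a value. The sketch only shows the shape of the idea: a gauge hides how utilization is measured, the threshold check looks at any of CPU, GPU, or memory, and GPU nodes are no longer filtered out of the candidate node types.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Utilization:
    """Fraction of each resource in use (hypothetical layout), each in [0, 1]."""

    cpu: float
    gpu: float
    memory: float


class UtilizationGauge(ABC):
    """Hypothetical stand-in for the PR's ResourceUtilizationGauge.

    It separates *how* utilization is measured from *how* it is used.
    """

    @abstractmethod
    def get(self) -> Utilization:
        """Return the current utilization snapshot."""


class FixedGauge(UtilizationGauge):
    """Test double that always reports a preset snapshot."""

    def __init__(self, utilization: Utilization) -> None:
        self._utilization = utilization

    def get(self) -> Utilization:
        return self._utilization


def should_scale_up(
    gauge: UtilizationGauge,
    threshold: float,
    resources: Sequence[str] = ("cpu", "gpu", "memory"),
) -> bool:
    """Scale up when ANY tracked resource exceeds the threshold."""
    snapshot = gauge.get()
    return any(getattr(snapshot, name) > threshold for name in resources)


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical description of one worker node type."""

    cpu: int
    gpu: int
    memory_bytes: int


def candidate_node_specs(
    worker_specs: Sequence[NodeSpec], include_gpu_nodes: bool = True
) -> List[NodeSpec]:
    """Return the node types the autoscaler may request.

    include_gpu_nodes=False mimics the old behavior of filtering out GPU
    nodes; the default keeps every worker node type.
    """
    return [s for s in worker_specs if include_gpu_nodes or s.gpu == 0]
```

With a snapshot such as `Utilization(cpu=0.1, gpu=0.9, memory=0.2)` and a threshold of `0.75`, a check restricted to `("cpu", "memory")` does not trigger a scale-up, while the default check that includes `"gpu"` does. Swapping `FixedGauge` for another `UtilizationGauge` subclass changes how utilization is measured without touching the scaling decision.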

Signed-off-by: Balaji Veeramani <[email protected]>
@bveeramani bveeramani requested a review from a team as a code owner December 10, 2025 23:30
@bveeramani bveeramani marked this pull request as draft December 10, 2025 23:30
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively refactors the cluster autoscaler to support GPU-based scaling. The introduction of a ResourceUtilizationCalculator is a solid design choice that improves modularity and testability. The changes to _NodeResourceSpec and the autoscaler logic correctly incorporate GPU resources, and the test suite has been enhanced to cover this new functionality. I have a few minor suggestions to improve code correctness and clarity.

bveeramani and others added 4 commits December 10, 2025 15:33
Signed-off-by: Balaji Veeramani <[email protected]>
…utoscaler_v2.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Balaji Veeramani <[email protected]>
…calculator.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Balaji Veeramani <[email protected]>
…calculator.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Balaji Veeramani <[email protected]>
@bveeramani bveeramani changed the title [Data] Ensure cluster autoscaler V2 scales nodes with GPUs [Data][Autoscaler][3/N] Ensure cluster autoscaler V2 scales nodes with GPUs Dec 10, 2025
…ject/ray into refactor-autoscaler-util

Signed-off-by: Balaji Veeramani <[email protected]>
@bveeramani bveeramani added the go add ONLY when ready to merge, run all tests label Dec 11, 2025
Base automatically changed from new-default-autoscaler to master December 11, 2025 05:21
@bveeramani bveeramani marked this pull request as ready for review December 11, 2025 06:59
@bveeramani bveeramani merged commit 9dec45d into master Dec 11, 2025
5 checks passed
@bveeramani bveeramani deleted the refactor-autoscaler-util branch December 11, 2025 06:59
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…h GPUs (ray-project#59366)

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: peterxcli <[email protected]>