[Data][Autoscaler][3/N] Ensure cluster autoscaler V2 scales nodes with GPUs by bveeramani · Pull Request #59366 · ray-project/ray

bveeramani · 2025-12-10T23:30:47Z

This PR enables the utilization-based autoscaler to work with GPUs.

Motivation

The current implementation only considers CPU and memory utilization when making scaling decisions. This is based on an outdated assumption that GPUs are only used by actor pools, and that actor pool autoscaling can be relied on to trigger node scale-ups.

This assumption doesn't hold anymore. To fix deadlocks, #54902 made actor pool autoscaling respect resource budgets. As a result, the actor pool autoscaler can't implicitly trigger node autoscaling anymore.

Changes

This PR extends the autoscaler to track GPU utilization alongside CPU and memory:

1. Extract resource utilization calculation into a dedicated abstraction (ResourceUtilizationGauge): This separates the concern of how utilization is measured from how it's used for scaling decisions. The abstraction makes the autoscaler more testable and opens the door for alternative utilization strategies (e.g., physical vs logical utilization, different averaging windows).
2. Include GPU nodes in scaling decisions: Previously, GPU nodes were explicitly filtered out when determining what node types exist in the cluster. Now all worker node types are considered, allowing the autoscaler to request additional GPU nodes when needed.
3. Add GPU utilization to the scaling threshold check: The autoscaler now triggers scale-up when any of CPU, GPU, or memory utilization exceeds the threshold, rather than only CPU or memory.
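
The three changes above can be illustrated with a minimal sketch. This is not the code from the PR: the names `Utilization`, `UtilizationGauge`, `FixedGauge`, `NodeType`, `should_scale_up`, and `scalable_node_types` are hypothetical stand-ins, the 0.75 threshold in the usage note is an arbitrary example value, and the strict `>` comparison is an assumption based on the word "exceeds".

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Utilization:
    """Fraction of each resource in use, each expected to be in [0, 1]."""

    cpu: float
    gpu: float
    memory: float


class UtilizationGauge(Protocol):
    """Hypothetical stand-in for the PR's ResourceUtilizationGauge.

    The autoscaler only asks "what is the utilization right now?" and does
    not care how the number was measured or averaged.
    """

    def get(self) -> Utilization:
        ...


@dataclass
class FixedGauge:
    """Test double that always reports the same utilization."""

    value: Utilization

    def get(self) -> Utilization:
        return self.value


@dataclass(frozen=True)
class NodeType:
    """Hypothetical description of a node type in the cluster config."""

    name: str
    resources: Dict[str, float]
    is_head: bool = False


def should_scale_up(gauge: UtilizationGauge, threshold: float) -> bool:
    """Scale up when any of CPU, GPU, or memory utilization exceeds the threshold."""
    u = gauge.get()
    return max(u.cpu, u.gpu, u.memory) > threshold


def scalable_node_types(node_types: List[NodeType]) -> List[NodeType]:
    """Consider every worker node type, including those with GPUs.

    In this sketch the only node types skipped are head nodes; GPU node
    types are no longer filtered out.
    """
    return [n for n in node_types if not n.is_head]
```

With a fake gauge, the scaling decision can be checked without a running cluster. For example, `should_scale_up(FixedGauge(Utilization(cpu=0.1, gpu=0.9, memory=0.2)), threshold=0.75)` returns `True` because GPU utilization alone exceeds the threshold, which is the case the previous CPU-and-memory-only check would have missed.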

Signed-off-by: Balaji Veeramani <[email protected]>

gemini-code-assist

Code Review

This pull request effectively refactors the cluster autoscaler to support GPU-based scaling. The introduction of a ResourceUtilizationCalculator is a solid design choice that improves modularity and testability. The changes to _NodeResourceSpec and the autoscaler logic correctly incorporate GPU resources, and the test suite has been enhanced to cover this new functionality. I have a few minor suggestions to improve code correctness and clarity.

python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py

python/ray/data/_internal/cluster_autoscaler/resource_utilization_gauge.py

python/ray/data/_internal/cluster_autoscaler/resource_utility_calculator.py

bveeramani added 7 commits December 10, 2025 11:05

Initial commit

9b7fe22

Signed-off-by: Balaji Veeramani <[email protected]>

Remove dead file

3e536c0

Signed-off-by: Balaji Veeramani <[email protected]>

Rename environment variables

f787584

Signed-off-by: Balaji Veeramani <[email protected]>

Appease lint

a4e1c45

Signed-off-by: Balaji Veeramani <[email protected]>

Initial commit

8300379

Signed-off-by: Balaji Veeramani <[email protected]>

Address review comments

ddd8410

Signed-off-by: Balaji Veeramani <[email protected]>

Initial commit

e850c42

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani requested a review from a team as a code owner December 10, 2025 23:30

bveeramani marked this pull request as draft December 10, 2025 23:30

gemini-code-assist bot reviewed Dec 10, 2025

bveeramani and others added 4 commits December 10, 2025 15:33

Update test

6b2b61e

Signed-off-by: Balaji Veeramani <[email protected]>

Update python/ray/data/_internal/cluster_autoscaler/default_cluster_a…

9b8e1e4

…utoscaler_v2.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Balaji Veeramani <[email protected]>

Update python/ray/data/_internal/cluster_autoscaler/resource_utility_…

fcdf444

…calculator.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Balaji Veeramani <[email protected]>

Update python/ray/data/_internal/cluster_autoscaler/resource_utility_…

11a3a7b

…calculator.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani changed the title from "[Data] Ensure cluster autoscaler V2 scales nodes with GPUs" to "[Data][Autoscaler][3/N] Ensure cluster autoscaler V2 scales nodes with GPUs" Dec 10, 2025

bveeramani added 2 commits December 10, 2025 15:57

Add test to BAZEL file

06825ce

Signed-off-by: Balaji Veeramani <[email protected]>

Merge branch 'new-default-autoscaler' into refactor-autoscaler-util

edefb20

Signed-off-by: Balaji Veeramani <[email protected]>

iamjustinhsu approved these changes Dec 11, 2025

bveeramani added 6 commits December 10, 2025 16:28

Merge branch 'refactor-autoscaler-util' of https://github.com/ray-pro…

8f79567

…ject/ray into refactor-autoscaler-util Signed-off-by: Balaji Veeramani <[email protected]>

Resolve merge conflicts

ea0ef90

Signed-off-by: Balaji Veeramani <[email protected]>

Merge branch 'new-default-autoscaler' into refactor-autoscaler-util

8e01ef5

Signed-off-by: Balaji Veeramani <[email protected]>

Fix tests

84de935

Signed-off-by: Balaji Veeramani <[email protected]>

Update docstring

55291fd

Signed-off-by: Balaji Veeramani <[email protected]>

Rename files

d20ab28

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani mentioned this pull request Dec 11, 2025

[Data] Ray keeps adding nodes beyond Dataset.map concurrency #52573

Closed

bveeramani added the "go" label (add ONLY when ready to merge, run all tests) Dec 11, 2025

Base automatically changed from new-default-autoscaler to master December 11, 2025 05:21

bveeramani added 2 commits December 10, 2025 21:22

Merge branch 'master' into refactor-autoscaler-util

ced7d85

Signed-off-by: Balaji Veeramani <[email protected]>

Resolve merge conflicts

cc872db

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani marked this pull request as ready for review December 11, 2025 06:59

bveeramani merged commit 9dec45d into master Dec 11, 2025
5 checks passed

bveeramani deleted the refactor-autoscaler-util branch December 11, 2025 06:59

bveeramani mentioned this pull request Dec 16, 2025

[Data] GPU actor pools don't autoscale on an autoscaling cluster #59144

Closed

bveeramani merged 21 commits into master from refactor-autoscaler-util
