[Data][Autoscaler][3/N] Ensure cluster autoscaler V2 scales nodes with GPUs #59366
Merged: bveeramani merged 21 commits into master on Dec 11, 2025
Conversation
Signed-off-by: Balaji Veeramani <[email protected]>
Contributor
Code Review
This pull request effectively refactors the cluster autoscaler to support GPU-based scaling. The introduction of a ResourceUtilizationCalculator is a solid design choice that improves modularity and testability. The changes to _NodeResourceSpec and the autoscaler logic correctly incorporate GPU resources, and the test suite has been enhanced to cover this new functionality. I have a few minor suggestions to improve code correctness and clarity.
python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py
python/ray/data/_internal/cluster_autoscaler/resource_utilization_gauge.py
python/ray/data/_internal/cluster_autoscaler/resource_utility_calculator.py
iamjustinhsu approved these changes on Dec 11, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request on Feb 25, 2026: "[Data][Autoscaler][3/N] Ensure cluster autoscaler V2 scales nodes with GPUs (ray-project#59366)"
This PR enables the utilization-based autoscaler to work with GPUs.
Motivation
The current implementation only considers CPU and memory utilization when making scaling decisions. This is based on an outdated assumption that GPUs are only used by actor pools, and we can use actor pool autoscaling to trigger node scale ups.
This assumption no longer holds. To fix deadlocks, #54902 made actor pool autoscaling respect resource budgets. As a result, the actor pool autoscaler can no longer implicitly trigger node autoscaling.
Changes
This PR extends the autoscaler to track GPU utilization alongside CPU and memory:

1. Extract resource utilization calculation into a dedicated abstraction (`ResourceUtilizationGauge`): this separates the concern of how utilization is measured from how it's used for scaling decisions. The abstraction makes the autoscaler more testable and opens the door for alternative utilization strategies (e.g., physical vs. logical utilization, different averaging windows).
2. Include GPU nodes in scaling decisions: previously, GPU nodes were explicitly filtered out when determining what node types exist in the cluster. Now all worker node types are considered, allowing the autoscaler to request additional GPU nodes when needed.
3. Add GPU utilization to the scaling threshold check: the autoscaler now triggers scale-up when any of CPU, GPU, or memory utilization exceeds the threshold, rather than only CPU or memory.
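The separation and the "any resource over threshold" check described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the `ResourceUtilization` dataclass, the method name `get_utilization`, and the default threshold value are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class ResourceUtilization:
    """Per-resource utilization as fractions in [0.0, 1.0] (illustrative)."""
    cpu: float
    gpu: float
    memory: float


class ResourceUtilizationGauge:
    """Hypothetical interface: how utilization is measured is hidden behind
    this abstraction, so the autoscaler only depends on the returned values."""

    def get_utilization(self) -> ResourceUtilization:
        raise NotImplementedError


def should_scale_up(util: ResourceUtilization, threshold: float = 0.8) -> bool:
    # Trigger scale-up when ANY of CPU, GPU, or memory utilization exceeds
    # the threshold -- before this PR, only CPU and memory were checked.
    return (
        util.cpu > threshold
        or util.gpu > threshold
        or util.memory > threshold
    )
```

With this split, tests can feed a stub gauge returning fixed `ResourceUtilization` values instead of standing up a cluster, and alternative measurement strategies (physical vs. logical utilization, different averaging windows) become drop-in gauge implementations.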