[Data] - Fix GPU autoscaling if max_actors is set #59632

Merged
bveeramani merged 6 commits into ray-project:master from goutamvenkat-anyscale:goutam/fix_gpu_autoscaling
Dec 29, 2025

Conversation

@goutamvenkat-anyscale
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale commented Dec 23, 2025

Description

Previously, GPU allocation was special-cased in ReservationOpResourceAllocator:

  1. GPU operators got all available GPUs (`limits.gpu - op_usage.gpu`) regardless of their `max_resource_usage`
  2. The check `max_resource_usage != inf() and max_resource_usage.gpu > 0` failed for unbounded actor pools (`max_size=None`), causing them to get zero GPU budget
  3. GPU was stripped from the remaining shared resources (`.copy(gpu=0)`)

This caused a bug where `ActorPoolStrategy(min_size=1, max_size=None)` with GPU actors couldn't autoscale beyond the initial actor.
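To illustrate the failure mode, here is a minimal stand-in model of the old special-case (the `Resources` class and `old_gpu_budget` helper are hypothetical, not the actual Ray Data source):

```python
import math
from dataclasses import dataclass

# Hypothetical stand-in for Ray Data's ExecutionResources; illustrative only.
@dataclass(frozen=True)
class Resources:
    cpu: float = 0.0
    gpu: float = 0.0

INF = Resources(cpu=math.inf, gpu=math.inf)

def old_gpu_budget(max_resource_usage: Resources, limit_gpu: float, used_gpu: float) -> float:
    # Old special case: grant GPUs only when the op's max usage is finite AND needs GPU.
    if max_resource_usage != INF and max_resource_usage.gpu > 0:
        return max(limit_gpu - used_gpu, 0)
    return 0

# A bounded GPU actor pool (max_size set) passes the check...
print(old_gpu_budget(Resources(cpu=4, gpu=2), limit_gpu=4, used_gpu=1))  # 3
# ...but an unbounded pool (max_size=None) reports infinite max usage and gets 0,
# so it can never scale beyond its initial actor.
print(old_gpu_budget(INF, limit_gpu=4, used_gpu=1))  # 0
```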

Changes
task_pool_map_operator.py

  • Added a min_max_resource_requirements() method for consistency with ActorPoolMapOperator
  • Returns (min = resources for one task, max = max_concurrency * task resources, or infinite if no max_concurrency is set)
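A sketch of the described return value, using a hypothetical `Resources` stand-in rather than the actual Ray Data types:

```python
import math
from dataclasses import dataclass

# Hypothetical stand-in for Ray Data's ExecutionResources; illustrative only.
@dataclass(frozen=True)
class Resources:
    cpu: float = 0.0
    gpu: float = 0.0

def min_max_resource_requirements(task_resources, max_concurrency):
    # Minimum: enough resources to run a single task.
    min_usage = task_resources
    if max_concurrency is None:
        # No concurrency cap: the operator's maximum usage is unbounded.
        max_usage = Resources(cpu=math.inf, gpu=math.inf)
    else:
        # Bounded: at most max_concurrency tasks run at once.
        max_usage = Resources(
            cpu=task_resources.cpu * max_concurrency,
            gpu=task_resources.gpu * max_concurrency,
        )
    return min_usage, max_usage

lo, hi = min_max_resource_requirements(Resources(cpu=1, gpu=1), max_concurrency=4)
# lo is the single-task cost; hi scales it by the concurrency cap.
```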

resource_manager.py

  • Removed GPU special-casing entirely
  • GPU now flows through the same allocation path as CPU and memory
  • Operators are capped by their max_resource_usage for all resource types uniformly
  • Remaining shared resources (including GPU) go to unbounded downstream operators
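The unified path described above, where every resource type is capped the same way, can be sketched with a hypothetical helper (not the actual Ray code):

```python
# Each operator's budget is its share of allocated resources, capped by the
# headroom remaining to its max_resource_usage, applied identically to cpu,
# gpu, and memory (no GPU special case).
def capped_budget(allocated, usage, max_usage):
    return {
        k: min(allocated[k], max(max_usage[k] - usage[k], 0.0))
        for k in allocated
    }

budget = capped_budget(
    allocated={"cpu": 8.0, "gpu": 4.0},
    usage={"cpu": 2.0, "gpu": 1.0},
    # Bounded pool: max_actors * per-actor resources.
    max_usage={"cpu": 4.0, "gpu": 2.0},
)
# cpu: min(8, 4 - 2) = 2; gpu: min(4, 2 - 1) = 1
```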

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner December 23, 2025 17:27
@goutamvenkat-anyscale goutamvenkat-anyscale added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Dec 23, 2025
@goutamvenkat-anyscale
Contributor Author

/gemini summary

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes an issue with GPU autoscaling when max_actors is set by correctly calculating the maximum resource requirements in ActorPoolMapOperator. The change introduces logic to compute max_resource_usage based on max_actors, and falls back to infinite resources for unbounded pools. The accompanying tests are well-written and validate both the bounded and unbounded scenarios. I have one minor suggestion to improve code readability and reduce duplication.

@gemini-code-assist
Contributor

Summary of Changes

This pull request resolves an issue in the resource_manager where GPU autoscaling was inadvertently prevented due to max_resource_usage always being set to an infinite value, even when a max_actors limit was specified. The changes introduce a more accurate calculation for max_resource_usage, ensuring that it aligns with the configured max_actors for bounded actor pools, thereby enabling effective GPU autoscaling, while still correctly reporting infinite resources for unbounded pools.

Highlights

  • Corrected GPU Autoscaling Logic: The calculation for max_resource_usage now correctly reflects the max_actors limits instead of always being infinite, which previously hindered GPU autoscaling for bounded actor pools.
  • Conditional Max Resource Usage Calculation: Introduced logic to dynamically compute max_resource_usage: if max_actors is finite, resources are scaled by max_actors; otherwise, ExecutionResources.for_limits() (infinite resources) is returned.
  • Enhanced Test Coverage: Added a new test test_min_max_resource_requirements_unbounded to validate infinite resource reporting for unbounded pools and updated test_min_max_resource_requirements to assert finite resource limits for bounded pools.
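The conditional calculation described in the highlights can be sketched as follows (names are illustrative, not the actual source):

```python
import math

# Hypothetical sketch: for a bounded actor pool, max usage scales per-actor
# resources by max_actors; for an unbounded pool, report infinite maximums
# (analogous to what the summary calls ExecutionResources.for_limits()).
def max_resource_usage(per_actor: dict, max_actors: float) -> dict:
    if math.isinf(max_actors):
        # Unbounded pool: the allocator treats the operator as able to
        # absorb any remaining shared resources.
        return {k: math.inf for k in per_actor}
    # Bounded pool: scale per-actor resources by the actor cap.
    return {k: v * max_actors for k, v in per_actor.items()}

bounded = max_resource_usage({"cpu": 1.0, "gpu": 1.0}, max_actors=4)
unbounded = max_resource_usage({"cpu": 1.0, "gpu": 1.0}, max_actors=math.inf)
```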


Changelog
  • python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
    • Imported the math module to utilize math.isinf for checking infinite values.
    • Modified the min_max_resource_requirements method to dynamically calculate max_resource_usage based on max_actors. If max_actors is infinite, ExecutionResources.for_limits() is returned; otherwise, resources are scaled by max_actors.
  • python/ray/data/tests/test_actor_pool_map_operator.py
    • Updated test_min_max_resource_requirements to assert the correct finite max_resource_usage_bound when max_actors is specified.
    • Added a new test test_min_max_resource_requirements_unbounded to verify that unbounded actor pools correctly report infinite maximum resources.
Activity
  • The author, goutamvenkat-anyscale, requested a summary of the pull request.
  • A bot, gemini-code-assist[bot], provided a suggestion to refactor the object_store_memory calculation within the min_max_resource_requirements method for improved readability and reduced repetition.

@goutamvenkat-anyscale
Contributor Author

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request addresses a critical issue in Ray Data's resource management that previously prevented GPU autoscaling for actor pools with defined maximum limits. By eliminating special-casing for GPU allocation, the system now uniformly caps all resource types (CPU, GPU, memory) based on an operator's maximum resource usage. This ensures that bounded actor pools can correctly autoscale their GPU resources, while unbounded operators appropriately receive any remaining shared resources, leading to more predictable and efficient resource utilization.

Highlights

  • GPU Autoscaling Fix: Resolved a bug where GPU autoscaling was hindered for bounded actor pools due to incorrect max_resource_usage calculations, preventing them from scaling beyond initial actors.
  • Unified Resource Allocation: Removed special-casing for GPU allocation in the ReservationOpResourceAllocator, ensuring GPUs are now allocated through the same uniform path as CPU and memory.
  • Consistent Resource Requirements: Introduced a min_max_resource_requirements method in TaskPoolMapOperator for consistency, allowing it to report min/max resource needs similar to ActorPoolMapOperator.
  • Enhanced Test Coverage: Added new tests and updated existing ones to validate the corrected GPU allocation logic for both bounded and unbounded operators, ensuring proper autoscaling behavior.


Changelog
  • python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
    • Imported the math module to support math.isinf checks.
    • Modified min_max_resource_requirements to correctly calculate max_resource_usage based on max_actors, returning infinite resources only for truly unbounded pools.
  • python/ray/data/_internal/execution/operators/task_pool_map_operator.py
    • Added the min_max_resource_requirements method to provide consistent min/max resource reporting for task-based operators.
  • python/ray/data/_internal/execution/resource_manager.py
    • Removed the GPU special-casing logic from update_budgets, integrating GPU allocation into the general resource allocation flow.
    • Adjusted the handling of remaining shared resources for unbounded operators to include GPUs.
  • python/ray/data/tests/test_actor_pool_map_operator.py
    • Updated test_min_max_resource_requirements to use num_gpus and assert correct GPU and object store memory limits for bounded pools.
    • Added test_min_max_resource_requirements_unbounded to verify that unbounded actor pools correctly report infinite maximum resources.
  • python/ray/data/tests/test_resource_manager.py
    • Introduced mocks for min_max_resource_requirements in various tests to simulate operator behavior.
    • Updated test_gpu_allocation to reflect the new unified GPU allocation, asserting that both GPU and non-GPU operators receive appropriate GPU budgets.
    • Added _mem_op_internal and _mem_op_outputs mocks for more accurate resource manager testing.
    • Modified assertions in test_multiple_gpu_operators to align with capped GPU allocation behavior.
    • Added new tests test_gpu_unbounded_operator_can_autoscale and test_gpu_bounded_vs_unbounded_operators to specifically validate GPU autoscaling for different operator types.
Activity
  • The author, goutamvenkat-anyscale, requested a summary of the pull request.
  • A bot, gemini-code-assist[bot], provided an initial summary and suggested refactoring the object_store_memory calculation for improved readability.
  • The author, goutamvenkat-anyscale, requested another summary.

Member

@bveeramani bveeramani left a comment


Nice

@bveeramani bveeramani merged commit b2028df into ray-project:master Dec 29, 2025
6 checks passed
Comment on lines +987 to +988
```python
assert allocator._op_budgets[o2].gpu == 0
assert allocator._op_budgets[o3].gpu == 1
```
Contributor

Don't assert on internals; assert on state retrieved through public methods.

Comment on lines -882 to -902
```python
if (
    max_resource_usage != ExecutionResources.inf()
    and max_resource_usage.gpu > 0
):
    # If an operator needs GPU, we just allocate all GPUs to it.
    # TODO(hchen): allocate resources across multiple GPU operators.

    # The op_usage can be more than the global limit in the following cases:
    # 1. The op is setting a minimum concurrency that is larger than
    #    available num of GPUs.
    # 2. The cluster scales down, and the global limit decreases.
    target_num_gpu = max(
        limits.gpu - self._resource_manager.get_op_usage(op).gpu,
        0,
    )
else:
    target_num_gpu = 0

self._op_budgets[op] = (
    self._op_budgets[op].add(op_shared).copy(gpu=target_num_gpu)
)
```
Contributor

Why are we removing this?

This now allocates GPUs for non-GPU operators

cc @goutamvenkat-anyscale @bveeramani

AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

3 participants