[Data] - Fix GPU autoscaling if max_actors is set #59632

Merged
bveeramani merged 6 commits into ray-project:master from goutamvenkat-anyscale:goutam/fix_gpu_autoscaling
Dec 29, 2025

Conversation

@goutamvenkat-anyscale
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale commented Dec 23, 2025

Description

Previously, GPU allocation was special-cased in ReservationOpResourceAllocator:

  1. GPU operators got all available GPUs (`limits.gpu - op_usage.gpu`) regardless of their `max_resource_usage`
  2. The check `max_resource_usage != inf() and max_resource_usage.gpu > 0` failed for unbounded actor pools (`max_size=None`), causing them to get zero GPU budget
  3. GPU was stripped from the remaining shared resources (`.copy(gpu=0)`)

This caused a bug where `ActorPoolStrategy(min_size=1, max_size=None)` with GPU actors couldn't autoscale beyond the initial actor.
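To illustrate the failure mode, here is a minimal stand-in model of the old special-case (the `Resources` class and `old_gpu_budget` helper are hypothetical, not the actual Ray Data source):

```python
import math
from dataclasses import dataclass

# Hypothetical stand-in for Ray Data's ExecutionResources; illustrative only.
@dataclass(frozen=True)
class Resources:
    cpu: float = 0.0
    gpu: float = 0.0

INF = Resources(cpu=math.inf, gpu=math.inf)

def old_gpu_budget(max_resource_usage: Resources, limit_gpu: float, used_gpu: float) -> float:
    # Old special case: grant GPUs only when the op's max usage is finite AND needs GPU.
    if max_resource_usage != INF and max_resource_usage.gpu > 0:
        return max(limit_gpu - used_gpu, 0)
    return 0

# A bounded GPU actor pool (max_size set) passes the check...
print(old_gpu_budget(Resources(cpu=4, gpu=2), limit_gpu=4, used_gpu=1))  # 3
# ...but an unbounded pool (max_size=None) reports infinite max usage and gets 0,
# so it can never scale beyond its initial actor.
print(old_gpu_budget(INF, limit_gpu=4, used_gpu=1))  # 0
```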

Changes
task_pool_map_operator.py

  • Added a min_max_resource_requirements() method for consistency with ActorPoolMapOperator
  • Returns (min = resources for one task, max = max_concurrency * task resources, or infinite if no max_concurrency is set)
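A sketch of the described return value, using a hypothetical `Resources` stand-in rather than the actual Ray Data types:

```python
import math
from dataclasses import dataclass

# Hypothetical stand-in for Ray Data's ExecutionResources; illustrative only.
@dataclass(frozen=True)
class Resources:
    cpu: float = 0.0
    gpu: float = 0.0

def min_max_resource_requirements(task_resources, max_concurrency):
    # Minimum: enough resources to run a single task.
    min_usage = task_resources
    if max_concurrency is None:
        # No concurrency cap: the operator's maximum usage is unbounded.
        max_usage = Resources(cpu=math.inf, gpu=math.inf)
    else:
        # Bounded: at most max_concurrency tasks run at once.
        max_usage = Resources(
            cpu=task_resources.cpu * max_concurrency,
            gpu=task_resources.gpu * max_concurrency,
        )
    return min_usage, max_usage

lo, hi = min_max_resource_requirements(Resources(cpu=1, gpu=1), max_concurrency=4)
# lo is the single-task cost; hi scales it by the concurrency cap.
```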

resource_manager.py

  • Removed GPU special-casing entirely
  • GPU now flows through the same allocation path as CPU and memory
  • Operators are capped by their max_resource_usage for all resource types uniformly
  • Remaining shared resources (including GPU) go to unbounded downstream operators
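The unified path described above, where every resource type is capped the same way, can be sketched with a hypothetical helper (not the actual Ray code):

```python
# Each operator's budget is its share of allocated resources, capped by the
# headroom remaining to its max_resource_usage, applied identically to cpu,
# gpu, and memory (no GPU special case).
def capped_budget(allocated, usage, max_usage):
    return {
        k: min(allocated[k], max(max_usage[k] - usage[k], 0.0))
        for k in allocated
    }

budget = capped_budget(
    allocated={"cpu": 8.0, "gpu": 4.0},
    usage={"cpu": 2.0, "gpu": 1.0},
    # Bounded pool: max_actors * per-actor resources.
    max_usage={"cpu": 4.0, "gpu": 2.0},
)
# cpu: min(8, 4 - 2) = 2; gpu: min(4, 2 - 1) = 1
```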

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner December 23, 2025 17:27
@goutamvenkat-anyscale goutamvenkat-anyscale added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Dec 23, 2025
@goutamvenkat-anyscale
Contributor Author

/gemini summary

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes an issue with GPU autoscaling when max_actors is set by correctly calculating the maximum resource requirements in ActorPoolMapOperator. The change introduces logic to compute max_resource_usage based on max_actors, and falls back to infinite resources for unbounded pools. The accompanying tests are well-written and validate both the bounded and unbounded scenarios. I have one minor suggestion to improve code readability and reduce duplication.

@gemini-code-assist
Contributor

Summary of Changes

This pull request resolves an issue in the resource_manager where GPU autoscaling was inadvertently prevented due to max_resource_usage always being set to an infinite value, even when a max_actors limit was specified. The changes introduce a more accurate calculation for max_resource_usage, ensuring that it aligns with the configured max_actors for bounded actor pools, thereby enabling effective GPU autoscaling, while still correctly reporting infinite resources for unbounded pools.

Highlights

  • Corrected GPU Autoscaling Logic: The calculation for max_resource_usage now correctly reflects the max_actors limits instead of always being infinite, which previously hindered GPU autoscaling for bounded actor pools.
  • Conditional Max Resource Usage Calculation: Introduced logic to dynamically compute max_resource_usage: if max_actors is finite, resources are scaled by max_actors; otherwise, ExecutionResources.for_limits() (infinite resources) is returned.
  • Enhanced Test Coverage: Added a new test test_min_max_resource_requirements_unbounded to validate infinite resource reporting for unbounded pools and updated test_min_max_resource_requirements to assert finite resource limits for bounded pools.
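The conditional calculation described in the highlights can be sketched as follows (names are illustrative, not the actual source):

```python
import math

# Hypothetical sketch: for a bounded actor pool, max usage scales per-actor
# resources by max_actors; for an unbounded pool, report infinite maximums
# (analogous to what the summary calls ExecutionResources.for_limits()).
def max_resource_usage(per_actor: dict, max_actors: float) -> dict:
    if math.isinf(max_actors):
        # Unbounded pool: the allocator treats the operator as able to
        # absorb any remaining shared resources.
        return {k: math.inf for k in per_actor}
    # Bounded pool: scale per-actor resources by the actor cap.
    return {k: v * max_actors for k, v in per_actor.items()}

bounded = max_resource_usage({"cpu": 1.0, "gpu": 1.0}, max_actors=4)
unbounded = max_resource_usage({"cpu": 1.0, "gpu": 1.0}, max_actors=math.inf)
```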


Changelog
  • python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
    • Imported the math module to utilize math.isinf for checking infinite values.
    • Modified the min_max_resource_requirements method to dynamically calculate max_resource_usage based on max_actors. If max_actors is infinite, ExecutionResources.for_limits() is returned; otherwise, resources are scaled by max_actors.
  • python/ray/data/tests/test_actor_pool_map_operator.py
    • Updated test_min_max_resource_requirements to assert the correct finite max_resource_usage_bound when max_actors is specified.
    • Added a new test test_min_max_resource_requirements_unbounded to verify that unbounded actor pools correctly report infinite maximum resources.
Activity
  • The author, goutamvenkat-anyscale, requested a summary of the pull request.
  • A bot, gemini-code-assist[bot], provided a suggestion to refactor the object_store_memory calculation within the min_max_resource_requirements method for improved readability and reduced repetition.

@goutamvenkat-anyscale
Contributor Author

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request addresses a critical issue in Ray Data's resource management that previously prevented GPU autoscaling for actor pools with defined maximum limits. By eliminating special-casing for GPU allocation, the system now uniformly caps all resource types (CPU, GPU, memory) based on an operator's maximum resource usage. This ensures that bounded actor pools can correctly autoscale their GPU resources, while unbounded operators appropriately receive any remaining shared resources, leading to more predictable and efficient resource utilization.

Highlights

  • GPU Autoscaling Fix: Resolved a bug where GPU autoscaling was hindered for bounded actor pools due to incorrect max_resource_usage calculations, preventing them from scaling beyond initial actors.
  • Unified Resource Allocation: Removed special-casing for GPU allocation in the ReservationOpResourceAllocator, ensuring GPUs are now allocated through the same uniform path as CPU and memory.
  • Consistent Resource Requirements: Introduced a min_max_resource_requirements method in TaskPoolMapOperator for consistency, allowing it to report min/max resource needs similar to ActorPoolMapOperator.
  • Enhanced Test Coverage: Added new tests and updated existing ones to validate the corrected GPU allocation logic for both bounded and unbounded operators, ensuring proper autoscaling behavior.


Changelog
  • python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
    • Imported the math module to support math.isinf checks.
    • Modified min_max_resource_requirements to correctly calculate max_resource_usage based on max_actors, returning infinite resources only for truly unbounded pools.
  • python/ray/data/_internal/execution/operators/task_pool_map_operator.py
    • Added the min_max_resource_requirements method to provide consistent min/max resource reporting for task-based operators.
  • python/ray/data/_internal/execution/resource_manager.py
    • Removed the GPU special-casing logic from update_budgets, integrating GPU allocation into the general resource allocation flow.
    • Adjusted the handling of remaining shared resources for unbounded operators to include GPUs.
  • python/ray/data/tests/test_actor_pool_map_operator.py
    • Updated test_min_max_resource_requirements to use num_gpus and assert correct GPU and object store memory limits for bounded pools.
    • Added test_min_max_resource_requirements_unbounded to verify that unbounded actor pools correctly report infinite maximum resources.
  • python/ray/data/tests/test_resource_manager.py
    • Introduced mocks for min_max_resource_requirements in various tests to simulate operator behavior.
    • Updated test_gpu_allocation to reflect the new unified GPU allocation, asserting that both GPU and non-GPU operators receive appropriate GPU budgets.
    • Added _mem_op_internal and _mem_op_outputs mocks for more accurate resource manager testing.
    • Modified assertions in test_multiple_gpu_operators to align with capped GPU allocation behavior.
    • Added new tests test_gpu_unbounded_operator_can_autoscale and test_gpu_bounded_vs_unbounded_operators to specifically validate GPU autoscaling for different operator types.
Activity
  • The author, goutamvenkat-anyscale, requested a summary of the pull request.
  • A bot, gemini-code-assist[bot], provided an initial summary and suggested refactoring the object_store_memory calculation for improved readability.
  • The author, goutamvenkat-anyscale, requested another summary.

Member

@bveeramani bveeramani left a comment


Nice

@bveeramani bveeramani merged commit b2028df into ray-project:master Dec 29, 2025
6 checks passed
Comment on lines +987 to +988
```python
assert allocator._op_budgets[o2].gpu == 0
assert allocator._op_budgets[o3].gpu == 1
```
Contributor

Don't assert on internals; assert on state retrieved through public methods.

Comment on lines -882 to -902
```python
if (
    max_resource_usage != ExecutionResources.inf()
    and max_resource_usage.gpu > 0
):
    # If an operator needs GPU, we just allocate all GPUs to it.
    # TODO(hchen): allocate resources across multiple GPU operators.

    # The op_usage can be more than the global limit in the following cases:
    # 1. The op is setting a minimum concurrency that is larger than
    #    available num of GPUs.
    # 2. The cluster scales down, and the global limit decreases.
    target_num_gpu = max(
        limits.gpu - self._resource_manager.get_op_usage(op).gpu,
        0,
    )
else:
    target_num_gpu = 0

self._op_budgets[op] = (
    self._op_budgets[op].add(op_shared).copy(gpu=target_num_gpu)
)
```
Contributor

Why are we removing this?

This now allocates GPUs for non-GPU operators

cc @goutamvenkat-anyscale @bveeramani

AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

3 participants