[Data] Fix GPU autoscaling if max_actors is set #59632
bveeramani merged 6 commits into ray-project:master
Conversation
Code Review
This pull request fixes an issue with GPU autoscaling when max_actors is set by correctly calculating the maximum resource requirements in ActorPoolMapOperator. The change introduces logic to compute max_resource_usage based on max_actors, and falls back to infinite resources for unbounded pools. The accompanying tests are well-written and validate both the bounded and unbounded scenarios. I have one minor suggestion to improve code readability and reduce duplication.
python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
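The review above describes computing `max_resource_usage` from `max_actors` in `ActorPoolMapOperator`, falling back to infinite resources for unbounded pools. A minimal sketch of that logic, with illustrative names (`Resources`, `max_resource_usage`) rather than Ray's actual classes:

```python
# Hypothetical sketch of the fix: cap an actor pool's maximum resource
# usage at max_actors copies of the per-actor request, and report
# unbounded resources when no cap is set. Not Ray's real API.
import math
from dataclasses import dataclass


@dataclass
class Resources:
    cpu: float = 0.0
    gpu: float = 0.0

    def scale(self, k: float) -> "Resources":
        # Multiply every resource dimension by k.
        return Resources(cpu=self.cpu * k, gpu=self.gpu * k)

    @classmethod
    def inf(cls) -> "Resources":
        return cls(cpu=math.inf, gpu=math.inf)


def max_resource_usage(per_actor: Resources, max_actors) -> Resources:
    # Bounded pool: at most max_actors copies of the per-actor request.
    # Unbounded pool (max_actors is None): no cap, report infinity.
    if max_actors is None:
        return Resources.inf()
    return per_actor.scale(max_actors)
```

With a bounded pool of 4 actors each requesting 1 GPU, the cap is 4 GPUs; with `max_actors=None` the pool reports infinite usage and is treated as unbounded by the allocator.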
/gemini summary
Summary of Changes
This pull request addresses a critical issue in Ray Data's resource management that previously prevented GPU autoscaling for actor pools with defined maximum limits. By eliminating special-casing for GPU allocation, the system now uniformly caps all resource types (CPU, GPU, memory) based on an operator's maximum resource usage. This ensures that bounded actor pools can correctly autoscale their GPU resources, while unbounded operators appropriately receive any remaining shared resources, leading to more predictable and efficient resource utilization.
assert allocator._op_budgets[o2].gpu == 0
assert allocator._op_budgets[o3].gpu == 1
Don't assert on internals; assert on state retrieved through public methods.
if (
    max_resource_usage != ExecutionResources.inf()
    and max_resource_usage.gpu > 0
):
    # If an operator needs GPU, we just allocate all GPUs to it.
    # TODO(hchen): allocate resources across multiple GPU operators.

    # The op_usage can be more than the global limit in the following cases:
    # 1. The op is setting a minimum concurrency that is larger than
    #    the available number of GPUs.
    # 2. The cluster scales down, and the global limit decreases.
    target_num_gpu = max(
        limits.gpu - self._resource_manager.get_op_usage(op).gpu,
        0,
    )
else:
    target_num_gpu = 0

self._op_budgets[op] = (
    self._op_budgets[op].add(op_shared).copy(gpu=target_num_gpu)
)
Why are we removing this?
This now allocates GPUs for non-GPU operators
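The PR replaces the special case above with a uniform path where every resource dimension is clipped to the operator's remaining headroom under its maximum usage. A minimal sketch of that capping, with illustrative names (not Ray's actual code):

```python
# Illustrative sketch of the uniform budget capping that replaces the
# removed GPU special case: each resource (cpu, gpu, memory) in an
# operator's budget is clipped to max_usage - current usage.
import math


def cap_budget(budget: dict, op_usage: dict, max_usage: dict) -> dict:
    # For each resource key, the operator's budget cannot exceed its
    # remaining headroom; unbounded ops (max = inf) keep the full
    # shared remainder, including GPUs.
    capped = {}
    for key, value in budget.items():
        headroom = max(max_usage.get(key, math.inf) - op_usage.get(key, 0.0), 0.0)
        capped[key] = min(value, headroom)
    return capped
```

Under this sketch, a non-GPU operator has `max_usage["gpu"] == 0`, so its GPU budget is clipped to zero rather than receiving GPUs, which would address the reviewer's concern, assuming the real implementation caps by max usage the same way.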
## Description

Previously, GPU allocation was special-cased in `ReservationOpResourceAllocator`:

1. GPU operators got all available GPUs (`limits.gpu - op_usage.gpu`) regardless of their `max_resource_usage`.
2. The check `max_resource_usage != inf() and max_resource_usage.gpu > 0` failed for unbounded actor pools (`max_size=None`), causing them to get zero GPU budget.
3. GPU was stripped from the remaining shared resources (`.copy(gpu=0)`).

This caused a bug where `ActorPoolStrategy(min_size=1, max_size=None)` with GPU actors couldn't autoscale beyond the initial actor.

**Changes**

_task_pool_map_operator.py_

- Added a `min_max_resource_requirements()` method for consistency with `ActorPoolMapOperator`.
- Returns (min = 1 task's resources, max = `max_concurrency` * task resources, or inf).

_resource_manager.py_

- Removed GPU special-casing entirely.
- GPU now flows through the same allocation path as CPU and memory.
- Operators are capped by their `max_resource_usage` uniformly for all resource types.
- Remaining shared resources (including GPU) go to unbounded downstream operators.

## Related issues

> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234".

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: Goutam <[email protected]>
Signed-off-by: jasonwrwang <[email protected]>
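The `min_max_resource_requirements()` behavior the description attributes to `task_pool_map_operator.py` can be sketched as follows (illustrative function and names, not Ray's actual code):

```python
# Sketch of min/max resource requirements for a task pool:
# min is the cost of one task; max is max_concurrency tasks,
# or unbounded when no concurrency cap is set.
import math


def min_max_resource_requirements(task_cpu, task_gpu, max_concurrency=None):
    minimum = {"cpu": task_cpu, "gpu": task_gpu}
    if max_concurrency is None:
        # No cap configured: report infinite maximum usage.
        maximum = {"cpu": math.inf, "gpu": math.inf}
    else:
        maximum = {
            "cpu": task_cpu * max_concurrency,
            "gpu": task_gpu * max_concurrency,
        }
    return minimum, maximum
```

For example, tasks requesting 1 CPU with `max_concurrency=4` would report a minimum of 1 CPU and a maximum of 4 CPUs, mirroring how the actor-pool operator reports `max_actors` times the per-actor request.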