[core] Fix deadlock when cancelling stale requests on in-order actors #57746
Merged
dayshah merged 3 commits into ray-project:master on Oct 15, 2025
Signed-off-by: dayshah <[email protected]>
Code Review
This pull request effectively resolves a deadlock that occurred when canceling stale requests for in-order actors. The approach of replacing the mutex-guarded boolean stopping_ with a std::atomic<bool> is a clean and correct solution to the re-entrant lock problem. The changes in task_receiver.cc and task_receiver.h are well-implemented. Additionally, moving and extending the Python tests to cover the in-order execution case is a valuable addition that ensures the fix is properly verified. Overall, this is a solid improvement to the codebase's stability. I have one minor suggestion for code simplification.
edoakes reviewed on Oct 15, 2025
edoakes approved these changes on Oct 15, 2025
dayshah added a commit to dayshah/ray that referenced this pull request on Oct 15, 2025
…n-order actors (ray-project#57746) Signed-off-by: dayshah <[email protected]>
aslonnie pushed a commit that referenced this pull request on Oct 16, 2025
…n-order actors (#57746) (#57768) ## Description Cherry picking #57746 Signed-off-by: dayshah <[email protected]>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request on Oct 20, 2025
…ray-project#57746) Signed-off-by: dayshah <[email protected]>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request on Oct 22, 2025
…ray-project#57746) Signed-off-by: dayshah <[email protected]> Signed-off-by: xgui <[email protected]>
elliot-barn pushed a commit that referenced this pull request on Oct 23, 2025
…#57746) Signed-off-by: dayshah <[email protected]> Signed-off-by: elliot-barn <[email protected]>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request on Nov 17, 2025
…ray-project#57746) Signed-off-by: dayshah <[email protected]>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request on Nov 19, 2025
…ray-project#57746) Signed-off-by: dayshah <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request on Dec 7, 2025
…ray-project#57746) Signed-off-by: dayshah <[email protected]> Signed-off-by: Future-Outlier <[email protected]>
Problem
Today we hold stop_mu while calling into ActorSchedulingQueue::Add. This calls into ActorSchedulingQueue::ScheduleRequests, which can potentially cancel a request and therefore invoke the cancel callback, which also tries to acquire stop_mu in the same call stack.

Cancel inside ActorSchedulingQueue::ScheduleRequests:
ray/src/ray/core_worker/task_execution/actor_scheduling_queue.cc
Line 174 in c6b8c9f

cancel_callback grabbing stop_mu:
ray/src/ray/core_worker/task_execution/task_receiver.cc
Line 176 in c6b8c9f

Grabbing stop_mu in TaskReceiver::HandleTask, which eventually leads into ActorSchedulingQueue::ScheduleRequests:
ray/src/ray/core_worker/task_execution/task_receiver.cc
Line 195 in c6b8c9f
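
To make the call chain above concrete, here is a minimal sketch of the re-entrant acquire. It is not the Ray code: ToyMutex and ToyQueue are hypothetical stand-ins, and the sketch uses a try-lock so that it can report the problem instead of hanging the way a blocking lock on the same thread would.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical non-reentrant lock standing in for stop_mu. TryLock() lets the
// sketch report a re-entrant acquire instead of actually blocking forever.
class ToyMutex {
 public:
  bool TryLock() { return !held_.exchange(true, std::memory_order_acquire); }
  void Unlock() { held_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> held_{false};
};

// Hypothetical stand-in for the scheduling queue: adding a stale request
// cancels it synchronously, on the caller's stack.
class ToyQueue {
 public:
  void Add(bool stale, const std::function<void()> &cancel_callback) {
    if (stale) {
      cancel_callback();
    }
  }
};

// Returns true if the cancel callback found the lock already held by its own
// call stack, which is the point where a blocking lock would deadlock.
bool CancelCallbackWouldDeadlock() {
  ToyMutex stop_mu;
  ToyQueue queue;
  bool would_deadlock = false;

  // Outer acquire, as in the handler that enqueues the task.
  const bool outer_acquired = stop_mu.TryLock();
  assert(outer_acquired && "acquiring an unheld lock must succeed");

  queue.Add(/*stale=*/true, [&]() {
    // Inner acquire from the cancel callback, still on the same stack.
    if (stop_mu.TryLock()) {
      stop_mu.Unlock();
    } else {
      would_deadlock = true;
    }
  });

  stop_mu.Unlock();
  return outer_acquired && would_deadlock;
}
```

In this toy model, CancelCallbackWouldDeadlock() returns true because the callback runs before the outer holder releases the lock.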
Solution
The solution here is just to turn stopping_ into an atomic bool. The mutex only exists to protect this flag, so it can be removed.

Extra
Renaming test_transient_error_retry to test_push_actor_task_failure, and moving it and test_update_object_location_batch_failure to test_core_worker_fault_tolerance.
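
As a rough illustration of the Solution section, the sketch below keeps the stop flag in a std::atomic<bool> so that both the handler and the cancel callback can read it without taking a lock. ToyReceiver, its method names, and the cancel counter are hypothetical and much simpler than the real TaskReceiver; only the name stopping_ and the atomic-bool idea come from this PR.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical, simplified receiver. stopping_ is a std::atomic<bool>, so no
// mutex is held while a stale request is cancelled on the handler's stack.
class ToyReceiver {
 public:
  // Returns false if the receiver is stopping and the task is rejected.
  bool HandleTask(bool stale) {
    if (stopping_.load()) {
      return false;
    }
    if (stale) {
      // Stand-in for the queue cancelling the stale request inline.
      OnCancel();
    }
    return true;
  }

  void Stop() { stopping_.store(true); }

  int cancel_count() const { return cancel_count_; }

 private:
  void OnCancel() {
    // Re-entrant read of the flag: safe because there is no lock to re-take.
    if (!stopping_.load()) {
      ++cancel_count_;
    }
  }

  std::atomic<bool> stopping_{false};
  int cancel_count_ = 0;  // Only touched from the handler's thread here.
};
```

A stale request handled before Stop() runs its cancel path to completion, and any request handled after Stop() is rejected up front; neither path can block on itself.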