[train][checkpoint] Add checkpoint_upload_mode to ray.train.report (#55637)
Conversation
Signed-off-by: Timothy Seah <[email protected]>
Code Review
This pull request introduces asynchronous checkpointing, a valuable feature for improving performance. The implementation is mostly well-structured, but I've identified a critical issue in the asynchronous handling logic that could lead to deadlocks under failure conditions. I've provided a detailed comment and a suggested fix for this. Additionally, there's a minor type hint mismatch that should be corrected for code clarity and correctness. It would also be beneficial to add tests for failure scenarios in asynchronous checkpointing to ensure the system's robustness.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Timothy Seah <[email protected]>
Looks great! Some suggestions for follow-up:

Happy to work on a follow-up PR branch with these fixes (try/finally + retries/timeouts + config), or help you with any of these.
Thanks for the awesome feedback!
Good callout - I am currently working on #55756, after which I will be able to propagate the exception up to the main thread. Once that's merged I will use that and add the try/finally to this PR.
I think this could be a good followup PR - feel free to create a new issue and assign it to yourself. I would suggest subclassing CheckpointConfig for Ray Train v2 and adding the knob there.
I think this would also be a good followup PR - feel free to create a GitHub issue and assign it to yourself.
Reports should happen in order of submission per worker but each report also forms a barrier across all workers (see the warning in https://docs.ray.io/en/latest/train/api/doc/ray.train.report.html). What kind of documentation did you have in mind? I think https://github.com/ray-project/ray/blob/master/python/ray/train/v2/api/train_fn_utils.py#L36 mentions something to that effect but feel free to suggest modifications. Can you elaborate on the test? I think https://github.com/ray-project/ray/pull/55637/files#diff-7df817b7f2f904c441481b97cc76db79e527b265b8b7a8fadbc78e722f16f208R147 tries to do something similar but let me know if you have any suggestions.
I think this could also be worth filing a GitHub issue and assigning to yourself. My only concerns are to make sure it doesn't slow down training and that it isn't too noisy e.g. maybe we can toggle it with an environment variable.
After some more thought, I think a ThreadPool would indeed be better - let me change the PR accordingly.
The `ThreadRunner` is an abstraction used by Ray Train to capture errors raised by the training function so they can be polled by the Ray Train controller. This PR extends the `ThreadRunner` to also capture errors raised by threads created by the training function e.g. async checkpoint upload threads (#55637). --------- Signed-off-by: Timothy Seah <[email protected]>
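The error-capturing mechanism described in that commit can be sketched independently of Ray. The following is a hypothetical, simplified `ThreadRunner` (the real implementation differs); it uses Python's `threading.excepthook` to also capture uncaught exceptions from threads spawned by the training function, such as async checkpoint upload threads:

```python
import threading

class ThreadRunner:
    """Simplified sketch: collect errors from the training function
    and from any threads it spawns (names and structure hypothetical)."""

    def __init__(self):
        self._error_lock = threading.Lock()
        self.errors = []

    def run(self, target):
        # Capture uncaught exceptions from any thread started while
        # `target` runs, by temporarily replacing threading.excepthook.
        previous_hook = threading.excepthook

        def hook(args):
            with self._error_lock:
                self.errors.append(args.exc_value)

        threading.excepthook = hook
        try:
            try:
                target()
            except Exception as e:  # error raised by the training fn itself
                with self._error_lock:
                    self.errors.append(e)
        finally:
            threading.excepthook = previous_hook

def train_fn():
    # Simulates an async checkpoint upload thread that fails.
    def failing_upload():
        raise RuntimeError("upload failed")

    t = threading.Thread(target=failing_upload)
    t.start()
    t.join()

runner = ThreadRunner()
runner.run(train_fn)
print([type(e).__name__ for e in runner.errors])  # -> ['RuntimeError']
```

The captured errors can then be polled by a controller, which is the role the Ray Train controller plays in the real abstraction.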
#56208) After #55637, `ray.train.report` will allow users to upload checkpoints from disk to remote storage asynchronously. If they want to use framework-specific async checkpointing like `torch.async_save`, they can manage `torch.async_save` themselves and then call `ray.train.report(…checkpoint_upload_mode=CheckpointUploadMode.NO_UPLOAD)`. However, it would also be nice to allow `ray.train.report` to handle rate limiting and report ordering for framework-specific async checkpointing as well. This PR achieves this by exposing a `checkpoint_upload_function` argument that can replace the `persist_current_checkpoint` call. --------- Signed-off-by: Timothy Seah <[email protected]>
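The `NO_UPLOAD` pattern described above might look roughly like the following. This is a hedged sketch, not code from the PR: the `CheckpointUploadMode` import path, the commented `torch.distributed.checkpoint.async_save` usage, and the local directory handling are all assumptions for illustration.

```python
import ray.train
from ray.train import Checkpoint
from ray.train import CheckpointUploadMode  # import path assumed

def train_fn(config):
    # Framework-specific async checkpointing, managed by the user, e.g.:
    #   fut = torch.distributed.checkpoint.async_save(state_dict, checkpoint_id=local_dir)
    #   fut.result()  # wait for the framework-level save to land on disk
    local_dir = "/tmp/ckpt"  # hypothetical path written by the framework

    # Report the checkpoint without having Ray Train upload it:
    ray.train.report(
        {"loss": 0.1},
        checkpoint=Checkpoint.from_directory(local_dir),
        checkpoint_upload_mode=CheckpointUploadMode.NO_UPLOAD,
    )
```

With the `checkpoint_upload_function` argument added in this follow-up, `ray.train.report` can instead drive a custom upload itself, so rate limiting and report ordering are handled by Ray Train rather than by the user.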
Summary
Implement async checkpoint uploads in `ray.train.report(..., checkpoint_upload_mode)`, supporting `SYNC` (default), `ASYNC`, and `NO_UPLOAD`. Also add `delete_local_checkpoint_after_upload` to control temporary local directory cleanup.

Implementation Summary
This PR implements async checkpointing by:

- Adding `checkpoint_upload_mode` to `ray.train.report` with three options: `SYNC`, `ASYNC`, and `NO_UPLOAD`.
- Adding `num_reported_checkpoints` and `num_attempted_reported_checkpoints` counters on the `TrainContext`. Regardless of which `checkpoint_upload_mode` the different Ray Train workers are using, we want to upload checkpoints in the order they were `ray.train.report`ed. Therefore, each Ray Train worker waits for its turn (`num_reported_checkpoints == current_report_attempt_number - 1`) before adding its checkpoint to the result queue.
- Using a `ThreadPoolExecutor` to guard against spawning too many checkpoint upload threads.
- Changing `run_train_fn` to wrap the `train_fn` in `train_fn_that_waits_for_threads`, because otherwise we could end up in the following situation: 1) the train function exits with pending report threads and the worker status is "finished"; 2) the controller sees the finished status and shuts down the worker group; 3) `result.fit` does not return all the reported checkpoints/metrics.
- Propagating errors through the `ThreadRunner`, but implementing "wait for threads" as a wrapper function: in the former case, that is the cleanest way for a nested thread to cause the entire worker to exit early, while in this case the target function can wait for the threads it creates without complicating the `ThreadRunner` abstraction.

A few other notes:
- We add `Checkpoint`s (instead of `Checkpoint` `ObjectRef`s) to the result queue because, with the `ObjectRef` approach, the controller would have to create a Ray task that updates controller state. This "driver creates a task that updates the driver" pattern is unwieldy to implement.
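The per-worker ordering and thread-pool behavior described in the implementation summary can be sketched independently of Ray. In this hypothetical sketch (all names are invented, not the PR's), uploads run concurrently in a bounded pool, but each report waits for its turn before adding its result to the queue, so results always appear in submission order:

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

class OrderedReporter:
    """Sketch of ordered async checkpoint reporting (hypothetical names)."""

    def __init__(self, max_upload_threads=2):
        # Bounded pool guards against spawning too many upload threads.
        self._pool = ThreadPoolExecutor(max_workers=max_upload_threads)
        self._cond = threading.Condition()
        self._num_reported = 0    # like num_reported_checkpoints
        self._num_attempted = 0   # like num_attempted_reported_checkpoints
        self.results = Queue()    # stand-in for the result queue

    def report(self, checkpoint, upload_fn):
        with self._cond:
            self._num_attempted += 1
            attempt_number = self._num_attempted
        self._pool.submit(
            self._upload_and_enqueue, checkpoint, upload_fn, attempt_number
        )

    def _upload_and_enqueue(self, checkpoint, upload_fn, attempt_number):
        uploaded = upload_fn(checkpoint)  # uploads may finish out of order
        with self._cond:
            # Wait for our turn: all earlier reports must be enqueued first.
            self._cond.wait_for(
                lambda: self._num_reported == attempt_number - 1
            )
            self.results.put(uploaded)
            self._num_reported += 1
            self._cond.notify_all()

    def shutdown(self):
        # Like the train_fn wrapper: wait for pending uploads before exiting.
        self._pool.shutdown(wait=True)

reporter = OrderedReporter()
for i in range(5):
    # Simulated upload with a random delay; returns the checkpoint.
    reporter.report(i, lambda c: (time.sleep(random.uniform(0, 0.02)), c)[1])
reporter.shutdown()
order = [reporter.results.get() for _ in range(5)]
print(order)  # results are enqueued in submission order
```

Because the pool executes submitted tasks in FIFO order, the earliest unfinished report is always running, so the waiters can never deadlock.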
API Changes

This PR's only API changes are adding the following two arguments to `ray.train.report`:

- `checkpoint_upload_mode`: How to upload the checkpoint (`SYNC`, `ASYNC`, or `NO_UPLOAD`).
- `delete_local_checkpoint_after_upload`: Whether to delete the local checkpoint after uploading it. Users generally won't need to set this, since each checkpoint upload mode has its own default for `tempfile`-backed checkpoint directories - see the previous section for an explanation.

Here's a simple example of this API in action:
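The example itself did not survive extraction, so the following is a minimal sketch of what usage might look like, assuming `CheckpointUploadMode` is importable from `ray.train` (the exact import path and trainer setup may differ):

```python
import tempfile

import ray.train
from ray.train import Checkpoint
from ray.train import CheckpointUploadMode  # import path assumed

def train_fn(config):
    for epoch in range(3):
        # ... train one epoch ...
        ckpt_dir = tempfile.mkdtemp()
        # ... write model/optimizer state into ckpt_dir ...
        ray.train.report(
            {"epoch": epoch},
            checkpoint=Checkpoint.from_directory(ckpt_dir),
            # Upload in the background instead of blocking the training
            # loop; the local temp directory is cleaned up after upload
            # by the mode's default delete behavior.
            checkpoint_upload_mode=CheckpointUploadMode.ASYNC,
        )
```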
Testing

Looks like async reporting is indeed faster with the same loss on the PyTorch Ray Train example: https://docs.ray.io/en/latest/train/getting-started-pytorch.html

- Sync mode: 3m3s
- Async mode: 2m57s, with only ~0.22s blocking time when waiting for the last checkpoint upload