
[train][checkpoint] Add checkpoint_upload_mode to ray.train.report#55637

Merged
justinvyu merged 19 commits into ray-project:master from TimothySeah:tseah/async-checkpoint
Sep 11, 2025

Conversation

@TimothySeah (Contributor) commented Aug 15, 2025

Summary

Implement async checkpoint uploads in ray.train.report(..., checkpoint_upload_mode), supporting SYNC (default), ASYNC, and NO_UPLOAD.

  • Introduce per-worker checkpoint counters to preserve report order.
  • Use a thread pool to limit concurrent uploads and avoid OOM.
  • Wrap the training function to wait for pending uploads before exiting.
  • Add delete_local_checkpoint_after_upload to control temporary local directory cleanup.

Implementation Summary

This PR implements async checkpointing by

  • Adding a checkpoint_upload_mode to ray.train.report with three options
  • Maintaining internal-only num_reported_checkpoints and num_attempted_reported_checkpoints counters on the TrainContext
  • Regardless of which combination of checkpoint_upload_mode values the different Ray Train workers use, checkpoints must be uploaded in the order they were passed to ray.train.report. Therefore, each Ray Train worker waits for its turn (num_reported_checkpoints == current_report_attempt_number - 1) before adding its checkpoint to the result queue.
  • Uploading too many checkpoints concurrently risks OOM-ing, so a ThreadPoolExecutor guards against spawning too many checkpoint upload threads.
  • I changed run_train_fn to wrap the train_fn in train_fn_that_waits_for_threads because otherwise we could end up in the following situation: 1) the train function exits with pending report threads and the worker status becomes finished; 2) the controller sees the finished status and shuts down the worker group; 3) result.fit does not return all the reported checkpoints/metrics.
    • I implemented "early exit" in ThreadRunner but "wait for threads" as a wrapper function: the former is the cleanest way for a nested thread to cause the entire worker to exit early, while in the latter case the target function can wait for the threads it creates without complicating the ThreadRunner abstraction.
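The ordering logic above can be sketched in plain Python. This is purely illustrative (OrderedReporter and its methods are my names, not Ray Train internals): each report attempt takes a sequence number, and an upload thread waits on a condition variable until every earlier attempt has been reported before enqueueing its result.

```python
import threading

class OrderedReporter:
    """Toy stand-in for the per-worker TrainContext counters described above."""

    def __init__(self):
        self.num_reported_checkpoints = 0
        self.num_attempted_reported_checkpoints = 0
        self.cond = threading.Condition()
        self.result_queue = []

    def next_attempt_number(self):
        self.num_attempted_reported_checkpoints += 1
        return self.num_attempted_reported_checkpoints

    def report(self, attempt_number, result):
        with self.cond:
            # Wait for our turn: all earlier attempts must be reported first.
            self.cond.wait_for(
                lambda: self.num_reported_checkpoints == attempt_number - 1
            )
            self.result_queue.append(result)
            self.num_reported_checkpoints += 1
            self.cond.notify_all()

reporter = OrderedReporter()
attempts = [(reporter.next_attempt_number(), f"ckpt-{i}") for i in range(5)]

# Start the "uploads" in reverse order to show results still land in order.
threads = [
    threading.Thread(target=reporter.report, args=(n, r))
    for n, r in reversed(attempts)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(reporter.result_queue)  # ['ckpt-0', 'ckpt-1', 'ckpt-2', 'ckpt-3', 'ckpt-4']
```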

A few other notes:

  • I decided to only add Checkpoints (instead of Checkpoint ObjectRefs) to the result queue because:
    • If we went with the ObjectRef approach, the controller would create a Ray task that updates controller state. This "driver creates task that updates driver" pattern is unwieldy to implement.
    • Every worker must upload its checkpoint so it makes sense to confine this logic to the worker rather than making the controller even more complicated than it already is.
  • One interesting corner case I found while unit testing this PR is that async checkpoint uploads don't work with temporary directories because we might exit the temporary directory's scope before we kick off the checkpoint upload.
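That corner case is easy to reproduce with nothing but the standard library: a tempfile directory is deleted as soon as its context exits, so an upload deferred past that point (as an ASYNC-mode background thread may be) has nothing left to read.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as checkpoint_dir:
    # Write a checkpoint file inside the temporary directory.
    with open(os.path.join(checkpoint_dir, "model.pt"), "w") as f:
        f.write("weights")
    assert os.path.exists(checkpoint_dir)

# By the time a deferred upload thread would run, the directory is gone.
print(os.path.exists(checkpoint_dir))  # False
```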

API Changes

This PR's only API changes are adding the following two arguments to ray.train.report:

  • checkpoint_upload_mode:
    • SYNC: synchronous upload - current and default behavior
    • ASYNC: asynchronous upload - the main goal of this PR
    • NO_UPLOAD: do not upload the checkpoint - useful when users upload checkpoints themselves
  • delete_local_checkpoint_after_upload: Whether to delete the checkpoint after uploading it. Users generally won't need to set this since each checkpoint upload mode has its own default:
    • SYNC: False because users will generally use tempfile
    • ASYNC: True because users can't use tempfile - see previous section for explanation
    • NO_UPLOAD: False because there is no local directory to delete

Here's a simple example of this API in action:

def train_func():
    ...
    ray.train.report(
        metrics={},
        checkpoint=Checkpoint.from_directory(checkpoint_dir),
        checkpoint_upload_mode=CheckpointUploadMode.ASYNC,
    )

Testing

Async reporting is indeed faster, with the same loss, on the PyTorch Ray Train example: https://docs.ray.io/en/latest/train/getting-started-pytorch.html

Sync mode

(RayTrainWorker pid=6316, ip=10.0.180.165) Blocked times: [0.2751896381378174, 0.28496718406677246, 0.26192378997802734, 0.25046420097351074, 0.2681725025177002, 0.2644937038421631, 0.27478623390197754, 0.2887108325958252, 0.36760926246643066, 0.32657504081726074] with total 2.8628923892974854

{'loss': 0.04541657865047455, 'epoch': 9}

3m3s

Async mode

(RayTrainWorker pid=8960, ip=10.0.185.254) Blocked times: [0.005418062210083008, 0.004388093948364258, 0.0045735836029052734, 0.004605531692504883, 0.004551887512207031, 0.0025453567504882812, 0.021490812301635742, 0.014325380325317383, 0.0050373077392578125, 0.004266977310180664] with total 0.07120299339294434

{'loss': 0.04778982326388359, 'epoch': 9}

2m57s with only ~0.22s blocking time when waiting for the last checkpoint upload:

2025-08-29 20:44:25.066 | Blocked times: [0.005418062210083008, 0.004388093948364258, 0.0045735836029052734, 0.004605531692504883, 0.004551887512207031, 0.0025453567504882812, 0.021490812301635742, 0.014325380325317383, 0.0050373077392578125, 0.004266977310180664] with total 0.07120299339294434
2025-08-29 20:44:25.289 | Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/my_run_name/checkpoint_2025-08-29_20-44-24.990364)


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces asynchronous checkpointing, a valuable feature for improving performance. The implementation is mostly well-structured, but I've identified a critical issue in the asynchronous handling logic that could lead to deadlocks under failure conditions. I've provided a detailed comment and a suggested fix for this. Additionally, there's a minor type hint mismatch that should be corrected for code clarity and correctness. It would also be beneficial to add tests for failure scenarios in asynchronous checkpointing to ensure the system's robustness.

TimothySeah and others added 2 commits August 20, 2025 17:43
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Timothy Seah <[email protected]>
@kushalthaman

Looks great! Some suggestions for context.py:

  1. upload_checkpoint_and_report_function: wrap the upload and report in try/except and the thread cleanup in try/finally to avoid deadlocks if an upload fails, e.g.

    def _upload_checkpoint_and_report(...):
        try:
            try:
                training_result = self._save_checkpoint(checkpoint_dir_name, metrics, checkpoint)
            except Exception:
                logger.exception("Async checkpoint upload failed; reporting failure to advance order.")
                training_result = _TrainingResult(checkpoint=None, metrics=metrics)
            self._wait_then_report(training_result, current_report_attempt_number)
        finally:
            with self.max_uploads_condition:
                self.ordered_checkpoint_upload_threads.pop(current_report_attempt_number, None)
                self.max_uploads_condition.notify_all()
  2. Make MAX_CHECKPOINT_UPLOAD_THREADS configurable through a RunConfig knob
  3. Add exponential-backoff retries and a per-attempt upload timeout around save_checkpoint, perhaps with structured logs per attempt
  4. Is the report order guarantee global across ranks or per worker? Currently it seems to be per worker (per TrainContext). If the controller requires global per-iteration ordering, maybe we should document the contract and add a test that mixes ASYNC/SYNC/NO_UPLOAD across ranks and asserts that the controller output order matches report order.
  5. Progress reporting (bytes transferred, elapsed time, speed) would be useful, as would aggregate metrics (success/failure counts, average upload time)
  6. A lightweight ThreadPool or per-worker executor might be cleaner than spawning many Threads

Happy to work on a follow-up PR branch with these fixes (try/finally + retries/timeouts + config), or help you with any of these.

@TimothySeah
Copy link
Contributor Author

Thanks for the awesome feedback!

  1. upload_checkpoint_and_report_function: wrap upload+report in try/except and thread cleanup in try/finally to avoid deadlocks if upload fails

Good callout - I am currently working on #55756, after which I will be able to propagate the exception up to the main thread. Once that's merged I will use that and add the try/finally to this PR.

  2. Make MAX_CHECKPOINT_UPLOAD_THREADS configurable through the RunConfig knob

I think this could be a good followup PR - feel free to create a new issue and assign it to yourself. I would suggest subclassing CheckpointConfig for Ray Train v2 and adding the knob there.

  3. Add exponential backoff retries and a per-attempt upload timeout around save_checkpoint perhaps with structured logs per attempt

I think this would also be a good followup PR - feel free to create a GitHub issue and assign it to yourself.

  4. Is the report order guarantee global across ranks or per-worker? Currently it seems to be per worker (per TrainContext). But if the controller requires global per-iteration ordering, maybe we should document the contract and add a test that mixes ASYNC/SYNC/NO_UPLOAD across ranks and assert that the controller output order matches report order.

Reports should happen in order of submission per worker but each report also forms a barrier across all workers (see the warning in https://docs.ray.io/en/latest/train/api/doc/ray.train.report.html).
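That per-worker ordering plus cross-worker barrier can be sketched with threading.Barrier (purely illustrative, not Ray internals): no worker starts report N+1 until every worker has finished report N.

```python
import threading

NUM_WORKERS = 3
NUM_REPORTS = 2
barrier = threading.Barrier(NUM_WORKERS)
log = []
log_lock = threading.Lock()

def worker(rank):
    for step in range(NUM_REPORTS):
        barrier.wait()  # each report acts as a barrier across all workers
        with log_lock:
            log.append((step, rank))

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All step-0 reports complete before any step-1 report begins.
print([step for step, _ in log])  # [0, 0, 0, 1, 1, 1]
```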

What kind of documentation did you have in mind? I think https://github.com/ray-project/ray/blob/master/python/ray/train/v2/api/train_fn_utils.py#L36 mentions something to that effect but feel free to suggest modifications.

Can you elaborate on the test? I think https://github.com/ray-project/ray/pull/55637/files#diff-7df817b7f2f904c441481b97cc76db79e527b265b8b7a8fadbc78e722f16f208R147 tries to do something similar but let me know if you have any suggestions.

  5. Adding progress reporting (bytes transferred, elapsed, speed) is useful, as well as aggregate metrics (success/failure counts, avg upload time)

I think this could also be worth filing a GitHub issue and assigning to yourself. My only concerns are to make sure it doesn't slow down training and that it isn't too noisy e.g. maybe we can toggle it with an environment variable.

  6. Maybe a lightweight ThreadPool or per-worker executor would be cleaner, instead of spawning many Threads

After some more thought I think a ThreadPool would indeed be better - let me change the PR accordingly.
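A minimal sketch of that direction, assuming a stdlib ThreadPoolExecutor (fake_upload and MAX_CHECKPOINT_UPLOAD_THREADS are illustrative names, not Ray Train APIs): the pool caps concurrent uploads, and blocking on the futures before returning mirrors the train_fn wrapper that waits for pending uploads.

```python
from concurrent.futures import ThreadPoolExecutor, wait

MAX_CHECKPOINT_UPLOAD_THREADS = 2  # illustrative cap on concurrent uploads

def fake_upload(checkpoint_id):
    # Stand-in for persisting a checkpoint to remote storage.
    return f"uploaded-{checkpoint_id}"

executor = ThreadPoolExecutor(max_workers=MAX_CHECKPOINT_UPLOAD_THREADS)
futures = [executor.submit(fake_upload, i) for i in range(5)]

# Before the training function exits, block on all pending uploads so the
# controller never tears down the worker group with uploads in flight.
wait(futures)
results = sorted(f.result() for f in futures)
executor.shutdown()
print(results)  # ['uploaded-0', 'uploaded-1', 'uploaded-2', 'uploaded-3', 'uploaded-4']
```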

justinvyu pushed a commit that referenced this pull request Aug 26, 2025
The `ThreadRunner` is an abstraction used by Ray Train to capture errors
raised by the training function so they can be polled by the Ray Train
controller. This PR extends the `ThreadRunner` to also capture errors
raised by threads created by the training function e.g. async checkpoint
upload threads (#55637).

---------

Signed-off-by: Timothy Seah <[email protected]>
@TimothySeah TimothySeah requested a review from justinvyu August 30, 2025 00:21
@TimothySeah TimothySeah marked this pull request as ready for review August 30, 2025 00:21
@TimothySeah TimothySeah requested a review from a team as a code owner August 30, 2025 00:21
@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Aug 30, 2025
Contributor

@justinvyu justinvyu left a comment


great stuff 💯

Signed-off-by: Timothy Seah <[email protected]>
Contributor

@justinvyu justinvyu left a comment


🚢

@TimothySeah TimothySeah added the go add ONLY when ready to merge, run all tests label Sep 11, 2025
@justinvyu justinvyu merged commit e88b3f8 into ray-project:master Sep 11, 2025
6 checks passed
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
…ay-project#55637)

Implement async checkpoint uploads in ray.train.report(..., checkpoint_upload_mode), supporting SYNC (default), ASYNC, and NO_UPLOAD.
* Introduce per-worker checkpoint counters to preserve report order.
* Use a thread pool to limit concurrent uploads and avoid OOM.
* Wrap the training function to wait for pending uploads before exiting.
* Add delete_local_checkpoint_after_upload to control temporary local directory cleanup.

---------

Signed-off-by: Timothy Seah <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: zac <[email protected]>
justinvyu pushed a commit that referenced this pull request Oct 1, 2025
#56208)

After #55637, `ray.train.report`
will allow users to upload checkpoints from disk to remote storage
asynchronously.

If they want to use framework-specific async checkpointing like
`torch.async_save`, they can manage `torch.async_save` themselves and
then call
`ray.train.report(…checkpoint_upload_mode=CheckpointUploadMode.NO_UPLOAD)`.

However, it would also be nice to allow `ray.train.report` to handle
rate limiting and report ordering for framework-specific async
checkpointing as well. This PR achieves this by exposing a
`checkpoint_upload_function` argument that can replace the
`persist_current_checkpoint` call.

---------

Signed-off-by: Timothy Seah <[email protected]>

Labels

go add ONLY when ready to merge, run all tests train Ray Train Related Issue

4 participants