[tune] Fault tolerance improvements by ujvl · Pull Request #5877 · ray-project/ray

ujvl · 2019-10-10T00:43:26Z

Why are these changes needed?

Checkpoint committing

Checkpointing fails when a worker dies during a checkpoint, because the worker returns the checkpoint path before the checkpoint has been pulled onto the driver. To fix this, the driver now rsyncs the checkpoint down synchronously by default and waits for the sync to finish before setting the newest checkpoint.
Note that this may add significant latency on the driver's critical path, so it can be turned off at the cost of relaxed trial fault-tolerance guarantees.
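
To illustrate the ordering described above, here is a minimal sketch of "sync first, then publish". It is not the code in this PR: `TrialCheckpointState`, `commit_checkpoint` and `sync_fn` are hypothetical names, and `shutil.copytree` stands in for the synchronous rsync so the example runs locally.

```python
import shutil
import tempfile
from pathlib import Path


class TrialCheckpointState:
    """Hypothetical driver-side record of a trial's newest checkpoint."""

    def __init__(self):
        self.newest_checkpoint = None


def commit_checkpoint(state, worker_path, driver_dir, sync_fn=shutil.copytree):
    """Pull the checkpoint onto the driver, then publish it as newest.

    sync_fn is an assumed stand-in for the synchronous rsync and must
    raise OSError on failure.
    """
    target = Path(driver_dir) / Path(worker_path).name
    try:
        sync_fn(worker_path, target)
    except OSError:
        # Sync failed (e.g. the worker died): keep the previous checkpoint.
        return False
    # Only reached once the data is on the driver.
    state.newest_checkpoint = str(target)
    return True


with tempfile.TemporaryDirectory() as root:
    worker_ckpt = Path(root) / "worker" / "checkpoint_1"
    worker_ckpt.mkdir(parents=True)
    (worker_ckpt / "model.bin").write_text("weights")
    driver_dir = Path(root) / "driver"
    driver_dir.mkdir()

    state = TrialCheckpointState()
    ok = commit_checkpoint(state, worker_ckpt, driver_dir)
    committed_exists = (Path(state.newest_checkpoint) / "model.bin").exists()

    # A checkpoint whose source has vanished is never published.
    missing = commit_checkpoint(
        state, Path(root) / "worker" / "checkpoint_2", driver_dir)
    newest_name = Path(state.newest_checkpoint).name
```

In this sketch a failed sync leaves `newest_checkpoint` pointing at the last checkpoint that actually reached the driver, which is the property recovery relies on.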

Checkpoint GC

Checkpoints are now deleted automatically post-sync, using an rsync flag.
Checkpoints on the driver are garbage collected according to the policy defined by the user.
This PR also fixes how checkpoints are ranked (using a priority queue) so that the wrong checkpoint isn't deleted.
It also allows using checkpoint_score_attr without keep_checkpoints_num set (i.e. if you want to rank your best checkpoints differently but don't need to bound the number of them stored).
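
As a rough sketch of the ranking idea, the following keeps the best N checkpoints in a min-heap so the lowest-scoring one is always the next to be deleted. `CheckpointRanker`, `keep_num` and `delete_fn` are hypothetical names, not the classes added in this PR, and the sketch does not protect the newest checkpoint from deletion.

```python
import heapq
import itertools


class CheckpointRanker:
    """Hypothetical helper that keeps the best `keep_num` checkpoints by score."""

    def __init__(self, keep_num=None, delete_fn=lambda path: None):
        self.keep_num = keep_num  # None: rank checkpoints but never delete
        self.delete_fn = delete_fn
        self._heap = []  # min-heap, worst retained checkpoint at index 0
        self._counter = itertools.count()  # tie-breaker for equal scores

    def add(self, path, score):
        heapq.heappush(self._heap, (score, next(self._counter), path))
        if self.keep_num is not None and len(self._heap) > self.keep_num:
            # Evict the lowest-scoring checkpoint, which may be the new one.
            _, _, worst = heapq.heappop(self._heap)
            self.delete_fn(worst)

    def best(self):
        return max(self._heap)[2] if self._heap else None


deleted = []
ranker = CheckpointRanker(keep_num=2, delete_fn=deleted.append)
for path, score in [("ckpt_1", 0.3), ("ckpt_2", 0.9),
                    ("ckpt_3", 0.5), ("ckpt_4", 0.1)]:
    ranker.add(path, score)

# Ranking without a bound: nothing is ever deleted.
unbounded_deleted = []
unbounded = CheckpointRanker(delete_fn=unbounded_deleted.append)
for path, score in [("ckpt_a", 0.2), ("ckpt_b", 0.8)]:
    unbounded.add(path, score)
```

Leaving `keep_num` as `None` mirrors the case of ranking by a score attribute without bounding how many checkpoints are stored.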

Other

Misc fixes for bugs that cause incorrect recovery (e.g. not setting the new node IP on a recovered trial).
Improved/more helpful logging messages.

TODO

add tests

Related issue number

Closes #5127, #5549 and #5827
Also potentially #4784

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.

python/ray/tune/util.py

richardliaw · 2019-10-10T20:27:05Z

Wait why not just use the built-in queue in Python? The locking shouldn’t make a huge difference here right?

…

On Thu, Oct 10, 2019 at 11:57 AM Ujval Misra wrote:

In python/ray/tune/util.py:

@@ -97,6 +98,43 @@ def stop(self):
         self.stopped = True
+class PriorityQueue(object):

i can move all of it into Trainable, just a bit cleaner this way

python/ray/tune/ray_trial_executor.py

richardliaw · 2019-10-10T21:44:35Z

No, I just meant like only the checkpoint deletion stuff.

…

On Thu, Oct 10, 2019 at 1:48 PM Ujval Misra wrote:

In python/ray/tune/ray_trial_executor.py:

@@ -115,7 +112,13 @@ def logger_creator(config):
         # Logging for trials is handled centrally by TrialRunner, so
         # configure the remote runner to use a noop-logger.
-        return cls.remote(config=trial.config, logger_creator=logger_creator)
+        remote_runner = cls.remote(trial.config, logger_creator=logger_creator)
+        tune_config = {
+            "keep_checkpoints_num": trial.keep_checkpoints_num,
+            "checkpoint_score_attr": trial.checkpoint_score_attr,
+        }
+        remote_runner._tune_setup.remote(tune_config)

Well so we'd have to then pull out all of save, save_to_object, restore and restore_from_object. There's no reason we can't do that, it just seemed like a lot for this PR but I can make those changes.

richardliaw · 2019-10-11T18:55:43Z

python/ray/tune/trial.py

do we need this logic here?

We need to keep track of the best checkpoint for recovery. This is just the previous compare_checkpoints plus the checkpoint-setting code previously in ray_trial_executor::save, written as a single function.

AmplabJenkins · 2019-10-11T23:21:53Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/17590/
Test FAILed.

AmplabJenkins · 2019-10-27T06:23:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/17948/
Test FAILed.

AmplabJenkins · 2019-10-27T07:15:19Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/17949/
Test FAILed.

richardliaw · 2019-10-27T08:50:05Z

python/ray/tune/ray_trial_executor.py

we don't need to commit memory checkpoints i think

yeah i wasn't, just a func naming issue—changed to be more clear

AmplabJenkins · 2019-10-31T11:07:58Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18028/
Test FAILed.

AmplabJenkins · 2019-10-31T23:21:16Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18029/
Test FAILed.

AmplabJenkins · 2019-11-01T05:54:49Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18040/
Test FAILed.

AmplabJenkins · 2019-11-01T15:01:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18053/
Test FAILed.

AmplabJenkins · 2019-11-01T17:35:41Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18054/
Test FAILed.

AmplabJenkins · 2019-11-01T19:22:06Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18056/
Test FAILed.

richardliaw · 2019-11-01T23:42:51Z

jenkins test tune

AmplabJenkins · 2019-11-02T00:36:16Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Tune-Tests/280/
Tune tests passed.

AmplabJenkins · 2019-11-04T19:58:07Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18132/
Test FAILed.

AmplabJenkins · 2019-11-04T22:20:34Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18142/
Test FAILed.

ujvl · 2019-11-15T23:53:30Z

jenkins test tune

AmplabJenkins · 2019-11-16T00:34:55Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Tune-Tests/306/
Tune tests failed.

AmplabJenkins · 2019-11-16T00:39:40Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18506/
Test PASSed.

AmplabJenkins · 2019-11-16T02:00:41Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18509/
Test PASSed.

python/ray/tune/checkpoint_manager.py

python/ray/tune/util.py

richardliaw · 2019-11-16T22:19:59Z

python/ray/worker.py

-            # TODO(ujvl): Remove check when local mode moved to core worker.
-            if timeout is not None:
-                raise ValueError(
-                    "`get` must be called with timeout=None in local mode.")


can you give me context for this change?

i added it in the ray.get PR but realized it actually isn't necessary—for local mode it's just going to return immediately. This is also consistent with calling wait w/ timeout.

python/ray/tune/tests/test_checkpoint_manager.py

richardliaw · 2019-11-17T05:01:17Z

also, lint is failing

AmplabJenkins · 2019-11-18T05:49:28Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18554/
Test PASSed.

ujvl changed the title ~~Trigger checkpoint deletes locally in Trainable~~ [tune] [WIP] Trigger checkpoint deletes locally in Trainable Oct 10, 2019

ujvl force-pushed the tune-delete-checkpoint-locally branch from e0fc525 to ec04ff1 Compare October 10, 2019 01:02

ujvl requested a review from richardliaw October 10, 2019 03:04

richardliaw reviewed Oct 10, 2019

python/ray/tune/util.py (outdated)

richardliaw reviewed Oct 10, 2019

python/ray/tune/ray_trial_executor.py (outdated)

richardliaw reviewed Oct 11, 2019

ujvl changed the title ~~[tune] [WIP] Trigger checkpoint deletes locally in Trainable~~ [tune] [WIP] Checkpoint commits and garbage collection Oct 27, 2019

richardliaw reviewed Oct 27, 2019

ujvl force-pushed the tune-delete-checkpoint-locally branch from ce4f3ca to a1670e3 Compare October 31, 2019 07:33

ujvl force-pushed the tune-delete-checkpoint-locally branch from 3317c8d to 924f49c Compare October 31, 2019 23:42

ujvl force-pushed the tune-delete-checkpoint-locally branch from 00a5e55 to 8aefc30 Compare November 1, 2019 13:47

ujvl changed the title ~~[tune] [WIP] Checkpoint commits and garbage collection~~ [tune] Checkpoint commits and garbage collection Nov 2, 2019

ujvl force-pushed the tune-delete-checkpoint-locally branch 3 times, most recently from 3e76e44 to 0bdaf09 Compare November 4, 2019 20:47

ujvl added 11 commits November 15, 2019 13:28

Move handling of no available trials to ray_trial_executor (#1)

45d044b

Fix formatting bug, lint.

295e4ec

Addressed Richard's comments

09a4530

Revert tests.

cb0da32

fix rebase

cbe5042

Fix trial location reporting.

6cfb818

Fix test

eb95599

Fix lint

1e44d8b

Rebase, use ray.get w/ timeout, lint.

36ae6b8

lint

fdeb7ea

fix rebase

e9336ec

ujvl force-pushed the tune-delete-checkpoint-locally branch from 5d11f47 to e9336ec Compare November 15, 2019 21:30

richardliaw reviewed Nov 16, 2019

python/ray/tune/checkpoint_manager.py (outdated)

richardliaw reviewed Nov 16, 2019

python/ray/tune/util.py (outdated)

richardliaw reviewed Nov 16, 2019

python/ray/tune/tests/test_checkpoint_manager.py

ujvl added 2 commits November 17, 2019 15:49

Address richard's comments

59ce20f

Merge branch 'master' into tune-delete-checkpoint-locally

4b32cc1

ujvl added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") Nov 18, 2019

richardliaw approved these changes Nov 18, 2019

richardliaw merged commit 2965dc1 into ray-project:master Nov 18, 2019

ujvl mentioned this pull request Nov 18, 2019

[WIP] [tune] Fix logging after trial failure and restore #5264

Closed

2 tasks

ujvl deleted the tune-delete-checkpoint-locally branch December 13, 2019 00:04
