[tune] Fault tolerance improvements#5877
Conversation
e0fc525 to
ec04ff1
Compare
|
Wait why not just use the built-in queue in Python? The locking shouldn’t
make a huge difference here right?
…On Thu, Oct 10, 2019 at 11:57 AM Ujval Misra ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In python/ray/tune/util.py
<#5877 (comment)>:
> @@ -97,6 +98,43 @@ def stop(self):
self.stopped = True
+class PriorityQueue(object):
i can move all of it into Trainable, just a bit cleaner this way
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#5877?email_source=notifications&email_token=ABCRZZP3O67PHZWK7GHXLO3QN53I7A5CNFSM4I7GMHMKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCHS7P7I#discussion_r333683900>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABCRZZKQ6Q3SBOZU72FNBILQN53I7ANCNFSM4I7GMHMA>
.
|
|
No, I just meant like only the checkpoint deletion stuff.
…On Thu, Oct 10, 2019 at 1:48 PM Ujval Misra ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In python/ray/tune/ray_trial_executor.py
<#5877 (comment)>:
> @@ -115,7 +112,13 @@ def logger_creator(config):
# Logging for trials is handled centrally by TrialRunner, so
# configure the remote runner to use a noop-logger.
- return cls.remote(config=trial.config, logger_creator=logger_creator)
+ remote_runner = cls.remote(trial.config, logger_creator=logger_creator)
+ tune_config = {
+ "keep_checkpoints_num": trial.keep_checkpoints_num,
+ "checkpoint_score_attr": trial.checkpoint_score_attr,
+ }
+ remote_runner._tune_setup.remote(tune_config)
Well so we'd have to then pull out all of save, save_to_object, restore
and restore_from_object. There's no reason we can't do that, it just
seemed like a lot for this PR but I can make those changes.
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#5877?email_source=notifications&email_token=ABCRZZLEBA5QINF57RX3CSTQN6IJ3A5CNFSM4I7GMHMKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCHTNZ4I#discussion_r333728302>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABCRZZPNBHHFRTEPK2JDDSLQN6IJ3ANCNFSM4I7GMHMA>
.
|
python/ray/tune/trial.py
Outdated
There was a problem hiding this comment.
do we need this logic here?
There was a problem hiding this comment.
We need to keep track of the best checkpoint for recovery, this is just the previous compare_checkpoints + the checkpoint-setting code previously in ray_trial_executor::save written as a single function.
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
There was a problem hiding this comment.
we dont need to commit memory checkpoints i think
There was a problem hiding this comment.
yeah i wasn't, just a func naming issue—changed to be more clear
ce4f3ca to
a1670e3
Compare
|
Test FAILed. |
|
Test FAILed. |
3317c8d to
924f49c
Compare
|
Test FAILed. |
00a5e55 to
8aefc30
Compare
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
|
jenkins test tune |
|
Test PASSed. |
|
Test FAILed. |
3e76e44 to
0bdaf09
Compare
|
Test FAILed. |
5d11f47 to
e9336ec
Compare
|
jenkins test tune |
|
Test FAILed. |
|
Test PASSed. |
|
Test PASSed. |
| # TODO(ujvl): Remove check when local mode moved to core worker. | ||
| if timeout is not None: | ||
| raise ValueError( | ||
| "`get` must be called with timeout=None in local mode.") |
There was a problem hiding this comment.
can you give me context for this change?
There was a problem hiding this comment.
i added it in the ray.get PR but realized it actually isn't necessary—for local mode it's just going to return immediately. This is also consistent with calling wait w/ timeout.
|
also, lint is failing |
|
Test PASSed. |
Why are these changes needed?
Checkpoint committing
Checkpoint GC
checkpoint_score_attrwithoutkeep_checkpoints_numset (ie if you want to rank your best checkpoints differently but don't need to bound the # of them stored).Other
TODO
Related issue number
Closes #5127, #5549 and #5827
Also potentially #4784
Checks
scripts/format.shto lint the changes in this PR.