Skip to content

[tune] Fault tolerance improvements#5877

Merged
richardliaw merged 37 commits intoray-project:masterfrom
ujvl:tune-delete-checkpoint-locally
Nov 18, 2019
Merged

[tune] Fault tolerance improvements#5877
richardliaw merged 37 commits intoray-project:masterfrom
ujvl:tune-delete-checkpoint-locally

Conversation

@ujvl
Copy link
Contributor

@ujvl ujvl commented Oct 10, 2019

Why are these changes needed?

Checkpoint committing

  • Checkpointing fails when workers die during a checkpoint. This is because the worker returns the path before the checkpoint is pulled onto the driver. To fix this, the driver now rsyncs down the checkpoint synchronously by default and waits before setting the newest checkpoint.
  • Note that this may add significant latency on the driver's critical path so it can be turned off (relaxed trial FT guarantees).

Checkpoint GC

  • Checkpoints are now deleted automatically post-sync, using an rsync flag.
  • Checkpoints on the driver are garbage collected according to the policy defined by the user.
  • This PR also fixes how checkpoints are ranked so that the wrong checkpoint isn't deleted (using a PQ).
  • It also allows using checkpoint_score_attr without keep_checkpoints_num set (ie if you want to rank your best checkpoints differently but don't need to bound the # of them stored).

Other

  • Misc bug fixes that cause incorrect recovery (eg: not setting new the node IP on a recovered trial)
  • Improved/more helpful logging messages.

TODO

  • add tests

Related issue number

Closes #5127, #5549 and #5827
Also potentially #4784

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.

@ujvl ujvl changed the title Trigger checkpoint deletes locally in Trainable [tune] [WIP] Trigger checkpoint deletes locally in Trainable Oct 10, 2019
@ujvl ujvl force-pushed the tune-delete-checkpoint-locally branch from e0fc525 to ec04ff1 Compare October 10, 2019 01:02
@ujvl ujvl requested a review from richardliaw October 10, 2019 03:04
@richardliaw
Copy link
Contributor

richardliaw commented Oct 10, 2019 via email

@richardliaw
Copy link
Contributor

richardliaw commented Oct 10, 2019 via email

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this logic here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to keep track of the best checkpoint for recovery, this is just the previous compare_checkpoints + the checkpoint-setting code previously in ray_trial_executor::save written as a single function.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/17590/
Test FAILed.

@ujvl ujvl changed the title [tune] [WIP] Trigger checkpoint deletes locally in Trainable [tune] [WIP] Checkpoint commits and garbage collection Oct 27, 2019
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/17948/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/17949/
Test FAILed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont need to commit memory checkpoints i think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i wasn't, just a func naming issue—changed to be more clear

@ujvl ujvl force-pushed the tune-delete-checkpoint-locally branch from ce4f3ca to a1670e3 Compare October 31, 2019 07:33
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18028/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18029/
Test FAILed.

@ujvl ujvl force-pushed the tune-delete-checkpoint-locally branch from 3317c8d to 924f49c Compare October 31, 2019 23:42
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18040/
Test FAILed.

@ujvl ujvl force-pushed the tune-delete-checkpoint-locally branch from 00a5e55 to 8aefc30 Compare November 1, 2019 13:47
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18053/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18054/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18056/
Test FAILed.

@richardliaw
Copy link
Contributor

jenkins test tune

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Tune-Tests/280/
Tune tests passed.

@ujvl ujvl changed the title [tune] [WIP] Checkpoint commits and garbage collection [tune] Checkpoint commits and garbage collection Nov 2, 2019
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18132/
Test FAILed.

@ujvl ujvl force-pushed the tune-delete-checkpoint-locally branch 3 times, most recently from 3e76e44 to 0bdaf09 Compare November 4, 2019 20:47
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18142/
Test FAILed.

@ujvl ujvl force-pushed the tune-delete-checkpoint-locally branch from 5d11f47 to e9336ec Compare November 15, 2019 21:30
@ujvl
Copy link
Contributor Author

ujvl commented Nov 15, 2019

jenkins test tune

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Tune-Tests/306/
Tune tests failed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18506/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18509/
Test PASSed.

# TODO(ujvl): Remove check when local mode moved to core worker.
if timeout is not None:
raise ValueError(
"`get` must be called with timeout=None in local mode.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give me context for this change?

Copy link
Contributor Author

@ujvl ujvl Nov 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i added it in the ray.get PR but realized it actually isn't necessary—for local mode it's just going to return immediately. This is also consistent with calling wait w/ timeout.

@richardliaw
Copy link
Contributor

also, lint is failing

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/18554/
Test PASSed.

@ujvl ujvl added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Nov 18, 2019
@richardliaw richardliaw merged commit 2965dc1 into ray-project:master Nov 18, 2019
@ujvl ujvl deleted the tune-delete-checkpoint-locally branch December 13, 2019 00:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests-ok The tagger certifies test failures are unrelated and assumes personal liability.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[tune] keep_checkpoints_num logic should be in trainable

3 participants