
LRRT tutorial #45

Merged
ShadenSmith merged 5 commits into master from shaden/lrrt_tut
Feb 10, 2020

Conversation

@ShadenSmith
Contributor

No description provided.

@ShadenSmith ShadenSmith requested a review from tjruwase February 9, 2020 04:11
@ShadenSmith ShadenSmith added the documentation Improvements or additions to documentation label Feb 9, 2020
@ShadenSmith ShadenSmith merged commit 92514ac into master Feb 10, 2020
@ShadenSmith ShadenSmith deleted the shaden/lrrt_tut branch February 10, 2020 19:04
kouml pushed a commit to kouml/DeepSpeed that referenced this pull request Apr 3, 2020
jeffra added a commit that referenced this pull request May 19, 2020
Co-authored-by: yuxionghe <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
rraminen added a commit to rraminen/DeepSpeed that referenced this pull request Nov 18, 2021
delock referenced this pull request in delock/DeepSpeedSYCLSupport Sep 21, 2022
liamcli pushed a commit to determined-ai/DeepSpeed that referenced this pull request May 8, 2023
* Add SLURM launcher

Signed-off-by: Dashiell Stander <[email protected]>

* Need to import SlurmRunner

Signed-off-by: Dashiell Stander <[email protected]>

* Clean up the config JSON

Signed-off-by: Dashiell Stander <[email protected]>

* Properly clean up json configs

Signed-off-by: Dashiell Stander <[email protected]>

* runner

Signed-off-by: Dashiell Stander <[email protected]>

* Switch to using an argument

Signed-off-by: Dashiell Stander <[email protected]>

* Pre-commit

Signed-off-by: Dashiell Stander <[email protected]>

* Prevent clean-up when using slurm, add in hostfile

Signed-off-by: Dashiell Stander <[email protected]>

* Pass launcher in to autotuning jobs

Signed-off-by: Dashiell Stander <[email protected]>

* Pass slurm comment in

Signed-off-by: Dashiell Stander <[email protected]>

* Add a comment argument to DeepSpeed runner

Signed-off-by: Dashiell Stander <[email protected]>

* Switch slurm_comment to just comment

Signed-off-by: Dashiell Stander <[email protected]>

* Switch slurm_comment to just comment

Signed-off-by: Dashiell Stander <[email protected]>

* Use SLURM --nodelist instead of --include

Co-authored-by: Quentin Anthony <[email protected]>
Signed-off-by: Dashiell Stander <[email protected]>

* Use SLURM --nodelist instead of --include

Co-authored-by: Quentin Anthony <[email protected]>

Signed-off-by: Dashiell Stander <[email protected]>

* Launcher args

Signed-off-by: Dashiell Stander <[email protected]>

* Debug print statement...

Signed-off-by: Dashiell Stander <[email protected]>

* Debug print statements...

Signed-off-by: Dashiell Stander <[email protected]>

* Debug print statements...

Signed-off-by: Dashiell Stander <[email protected]>

* Debug print statements...

Signed-off-by: Dashiell Stander <[email protected]>

* Debug print statements...

Signed-off-by: Dashiell Stander <[email protected]>

* user_config bug

Signed-off-by: Dashiell Stander <[email protected]>

* user_config bug

Signed-off-by: Dashiell Stander <[email protected]>

* Fix config dict

* Pydantic to dict

Signed-off-by: Dashiell Stander <[email protected]>

* Pydantic to dict

Signed-off-by: Dashiell Stander <[email protected]>

* Will it work now?

Signed-off-by: Dashiell Stander <[email protected]>

* Just make it a dict immediately

Signed-off-by: Dashiell Stander <[email protected]>

* Exclude unset things

Signed-off-by: Dashiell Stander <[email protected]>

* Add dilation to pooling flops profiler

Signed-off-by: Dashiell Stander <[email protected]>

* Adding return_indices...

Signed-off-by: Dashiell Stander <[email protected]>

* Do cleanup with SLURM.

Co-authored-by: Quentin Anthony <[email protected]>

* Do cleanup with SLURM.

Co-authored-by: Quentin Anthony <[email protected]>

* Horrific hack to get metrics.json

* Push pipeline grad tail fix

* No longer hardcode path

Signed-off-by: Dashiell Stander <[email protected]>

* Also pass in no_ssh_check

Signed-off-by: Dashiell Stander <[email protected]>

* Also pass in no_ssh_check

Signed-off-by: Dashiell Stander <[email protected]>

* Also pass in master_addr

Signed-off-by: Dashiell Stander <[email protected]>

* Stop hardcoding number of steps....

Signed-off-by: Dashiell Stander <[email protected]>

* detailed flops breakdown

Signed-off-by: Dashiell Stander <[email protected]>

* Fix autotuning reporting bug

Signed-off-by: Dashiell Stander <[email protected]>

* Fix autotuning reporting bug

Signed-off-by: Dashiell Stander <[email protected]>

* Actually off by a million, not a thousand

Signed-off-by: Dashiell Stander <[email protected]>

* Clean up debugging stuff

Signed-off-by: Dashiell Stander <[email protected]>

* Add JSRunner for summit launching on multiple nodes

* import JSRUN_LAUNCHER from constants

* Fix jsrun typo

* Update multinode_runner.py (deepspeedai#45)

* add CUDA_VISIBLE_DEVICES to jsrunner

---------

Signed-off-by: Dashiell Stander <[email protected]>
Signed-off-by: Dashiell Stander <[email protected]>
Signed-off-by: Dashiell Stander <[email protected]>
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Quentin TastyRice <[email protected]>
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: MLRichter <[email protected]>
Co-authored-by: Stella Biderman <[email protected]>

Labels

documentation Improvements or additions to documentation

Projects

None yet

2 participants