
Pulling upstream #1

Merged
arashashari merged 90 commits into arashashari:master from deepspeedai:master
Sep 2, 2020

Conversation

@arashashari
Owner

No description provided.

arashashari and others added 30 commits May 18, 2020 09:33
* adding BingSquad e2e test

* updating the draft test; bring the final step under the try section

* finalizing test for base DeepSpeed and DeepSpeed with ZeRO

* applying the comment (thanks Jeff); fixed formatting
Updates for ZeRO stage 2 + ZeRO stage 1 w. RS

Co-authored-by: Tunji Ruwase <[email protected]>
Co-authored-by: Samyam Rajbhandari <[email protected]>
Co-authored-by: Shaden Smith <[email protected]>
Co-authored-by: Elton Zheng <[email protected]>
Co-authored-by: Shaden Smith <[email protected]>
Co-authored-by: yuxionghe <[email protected]>
Co-authored-by: Arash Ashari <[email protected]>
* BERT title
* updates to support fp32 grad clipping and disable max_grad_norm
* Fix for CPU memory bloating issue caused by the PyTorch backward graph created in all_gather; fixed by calling detach on tensors before calling all_gather
Contiguous Gradients should be set to false by default. It's not useful unless the model is very large
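The detach-before-all_gather fix above can be sketched as follows. This is a minimal illustration, not DeepSpeed's actual code; the local replication loop stands in for `torch.distributed.all_gather`, which needs an initialized process group:

```python
import torch

def gather_without_graph(tensor, world_size):
    # Detach first so autograd does not record the gather into the
    # backward graph -- retaining that graph was the source of the
    # CPU memory bloat this commit fixes.
    payload = tensor.detach()
    # Stand-in for torch.distributed.all_gather: replicate locally.
    buffers = [torch.empty_like(payload) for _ in range(world_size)]
    for buf in buffers:
        buf.copy_(payload)
    return buffers

grad = torch.ones(4, requires_grad=True)
gathered = gather_without_graph(grad, world_size=2)
# None of the gathered buffers carry autograd history.
```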
* add support for predivide as a flag
* add predivide json config, remove allgather_disable (as it's not currently used anymore)
Co-authored-by: Shaden Smith <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
* fix: typo in code docs

* more pythonic code
* Transformer kernels release

Co-authored-by: Shaden Smith <[email protected]>
Co-authored-by: Elton Zheng <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: RezaYazdaniAminabadi <[email protected]>
Co-authored-by: Tunji Ruwase <[email protected]>
Co-authored-by: Shaden Smith <[email protected]>
Co-authored-by: Shaden Smith <[email protected]>
Co-authored-by: Samyam Rajbhandari <[email protected]>
Co-authored-by: Shaden Smith <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Samyam Rajbhandari <[email protected]>
Co-authored-by: Shaden Smith <[email protected]>
Co-authored-by: Elton Zheng <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: RezaYazdaniAminabadi <[email protected]>
Co-authored-by: Tunji Ruwase <[email protected]>
Co-authored-by: Shaden Smith <[email protected]>
Co-authored-by: Samyam Rajbhandari <[email protected]>
jeffra and others added 29 commits July 24, 2020 10:21
* fix nv_peer_mem version in dockerfile

* fix security issue, remove pillow dependency (it is only needed for the cifar example, which has its own requirements.txt)
The mpu object is bound to the class instance: the if statement checks `self.mpu`, but the following lines call the bare name `mpu`.

This raises a NameError.
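The bug pattern described above can be reproduced with a minimal sketch; the class and method names here are hypothetical, not DeepSpeed's actual API:

```python
class Engine:
    def __init__(self, mpu=None):
        self.mpu = mpu  # model-parallel utility bound to the instance

    def broken(self):
        # Bug pattern: the guard checks self.mpu, but the body references
        # the bare name `mpu`, which is undefined in this scope.
        if self.mpu is not None:
            return mpu.get_model_parallel_rank()  # NameError at runtime
        return 0

    def fixed(self):
        # Fix: consistently go through the bound attribute.
        if self.mpu is not None:
            return self.mpu.get_model_parallel_rank()
        return 0

class _StubMPU:
    def get_model_parallel_rank(self):
        return 3

engine = Engine(mpu=_StubMPU())
fixed_rank = engine.fixed()
try:
    engine.broken()
    broken_raised = False
except NameError:
    broken_raised = True
```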
The parentheses alter the evaluation of the assert: the condition becomes a non-empty tuple, which is always truthy, so the assert always passes.
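A quick illustration of the tuple-in-assert pitfall (the variable here is hypothetical):

```python
value = -1

# A non-empty tuple is always truthy, so wrapping the condition and the
# message in parentheses silently disables the assertion.
try:
    assert (value >= 0, "value must be non-negative")  # tuple -> always True
    tuple_assert_fired = False
except AssertionError:
    tuple_assert_fired = True

# Correct form: bare condition, message after the comma.
try:
    assert value >= 0, "value must be non-negative"
    plain_assert_fired = False
except AssertionError:
    plain_assert_fired = True
```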
Add webinar on-demand links and update readme
* add fix and tests for get_lr from lr_scheduler before training starts
* update fan out flag for pdsh
* turn off multi-node launch if only 1 node
* Update deepspeed_checkpointing.py

* formatting

Co-authored-by: Jeff Rasley <[email protected]>
* Adding gradient accumulation support for ZeRO Stage 2. Changing all Megatron-LM tests to also test gradient accumulation

* Gradient Accumulation support for Stage 2. Model tests added to test the feature

* formatting

* Update deepspeed_light.py

removing comment

* Update ds_config_func_bs8_zero1.json

reverting this file back; it's not needed for this PR

* defining baseline prefix

Co-authored-by: Jeff Rasley <[email protected]>
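The gradient accumulation semantics described above can be sketched in plain Python. This is an illustrative model of averaging gradients over micro-batches, not DeepSpeed's ZeRO Stage 2 implementation:

```python
def accumulate_gradients(micro_grads, gas):
    # Sum per-parameter gradients over `gas` micro-batches and scale by
    # 1/gas so the single optimizer step matches one large batch.
    total = [0.0] * len(micro_grads[0])
    for grads in micro_grads:
        for i, g in enumerate(grads):
            total[i] += g
    return [g / gas for g in total]

# Two micro-batches, each with gradients for two parameters.
averaged = accumulate_gradients([[1.0, 2.0], [3.0, 4.0]], gas=2)
```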
Renaming config files to gas3
* Sparse attn + ops/runtime refactor + v0.3.0

Co-authored-by: Arash Ashari <[email protected]>

Co-authored-by: Arash Ashari <[email protected]>
Remove llvm/cmake install for now, causing pyyaml issues
arashashari merged commit a2984d0 into arashashari:master on Sep 2, 2020
