Describe the bug
I'm trying to get DeepSpeed ZeRO-Infinity to run with NVMe offloading. I initially got an assertion error which I believe is similar to this AsyncIO Error. I followed the guidelines in that thread and reduced the max_in_cpu size to a multiple of 512, which made that error go away.
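For context, the fix from that thread boils down to rounding the element count down to the nearest multiple of 512; a minimal sketch of the alignment logic (the helper below is my own, not a DeepSpeed API):

# Minimal sketch: round a buffer size down to the nearest multiple of 512,
# since the libaio-backed NVMe swapper appears to require sizes aligned to
# the 512-byte sector size.
AIO_ALIGNMENT = 512

def align_down(num_elements: int, alignment: int = AIO_ALIGNMENT) -> int:
    """Largest multiple of `alignment` that is <= num_elements."""
    return (num_elements // alignment) * alignment

print(align_down(100_000_000))  # 99999744, a valid 512-aligned max_in_cpu

However, I now receive the following error: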
File "run_summarization.py", line 799, in <module>
main()
File "run_summarization.py", line 677, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1422, in train
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 2027, in training_step
loss = self.deepspeed.backward(loss)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/engine.py", line 1667, in backward
self.optimizer.backward(loss)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 2793, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1774, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 2049, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1810, in reduce_independent_p_g_buckets_and_remove_grads
self.__reduce_and_partition_ipg_grads()
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1868, in __reduce_and_partition_ipg_grads
Traceback (most recent call last):
File "run_summarization.py", line 799, in <module>
self.__partition_grads(self.__params_in_ipg_bucket, grad_partitions)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1984, in __partition_grads
grad_partition.numel())
RuntimeError: start (0) + length (1048576) exceeds dimension size (1).
main()
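This looks like PyTorch's narrow() bounds check firing: a gradient partition of 1048576 elements is being copied into a buffer that only holds 1 element. A standalone repro of the same RuntimeError, purely to illustrate the message (not the actual DeepSpeed code path):

import torch

# Illustration only: ask narrow() for 1048576 elements from a tensor that
# holds just 1, mirroring the size mismatch between the gradient partition
# and the destination buffer.
buf = torch.zeros(1)
buf.narrow(0, 0, 1048576)
# RuntimeError: start (0) + length (1048576) exceeds dimension size (1).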
To Reproduce
Steps to reproduce the behavior:
- git clone https://github.com/huggingface/transformers.git
- huggingface-cli login
- sed -i 's/load_optimizer_states=True/load_optimizer_states=False/g' ../transformers/src/transformers/trainer.py
- sed -i 's/load_lr_scheduler_states=True/load_lr_scheduler_states=False/g' ../transformers/src/transformers/trainer.py
- create a JSON file called ds_config_zero3.json with the following DeepSpeed variables assigned (a quick alignment check on max_in_cpu follows the command below):
{
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "nvme",
"nvme_path": "../../workspace",
"pin_memory": true,
"buffer_count": 4,
"fast_init": false
},
"offload_param": {
"device": "nvme",
"nvme_path": "../../workspace",
"max_in_cpu": 99876864
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
- run the following command:
deepspeed transformers/examples/pytorch/summarization/run_summarization.py \
--deepspeed ds_config_zero3.json \
--model_name_or_path allenai/led-large-16384 \
--per_device_train_batch_size 2 \
--output_dir output_dir \
--overwrite_output_dir \
--do_train \
--predict_with_generate \
--report_to wandb \
--load_best_model_at_end True \
--greater_is_better True \
--evaluation_strategy steps \
--metric_for_best_model rouge_average \
--pad_to_max_length True \
--max_source_length 1024 \
--generation_max_length 512 \
--save_steps 1200 \
--eval_steps 400 \
--logging_steps 400 \
--dataset_name kaizan/amisum_v1 \
--learning_rate 0.00005 \
--num_train_epochs 10 \
--weight_decay 0.5
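As for the alignment check mentioned above: max_in_cpu = 99876864 in the config is indeed a multiple of 512. A quick script to verify (my own check, not a DeepSpeed utility):

import json

# Verify that the NVMe offload buffer size in the config is 512-aligned,
# since the async I/O path seems to require sector-aligned sizes.
with open("ds_config_zero3.json") as f:
    cfg = json.load(f)

max_in_cpu = cfg["zero_optimization"]["offload_param"]["max_in_cpu"]
print(max_in_cpu, "is 512-aligned:", max_in_cpu % 512 == 0)
# prints: 99876864 is 512-aligned: True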
Expected behavior
I expected the run to download the model, parallelise across the 4 GPUs, and then start training whilst offloading parameters to NVMe storage.
ds_report output
[2022-06-08 20:49:19,034] [WARNING] [partition_parameters.py:54:<module>] unable to find torch.distributed._all_gather_base. will fall back to torch.distributed.all_gather which will result in suboptimal performance. please consider upgrading your pytorch installation.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.6/dist-packages/torch']
torch version .................... 1.8.0
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed install path ........... ['/usr/local/lib/python3.6/dist-packages/deepspeed']
deepspeed info ................... 0.6.1, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 10.2, hip 0.0
System info (please complete the following information):
- OS = Linux
- GPU count = 4 (Tesla V100S)
- Python = Python 3.6.9
Launcher context
deepspeed launcher
Docker context
N/A
Additional context
N/A