[BUG] RuntimeError: start (0) + length () exceeds dimension size (1).

**Describe the bug**
I'm trying to run DeepSpeed ZeRO-Infinity with NVMe offloading. I initially got an assertion error that I believe is similar to this [AsyncIO Error](https://github.com/microsoft/DeepSpeed/issues/1070). Following the guidance in that thread, I reduced `max_in_cpu` to a multiple of 512, which cleared the assertion error. However, I now get the following error:
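
For reference, here is a minimal sketch of the alignment adjustment described above. The 512 figure comes from the linked thread; the helper name `align_down` and the unaligned example value are illustrative assumptions and not part of DeepSpeed.

```python
def align_down(value: int, alignment: int = 512) -> int:
    """Round value down to the nearest multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (value // alignment) * alignment


# The max_in_cpu value used in the config in this report is already aligned.
max_in_cpu = 99876864
assert max_in_cpu % 512 == 0
assert align_down(max_in_cpu) == max_in_cpu

# A hypothetical unaligned starting value gets rounded down.
print(align_down(100_000_000))
```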

```
  File "run_summarization.py", line 799, in <module>
    main()
  File "run_summarization.py", line 677, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1422, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 2027, in training_step
    loss = self.deepspeed.backward(loss)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/engine.py", line 1667, in backward
    self.optimizer.backward(loss)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 2793, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1774, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 2049, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1810, in reduce_independent_p_g_buckets_and_remove_grads
    self.__reduce_and_partition_ipg_grads()
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1868, in __reduce_and_partition_ipg_grads
    self.__partition_grads(self.__params_in_ipg_bucket, grad_partitions)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage3.py", line 1984, in __partition_grads
    grad_partition.numel())
RuntimeError: start (0) + length (1048576) exceeds dimension size (1).
```
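
If I read the last frame correctly, a `narrow`-style slice is asking for 1048576 elements from a destination buffer that only holds 1. The sketch below is a pure-Python stand-in for that kind of bounds check, not DeepSpeed or PyTorch code; `narrow_1d` and the variable names are hypothetical.

```python
def narrow_1d(buffer, start, length):
    """Return buffer[start:start + length], refusing out-of-range slices.

    Hypothetical stand-in for the bounds check behind the reported error.
    """
    size = len(buffer)
    if start + length > size:
        raise RuntimeError(
            f"start ({start}) + length ({length}) exceeds dimension size ({size})."
        )
    return buffer[start:start + length]


# Destination buffer holds a single element, as in the reported error.
grad_buffer = [0.0]
# The incoming gradient partition has 1048576 elements.
grad_partition_numel = 1048576

try:
    narrow_1d(grad_buffer, 0, grad_partition_numel)
except RuntimeError as err:
    message = str(err)
    print(message)
```

In other words, the gradient partition buffer for that parameter looks like it was allocated with size 1 while the gradient being copied into it is much larger.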

**To Reproduce**
Steps to reproduce the behavior:
1. `git clone https://github.com/huggingface/transformers.git`
2. `huggingface-cli login`
3. `sed -i 's/load_optimizer_states=True/load_optimizer_states=False/g' ../transformers/src/transformers/trainer.py`
4. `sed -i 's/load_lr_scheduler_states=True/load_lr_scheduler_states=False/g' ../transformers/src/transformers/trainer.py`
5. Create a JSON file called `ds_config_zero3.json` (the name passed to `--deepspeed` in step 6) with the following DeepSpeed settings:
```
{

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "../../workspace",
            "pin_memory": true,
            "buffer_count": 4,
            "fast_init": false
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "../../workspace",
            "max_in_cpu": 99876864
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```
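
A quick sanity check of the NVMe offload settings can rule out path and alignment problems before launching. This is only a sketch: `check_nvme_offload` is a hypothetical helper, not a DeepSpeed API, and the multiple-of-512 rule is the one taken from the linked thread.

```python
import os


def check_nvme_offload(config: dict, base_dir: str = ".") -> list:
    """Return a list of human-readable problems found in the offload settings."""
    problems = []
    zero = config.get("zero_optimization", {})
    for section in ("offload_optimizer", "offload_param"):
        offload = zero.get(section, {})
        if offload.get("device") != "nvme":
            continue
        path = offload.get("nvme_path")
        if not path:
            problems.append(f"{section}: nvme_path is missing")
            continue
        # Relative paths are resolved against base_dir (the launch directory).
        resolved = os.path.abspath(os.path.join(base_dir, path))
        if not os.path.isdir(resolved):
            problems.append(f"{section}: {resolved} is not a directory")
    max_in_cpu = zero.get("offload_param", {}).get("max_in_cpu")
    if max_in_cpu is not None and int(max_in_cpu) % 512 != 0:
        problems.append("offload_param: max_in_cpu is not a multiple of 512")
    return problems


# Trimmed copy of the offload sections from the config above.
example = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "../../workspace"},
        "offload_param": {
            "device": "nvme",
            "nvme_path": "../../workspace",
            "max_in_cpu": 99876864,
        },
    }
}
print(check_nvme_offload(example) or "no problems found")
```

Because `nvme_path` is relative here, the result depends on the directory the launcher is started from.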

6. Run the following command:
```
deepspeed transformers/examples/pytorch/summarization/run_summarization.py \
    --deepspeed ds_config_zero3.json \
    --model_name_or_path allenai/led-large-16384 \
    --per_device_train_batch_size 2 \
    --output_dir output_dir \
    --overwrite_output_dir \
    --do_train \
    --predict_with_generate \
    --report_to wandb \
    --load_best_model_at_end True \
    --greater_is_better True \
    --evaluation_strategy steps \
    --metric_for_best_model rouge_average \
    --pad_to_max_length True \
    --max_source_length 1024 \
    --generation_max_length 512 \
    --save_steps 1200 \
    --eval_steps 400 \
    --logging_steps 400 \
    --dataset_name kaizan/amisum_v1 \
    --learning_rate 0.00005 \
    --num_train_epochs 10 \
    --weight_decay 0.5
```

**Expected behavior**
The run should download the model, parallelise across 4 GPUs, and then start training while offloading parameters to NVMe storage.

**ds_report output**

```
[2022-06-08 20:49:19,034] [WARNING] [partition_parameters.py:54:<module>] unable to find torch.distributed._all_gather_base. will fall back to torch.distributed.all_gather which will result in suboptimal performance. please consider upgrading your pytorch installation.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.6/dist-packages/torch']
torch version .................... 1.8.0
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed install path ........... ['/usr/local/lib/python3.6/dist-packages/deepspeed']
deepspeed info ................... 0.6.1, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.8, cuda 10.2, hip 0.0
```
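
Since NVMe offload goes through the `async_io` op, it can help to pull the op table out of the `ds_report` output and look at that row directly. The parser below is a rough sketch written against the line format shown above; `parse_op_report` is a hypothetical helper, not something DeepSpeed ships.

```python
import re

# Matches rows such as: "async_io ............... [NO] ....... [OKAY]"
OP_LINE = re.compile(r"^(\w+) \.+ \[(\w+)\] \.+ \[(\w+)\]$")


def parse_op_report(text: str) -> dict:
    """Map op name -> {'installed': ..., 'compatible': ...} as raw strings."""
    ops = {}
    for line in text.splitlines():
        match = OP_LINE.match(line.strip())
        if match:
            name, installed, compatible = match.groups()
            ops[name] = {"installed": installed, "compatible": compatible}
    return ops


sample = """\
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [NO]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
"""
ops = parse_op_report(sample)
print(ops["async_io"])
```

In the report above, `async_io` is listed as compatible but not pre-installed, so it would be JIT-compiled at runtime.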

**System info (please complete the following information):**
 - OS = Linux
 - GPU count = 4 (Tesla V100S)
 - Python = 3.6.9

**Launcher context**
`deepspeed` launcher

**Docker context**
N/A

**Additional context**
N/A

