Hi,
I want to use TAP to pretrain a model on my own dataset, and I prepared the dataset following your data format.
However, when I pretrain the model in the distributed setting (training on a single GPU works fine), I encounter the following error:
2022-04-15T14:13:50 INFO: m4c_textvqa:, 73100/96000, train/total_loss: 1.6139 (2.9855), train/m4c_textvqa/pretrainonly_m4c_decoding_bce_with_mask: 1.6139 (2.9855), train/m4c_textvqa/maskpred_accuracy: 0.8486 (0.7797), val/total_loss: 4.3474, val/m4c_textvqa/pretrainonly_m4c_decoding_bce_with_mask: 4.3474 (4.3474), val/m4c_textvqa/maskpred_accuracy: 0.7328, max mem: 7456.0, lr: 0.00001, time: 02m 47s 324ms, eta: 10h 43m 43s 839ms
2022-04-15T14:13:50 INFO: Batch Size of one GPU:16
2022-04-15T14:14:40 ERROR: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:514)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7ff58f8d1193 in /home/pai/envs/vqa/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x731 (0x7ff5dae6ff81 in /home/pai/envs/vqa/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0xa0f14a (0x7ff5dae5c14a in /home/pai/envs/vqa/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x2961c4 (0x7ff5da6e31c4 in /home/pai/envs/vqa/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: _PyCFunction_FastCallDict + 0x262 (0x56330c484562 in /home/pai/envs/vqa/bin/python)
frame #5: <unknown function> + 0x183135 (0x56330c4b0135 in /home/pai/envs/vqa/bin/python)
...
The training loss drops as expected, but after many iterations (73,100 in the case above) the error occurs. This is very strange, since this kind of error would normally appear as soon as training starts: if the model had parameters that never contributed to the loss, DDP should complain on the very first backward pass, not tens of thousands of iterations in. An error this late makes me suspect a data-dependent code path that leaves some parameters unused only for certain batches.
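For reference, this is roughly what the workaround suggested by the error message would look like. It is only a minimal sketch: in TAP the DDP wrapping happens inside the trainer, and the model and process-group setup below are hypothetical stand-ins.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the process group has already been initialized, e.g. by the
# launcher: torch.distributed.init_process_group(backend="nccl")

local_device = torch.cuda.current_device()
model = nn.Linear(10, 10).to(local_device)  # hypothetical stand-in for the TAP model

# find_unused_parameters=True makes DDP detect parameters that receive no
# gradient in a given iteration and skip their reduction, instead of raising
# this error. It adds per-iteration overhead and hides, rather than fixes,
# whatever leaves those parameters unused.
model = DDP(model, device_ids=[local_device], find_unused_parameters=True)
```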
Have you ever encountered this problem? Or could you help me figure out how to solve it?
Thanks very much.
Kang