DDP

DistributedDataParallel

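Description

A minimal implementation of distributed data parallel (DDP) training.
Works for both single-node multi-GPU and multi-node multi-GPU training.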

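Usage

Example

// Training starts only after the command has been run on every node.
// As the multi-node example shows, --ip is the master node's IP (the node's own IP in a single-node run).
python ddp.py --nodes <number of nodes> --gpus <GPUs per node> --nr <index of this node> --ip <master node IP>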

Single node, multiple GPUs

Node IP = 192.168.3.8

Shell: CUDA_VISIBLE_DEVICES=0,1 python ddp.py --nodes 1 --gpus 2 --nr 0 --ip 192.168.3.8

Multiple nodes, multiple GPUs

Master node IP = 192.168.3.8

Master node shell: CUDA_VISIBLE_DEVICES=0,1 python ddp.py --nodes 2 --gpus 2 --nr 0 --ip 192.168.3.8
Secondary node shell: CUDA_VISIBLE_DEVICES=0,1 python ddp.py --nodes 2 --gpus 2 --nr 1 --ip 192.168.3.8
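The sketch below illustrates how flags like these are commonly turned into a world size and per-process global ranks. The flag names come from the commands above; the default values and the mapping `rank = nr * gpus + local_gpu` are assumptions based on the usual DDP launch pattern, not a copy of what `ddp.py` actually does.

```python
import argparse


def parse_args(argv=None):
    # Flag names mirror the commands above; the defaults are assumptions.
    parser = argparse.ArgumentParser()
    parser.add_argument("--nodes", type=int, default=1, help="number of nodes")
    parser.add_argument("--gpus", type=int, default=1, help="GPUs per node")
    parser.add_argument("--nr", type=int, default=0, help="index of this node")
    parser.add_argument("--ip", type=str, default="127.0.0.1", help="master node IP")
    return parser.parse_args(argv)


def world_size(args):
    # Total number of processes: one per GPU across all nodes.
    return args.nodes * args.gpus


def global_rank(args, local_gpu):
    # Assumed mapping: ranks are laid out node by node.
    return args.nr * args.gpus + local_gpu


# Secondary node of the two-node example above.
args = parse_args(["--nodes", "2", "--gpus", "2", "--nr", "1", "--ip", "192.168.3.8"])
ranks = [global_rank(args, gpu) for gpu in range(args.gpus)]
print(world_size(args), ranks)
```

Under this mapping the master node (`--nr 0`) would own ranks 0 and 1, and the secondary node ranks 2 and 3, which is why every node has to be launched before training can begin.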

Notes and common issues

batch_size

Effective batch size = per-GPU batch size * total number of GPUs
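As a quick worked example of the formula, the snippet below computes the effective batch size for the two launch configurations shown earlier. The per-GPU batch size of 32 is a hypothetical value chosen for illustration.

```python
def effective_batch_size(per_gpu_batch, nodes, gpus_per_node):
    # Every GPU processes its own mini-batch each step, so the batches add up.
    return per_gpu_batch * nodes * gpus_per_node


PER_GPU_BATCH = 32  # hypothetical value

single_node = effective_batch_size(PER_GPU_BATCH, nodes=1, gpus_per_node=2)
multi_node = effective_batch_size(PER_GPU_BATCH, nodes=2, gpus_per_node=2)
print(single_node, multi_node)
```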
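
Validation and saving

Validation: make sure each process writes its log under a different name; in the end, visualize only the rank=0 log.
Saving: save the model from rank=0 only.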
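One way to apply both rules is sketched below. The helper names (`log_path`, `is_main_process`, `save_checkpoint`) and the file names are hypothetical, and the checkpoint is written as raw bytes so that the example runs without PyTorch; in a real training script the write would typically be a `torch.save` call on the model's state dict.

```python
import os
import tempfile


def log_path(log_dir, rank):
    # Put the rank in the file name so processes never overwrite each other.
    return os.path.join(log_dir, f"train_rank{rank}.log")


def is_main_process(rank):
    return rank == 0


def save_checkpoint(state, ckpt_dir, rank):
    # Only rank 0 writes; every other rank returns without touching the disk.
    if not is_main_process(rank):
        return None
    path = os.path.join(ckpt_dir, "model.ckpt")
    with open(path, "wb") as f:
        f.write(state)  # stand-in for saving the model weights
    return path


with tempfile.TemporaryDirectory() as tmp:
    log_paths = [log_path(tmp, rank) for rank in range(4)]
    saved = [save_checkpoint(b"weights", tmp, rank) for rank in range(4)]
    num_distinct_logs = len(set(log_paths))
    num_saved = sum(path is not None for path in saved)

print(num_distinct_logs, num_saved)
```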

Data loading
- If the DataLoader reads from LMDB and the following error appears:
```
TypeError: can't pickle Environment objects
```
  Fix: set num_workers=0 in the DataLoader.
- If the DataLoader reads data some other way and the following error appears:
```
AttributeError: Can't pickle local object 'DataLoader.__init__.<locals>.<lambda>'
```
  Fix: replace lambda x: Image.fromarray(x) with Image.fromarray.
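The second fix works because worker processes may need to pickle the transform, and a lambda cannot be pickled by reference while a named, importable function can. The snippet below reproduces that difference with the standard library only; `math.sqrt` is a stand-in for `Image.fromarray`, and `can_pickle` is a hypothetical helper.

```python
import math
import pickle


def can_pickle(obj):
    # Returns True if the object survives pickle.dumps, False otherwise.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_transforms():
    wrapped = lambda x: math.sqrt(x)  # local lambda: has no importable name
    direct = math.sqrt                # named function: pickled by reference
    return wrapped, direct


wrapped, direct = make_transforms()
lambda_ok = can_pickle(wrapped)
function_ok = can_pickle(direct)
print(lambda_ok, function_ok)
```

Passing the function itself also drops a needless layer of indirection, since the lambda only forwarded its argument.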

Synchronized BN

# Only supported with DDP
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
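To show what the conversion does, the sketch below applies it to a small toy model and lists the resulting layer types. The model itself is hypothetical; the conversion only swaps modules, so it runs on CPU without an initialized process group, although the converted layers are meant to be used inside a DDP-wrapped model during training.

```python
import torch.nn as nn

# Hypothetical toy model containing one ordinary BatchNorm layer.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Replace every BatchNorm layer with SyncBatchNorm; do this before wrapping in DDP.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

layer_names = [type(layer).__name__ for layer in sync_model]
print(layer_names)
```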

References

Distributed data parallel training in Pytorch (recommended!)
