This repo contains the official detection and segmentation implementation of the paper "DaViT: Dual Attention Vision Transformer" by Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, and Lu Yuan. See Introduction.md for an overview of the method.

The official implementation for image classification will be released at https://github.com/microsoft/DaViT.
Python 3, PyTorch>=1.8.0, and torchvision>=0.7.0 are required for the current codebase.
```shell
# An example on CUDA 10.2
pip install torch===1.9.0+cu102 torchvision===0.10.0+cu102 torchaudio===0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install thop pyyaml fvcore pillow==8.3.2
```
- Install mmcv and mmdet:

  ```shell
  cd mmdet
  # An example on CUDA 10.2 and pytorch 1.9
  pip install mmcv-full==1.3.0 -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.9.0/index.html
  pip install -r requirements/build.txt
  pip install -v -e .  # or "python setup.py develop"
  ```

- Prepare the dataset in data/coco/ (Format: ROOT/mmdet/data/coco/annotations, train2017, val2017):

  ```shell
  mkdir data
  ```

- Finetune on COCO:

  ```shell
  bash tools/dist_train.sh configs/davit_retinanet_1x_coco.py 8 \
      --cfg-options model.pretrained=PRETRAINED_MODEL_PATH
  ```
- Install mmcv and mmseg:

  ```shell
  cd mmseg
  # An example on CUDA 10.2 and pytorch 1.9
  pip install mmcv-full==1.3.0 -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.9.0/index.html
  pip install -e .
  ```

- Prepare the dataset in data/ade/ (Format: ROOT/mmseg/data/ADEChallengeData2016):

  ```shell
  mkdir data
  ```

- Finetune on ADE:

  ```shell
  bash tools/dist_train.sh configs/upernet_davit_512x512_160k_ade20k.py 8 \
      --options model.pretrained=PRETRAINED_MODEL_PATH
  ```

- Multi-scale Testing:

  ```shell
  bash tools/dist_test.sh configs/upernet_davit_512x512_160k_ade20k.py \
      TRAINED_MODEL_PATH 8 --aug-test --eval mIoU
  ```
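For reference, the mIoU reported by `--eval mIoU` is the per-class intersection-over-union averaged over classes. A minimal self-contained sketch of the metric (an illustration only, not MMSegmentation's implementation, which additionally handles ignore labels and accumulates over the whole dataset):

```python
def mean_iou(pred, gt, num_classes):
    """Mean IoU over classes, given flat per-pixel class-id lists.

    Classes absent from both prediction and ground truth are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy 4-pixel maps with two classes:
pred = [0, 0, 1, 1]
gt = [0, 1, 1, 1]
# class 0: IoU 1/2; class 1: IoU 2/3; mean = 7/12 ≈ 0.583
print(mean_iou(pred, gt, num_classes=2))
```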
## Image Classification on ImageNet-1K
| Model | Pretrain | Resolution | acc@1 | acc@5 | #params | FLOPs | Checkpoint | Log |
|---|---|---|---|---|---|---|---|---|
| DaViT-T | IN-1K | 224 | 82.8 | 96.2 | 28.3M | 4.5G | download | log |
| DaViT-S | IN-1K | 224 | 84.2 | 96.9 | 49.7M | 8.8G | download | log |
| DaViT-B | IN-1K | 224 | 84.6 | 96.9 | 87.9M | 15.5G | download | log |
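The acc@1 and acc@5 columns above are top-1 and top-5 accuracies: a prediction counts as correct if the ground-truth class appears among the k highest-scoring classes. A tiny sketch of the metric (illustrative only, not the evaluation code used to produce the table):

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose label is among the k top-scoring classes.

    scores: list of per-class score lists (one row per sample).
    labels: ground-truth class id per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)

scores = [[0.1, 0.7, 0.2],   # predicted class 1
          [0.5, 0.2, 0.3]]   # predicted class 0
labels = [1, 2]
print(topk_accuracy(scores, labels, k=1))  # 0.5 (second sample missed)
print(topk_accuracy(scores, labels, k=2))  # 1.0 (class 2 is second-best)
```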
## Object Detection and Instance Segmentation on COCO

Mask R-CNN:
| Backbone | Pretrain | Lr Schd | #params | FLOPs | box mAP | mask mAP | Checkpoint | Log |
|---|---|---|---|---|---|---|---|---|
| DaViT-T | ImageNet-1K | 1x | 47.8M | 263G | 45.0 | 41.1 | download | log |
| DaViT-T | ImageNet-1K | 3x | 47.8M | 263G | 47.4 | 42.9 | download | log |
| DaViT-S | ImageNet-1K | 1x | 69.2M | 351G | 47.7 | 42.9 | download | log |
| DaViT-S | ImageNet-1K | 3x | 69.2M | 351G | 49.5 | 44.3 | download | log |
| DaViT-B | ImageNet-1K | 1x | 107.3M | 491G | 48.2 | 43.3 | download | log |
| DaViT-B | ImageNet-1K | 3x | 107.3M | 491G | 49.9 | 44.6 | download | log |
RetinaNet:

| Backbone | Pretrain | Lr Schd | #params | FLOPs | box mAP | Checkpoint | Log |
|---|---|---|---|---|---|---|---|
| DaViT-T | ImageNet-1K | 1x | 38.5M | 244G | 44.0 | download | log |
| DaViT-T | ImageNet-1K | 3x | 38.5M | 244G | 46.5 | download | log |
| DaViT-S | ImageNet-1K | 1x | 59.9M | 332G | 46.0 | download | log |
| DaViT-S | ImageNet-1K | 3x | 59.9M | 332G | 48.2 | download | log |
| DaViT-B | ImageNet-1K | 1x | 98.5M | 471G | 46.7 | download | log |
| DaViT-B | ImageNet-1K | 3x | 98.5M | 471G | 48.7 | download | log |
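The box mAP values above follow the COCO protocol, which averages precision over IoU thresholds from 0.50 to 0.95. The box IoU used for that matching can be made concrete with a small helper (a sketch assuming corner-format `(x1, y1, x2, y2)` boxes; the actual evaluation is done by the COCO API via MMDetection):

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 2x2 boxes overlapping in a 1x1 region: IoU = 1 / (4 + 4 - 1) = 1/7
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```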
## Semantic Segmentation on ADE20K
| Backbone | Pretrain | Method | Resolution | Iters | #params | FLOPs | mIoU | Checkpoint | Log |
|---|---|---|---|---|---|---|---|---|---|
| DaViT-T | ImageNet-1K | UPerNet | 512x512 | 160k | 60M | 940G | 46.3 | download | log |
| DaViT-S | ImageNet-1K | UPerNet | 512x512 | 160k | 81M | 1030G | 48.8 | download | log |
| DaViT-B | ImageNet-1K | UPerNet | 512x512 | 160k | 121M | 1175G | 49.4 | download | log |
If you find this repo useful for your project, please consider citing it with the following BibTeX entry:
```bibtex
@inproceedings{ding2022davit,
  title={DaViT: Dual Attention Vision Transformer},
  author={Ding, Mingyu and Xiao, Bin and Codella, Noel and Luo, Ping and Wang, Jingdong and Yuan, Lu},
  booktitle={ECCV},
  year={2022},
}
```
Our codebase is built on top of timm, MMDetection, and MMSegmentation. We thank the authors for their nicely organized code!