DaViT: Dual Attention Vision Transformers

Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan

Abstract

In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show that DaViT achieves state-of-the-art performance on four different tasks with efficient computation. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/dingmyu/davit.
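
The two token views described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that). It assumes identity query/key/value projections, a single head, and 1-D windows over an already flattened token sequence. The function names and the `window` and `group` parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # Single-head self-attention over the second-to-last axis.
    # Assumption: Q = K = V = tokens (no learned projections).
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def spatial_window_attention(x, window):
    # x: (N, C). Each spatial position is a token with C features.
    # Attention is restricted to windows of `window` consecutive positions.
    n, c = x.shape
    assert n % window == 0, "N must be divisible by the window size"
    windows = x.reshape(n // window, window, c)
    return self_attention(windows).reshape(n, c)


def channel_group_attention(x, group):
    # x: (N, C). Each channel is a token whose N features span all positions.
    # Attention is restricted to groups of `group` consecutive channels.
    n, c = x.shape
    assert c % group == 0, "C must be divisible by the group size"
    channel_tokens = x.T.reshape(c // group, group, n)
    return self_attention(channel_tokens).reshape(c, n).T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(16, 8))  # N = 16 positions, C = 8 channels
    y = channel_group_attention(spatial_window_attention(x, window=4), group=4)
    print(y.shape)
```

Under these assumptions the score matrices have size `window × window` per window and `group × group` per channel group, so for fixed `window` and `group` the cost of both functions grows linearly with the number of positions N. Because every channel token spans all N positions, perturbing a single position changes the channel-attention output everywhere, whereas the windowed spatial attention only changes outputs inside the affected window.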

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| image-classification-on-imagenet | DaViT-B (ImageNet-22k) | GFLOPs: 46.4; Number of params: 87.9M; Top 1 Accuracy: 86.9% |
| image-classification-on-imagenet | DaViT-T | Number of params: 28.3M; Top 1 Accuracy: 82.8% |
| image-classification-on-imagenet | DaViT-B | GFLOPs: 15.5; Number of params: 87.9M; Top 1 Accuracy: 84.6% |
| image-classification-on-imagenet | DaViT-L (ImageNet-22k) | GFLOPs: 103; Number of params: 196.8M; Top 1 Accuracy: 87.5% |
| image-classification-on-imagenet | DaViT-H | GFLOPs: 334; Number of params: 362M; Top 1 Accuracy: 90.2% |
| image-classification-on-imagenet | DaViT-G | GFLOPs: 1038; Number of params: 1437M; Top 1 Accuracy: 90.4% |
| instance-segmentation-on-coco-minival | DaViT-T (Mask R-CNN, 36 epochs) | mask AP: 44.3 |
| medical-image-classification-on-imagenet | DaViT-T | GFLOPs: 4.5 |
| medical-image-classification-on-imagenet | DaViT-S | GFLOPs: 8.8; Top 1 Accuracy: 84.2% |
| object-detection-on-coco-minival | DaViT-T (Mask R-CNN, 36 epochs) | box AP: 49.9 |
| semantic-segmentation-on-ade20k | DaViT-T | Validation mIoU: 46.3 |
| semantic-segmentation-on-ade20k | DaViT-B | Validation mIoU: 49.4 |
| semantic-segmentation-on-ade20k-val | DaViT-B (UperNet) | mIoU: 46.3 |
| semantic-segmentation-on-ade20k-val | DaViT-S (UperNet) | mIoU: 48.8 |
