MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Yanghao Li Chao-Yuan Wu Haoqi Fan Karttikeya Mangalam Bo Xiong Jitendra Malik Christoph Feichtenhofer

Abstract

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection, and Kinetics video recognition, where it outperforms prior work. We further compare MViTv2's pooling attention to window attention mechanisms, where it outperforms the latter in accuracy/compute. Without bells and whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 box AP on COCO object detection, and 86.1% on Kinetics-400 video classification. Code and models are available at https://github.com/facebookresearch/mvit.
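The abstract names three ingredients: pooling attention, decomposed relative positional embeddings, and residual pooling connections. The following is a minimal single-head NumPy sketch of how they might fit together, not the authors' implementation. Several simplifications are assumptions made for illustration: average pooling stands in for whatever learned pooling operator the model uses, queries and keys/values are pooled with the same stride so they share one grid, and the names `avg_pool_tokens`, `pooling_attention`, `rel_h`, and `rel_w` are hypothetical.

```python
import numpy as np


def avg_pool_tokens(x, h, w, stride):
    """Average-pool an (h*w, d) token grid by `stride` along both axes.

    Assumes h and w are divisible by stride (illustrative simplification).
    """
    d = x.shape[1]
    grid = x.reshape(h // stride, stride, w // stride, stride, d)
    pooled = grid.mean(axis=(1, 3))  # (h/stride, w/stride, d)
    return pooled.reshape(-1, d), h // stride, w // stride


def softmax(a):
    """Numerically stable softmax over the last axis."""
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)


def pooling_attention(x, h, w, wq, wk, wv, rel_h, rel_w, stride=2):
    """Single-head pooling attention with a decomposed relative positional
    bias and a residual pooling connection (sketch under stated assumptions).

    rel_h: (2*hp - 1, d) table indexed by vertical offset.
    rel_w: (2*wp - 1, d) table indexed by horizontal offset.
    """
    # Project, then pool. Same stride for Q and K/V is an assumption.
    q, hp, wp = avg_pool_tokens(x @ wq, h, w, stride)
    k, _, _ = avg_pool_tokens(x @ wk, h, w, stride)
    v, _, _ = avg_pool_tokens(x @ wv, h, w, stride)
    d = q.shape[1]

    # Row/column coordinate of every pooled token.
    rows, cols = np.divmod(np.arange(hp * wp), wp)
    # Relative offsets, shifted so they index into the tables from 0.
    dh = rows[:, None] - rows[None, :] + (hp - 1)
    dw = cols[:, None] - cols[None, :] + (wp - 1)

    # Decomposed embedding: R[i, j] = rel_h[dh[i, j]] + rel_w[dw[i, j]],
    # so the bias q_i . R[i, j] splits into one lookup per axis.
    qh = q @ rel_h.T  # (N, 2*hp - 1)
    qw = q @ rel_w.T  # (N, 2*wp - 1)
    bias = np.take_along_axis(qh, dh, axis=1) + np.take_along_axis(qw, dw, axis=1)

    attn = softmax((q @ k.T + bias) / np.sqrt(d))
    # Residual pooling connection: add the pooled query back to the output.
    return attn @ v + q, hp, wp


# Usage with random inputs: a 4x4 grid of 8-dimensional tokens.
rng = np.random.default_rng(0)
h, w, d = 4, 4, 8
x = rng.normal(size=(h * w, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
rel_h = rng.normal(size=(2 * (h // 2) - 1, d))
rel_w = rng.normal(size=(2 * (w // 2) - 1, d))
out, hp, wp = pooling_attention(x, h, w, wq, wk, wv, rel_h, rel_w, stride=2)
print(out.shape, hp, wp)
```

Decomposing the positional table per axis keeps its size proportional to the sum of the grid dimensions rather than their product, which is the main reason to prefer it over a joint table in a sketch like this one.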

Code Repositories

JunweiLiang/aicity_action (PyTorch; mentioned in GitHub)
facebookresearch/SlowFast (official; PyTorch; mentioned in GitHub)
rwightman/pytorch-image-models (PyTorch; mentioned in GitHub)
rajatmodi62/occludedactionbenchmark (PyTorch; mentioned in GitHub)
3dperceptionlab/visual-wetlandbirds (PyTorch; mentioned in GitHub)
facebookresearch/mvit (official; PyTorch; mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | MViTv2-L (ImageNet-21k pretrain) | Acc@1: 86.1; Acc@5: 97.0 |
| action-classification-on-kinetics-400 | MViT-B (train from scratch) | FLOPs (G) x views: 225x5 |
| action-classification-on-kinetics-600 | MViTv2-L (ImageNet-21k pretrain) | Top-1 Accuracy: 87.9; Top-5 Accuracy: 97.9 |
| action-classification-on-kinetics-600 | MViTv2-B (train from scratch) | Top-5 Accuracy: 97.2 |
| action-classification-on-kinetics-600 | MViTv2-L (train from scratch) | Top-1 Accuracy: 85.5 |
| action-classification-on-kinetics-600 | MViT-L (train from scratch) | GFLOPs: 206x5 |
| action-classification-on-kinetics-700 | MViTv2-B | Top-1 Accuracy: 76.6; Top-5 Accuracy: 93.2 |
| action-classification-on-kinetics-700 | MViTv2-L (ImageNet-21k pretrain) | Top-1 Accuracy: 79.4; Top-5 Accuracy: 94.9 |
| action-classification-on-kinetics-700 | MoViNet-A6 | Top-1 Accuracy: 79.4 |
| action-recognition-in-videos-on-something | MViT-L (IN-21K + Kinetics400 pretrain) | GFLOPs: 2828x3 |
| action-recognition-in-videos-on-something | MViTv2-L (IN-21K + Kinetics400 pretrain) | Parameters: 213.1; Top-1 Accuracy: 73.3; Top-5 Accuracy: 94.1 |
| action-recognition-in-videos-on-something | MViTv2-B (IN-21K + Kinetics400 pretrain) | Parameters: 51.1; Top-5 Accuracy: 93.4 |
| action-recognition-in-videos-on-something | MViT-B (IN-21K + Kinetics400 pretrain) | GFLOPs: 225x3; Top-1 Accuracy: 72.1 |
| action-recognition-on-ava-v2-2 | MViTv2-L (IN21k, K700) | mAP: 34.4 |
| image-classification-on-imagenet | MViTv2-L (384 res) | GFLOPs: 140.2; Number of params: 218M; Top 1 Accuracy: 86.3% |
| image-classification-on-imagenet | MViTv2-H (ImageNet-21k pretrain) | GFLOPs: 120.6; Number of params: 667M; Top 1 Accuracy: 88% |
| image-classification-on-imagenet | MViTv2-H (512 res, ImageNet-21k pretrain) | GFLOPs: 763.5; Number of params: 667M; Top 1 Accuracy: 88.8% |
| image-classification-on-imagenet | MViTv2-T | GFLOPs: 4.7; Number of params: 24M; Top 1 Accuracy: 82.3% |
| image-classification-on-imagenet | MViTv2-L (384 res, ImageNet-21k pretrain) | GFLOPs: 140.7; Number of params: 218M; Top 1 Accuracy: 88.4% |
| instance-segmentation-on-coco-minival | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) | mask AP: 50.5 |
| instance-segmentation-on-coco-minival | MViT-L (Mask R-CNN, single-scale) | mask AP: 46.2 |
| instance-segmentation-on-coco-minival | MViTv2-L (Cascade Mask R-CNN, single-scale) | mask AP: 47.1 |
| instance-segmentation-on-coco-minival | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) | mask AP: 48.5 |
| object-detection-on-coco-minival | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) | box AP: 58.7 |
| object-detection-on-coco-minival | MViTv2-L (Cascade Mask R-CNN, single-scale) | box AP: 54.3 |
| object-detection-on-coco-minival | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) | box AP: 56.1 |
| object-detection-on-coco-minival | MViT-L (Mask R-CNN, single-scale, IN21k pre-train) | box AP: 52.7 |
| object-detection-on-coco-o | MViTv2-H (Cascade Mask R-CNN) | Average mAP: 30.9; Effective Robustness: 5.62 |
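
Several video entries report compute in the form "225x5". One entry labels this "FLOPs (G) x views", which suggests per-view cost in GFLOPs multiplied by the number of views evaluated at inference. Assuming the other "GFLOPs: NxM" entries follow the same convention, the total inference cost can be worked out with a small helper. The function name `total_gflops` is hypothetical and used only for illustration.

```python
def total_gflops(cost: str) -> float:
    """Convert a 'per_view x views' string such as '225x5' to total GFLOPs.

    A plain number such as '140.2' is treated as a single view.
    """
    per_view, _, views = cost.lower().partition("x")
    return float(per_view) * (int(views) if views else 1)


print(total_gflops("225x5"))   # 1125.0
print(total_gflops("2828x3"))  # 8484.0
print(total_gflops("140.2"))   # 140.2
```

Under this reading, totals rather than per-view figures are the fairer basis for comparing entries that use different numbers of views.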
