Multiscale Vision Transformers

Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer

Abstract

We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast
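
The channel-resolution trade-off described in the abstract can be illustrated with a small calculation. The following is a minimal sketch, not the authors' implementation: the 224-pixel input, patch stride of 4, base width of 96 channels, four stages, and the rule that every stage transition doubles the channels and halves each spatial side are all illustrative assumptions that the abstract does not state. The temporal dimension and the attention computation itself are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One scale stage of a hypothetical multiscale pyramid."""
    index: int
    channels: int
    height: int
    width: int

    @property
    def tokens(self) -> int:
        # Number of spatial tokens the stage attends over.
        return self.height * self.width


def stage_schedule(input_size: int = 224,
                   patch_stride: int = 4,
                   base_channels: int = 96,
                   num_stages: int = 4) -> List[Stage]:
    """Return the channel width and spatial grid of each stage.

    Assumption (for illustration only): each stage transition doubles the
    channel dimension and halves the height and width of the token grid.
    """
    side = input_size // patch_stride  # token grid after patch embedding
    channels = base_channels
    stages = []
    for i in range(num_stages):
        stages.append(Stage(i + 1, channels, side, side))
        channels *= 2   # expand channel capacity
        side //= 2      # reduce spatial resolution
    return stages


if __name__ == "__main__":
    for s in stage_schedule():
        print(f"stage {s.index}: {s.channels:4d} channels, "
              f"{s.height}x{s.width} grid, {s.tokens} tokens")
```

Under these assumptions the early stages hold many tokens with few channels and the last stage holds few tokens with many channels, which is the pyramid shape the abstract describes.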

Code Repositories

junweiliang/multitrain (PyTorch, mentioned in GitHub)
facebookresearch/SlowFast (official, PyTorch, mentioned in GitHub)
facebookresearch/pytorchvideo (PyTorch, mentioned in GitHub)
wangjk666/stts (PyTorch, mentioned in GitHub)
rohanshad/cmr_transformer (PyTorch, mentioned in GitHub)
facebookresearch/hiera (PyTorch, mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| action-classification-on-charades | MViT-B, 32x3 (Kinetics-400 pretraining) | mAP: 44.3 |
| action-classification-on-charades | MViT-B-24, 32x3 (Kinetics-600 pretraining) | mAP: 47.7 |
| action-classification-on-charades | MViT-B, 32x3 (Kinetics-600 pretraining) | mAP: 47.1 |
| action-classification-on-charades | MViT-B, 16x4 (Kinetics-600 pretraining) | mAP: 43.9 |
| action-classification-on-charades | MViT-B-24, 32x3 (Kinetics-400 pretraining) | mAP: 46.3 |
| action-classification-on-charades | MViT-B, 16x4 (Kinetics-400 pretraining) | mAP: 40 |
| action-classification-on-kinetics-400 | MViT-B, 32x3 | Acc@1: 80.2; Acc@5: 94.4 |
| action-classification-on-kinetics-400 | MViT-B, 16x4 | Acc@1: 78.4; Acc@5: 93.5 |
| action-classification-on-kinetics-400 | MViT-B, 64x3 | Acc@1: 81.2; Acc@5: 95.1 |
| action-classification-on-kinetics-400 | MViT-S | Acc@1: 76; Acc@5: 92.1 |
| action-classification-on-kinetics-600 | MViT-B, 16x4 | Top-1 Accuracy: 82.1; Top-5 Accuracy: 95.7 |
| action-classification-on-kinetics-600 | MViT-B-24, 32x3 | Top-1 Accuracy: 83.8; Top-5 Accuracy: 96.3 |
| action-classification-on-kinetics-600 | MViT-B, 32x3 | Top-1 Accuracy: 83.4; Top-5 Accuracy: 96.3 |
| action-recognition-in-videos-on-something | MViT-B, 32x3 (Kinetics-600 pretraining) | GFLOPs: 170x3; Parameters: 36.6M; Top-1 Accuracy: 67.8; Top-5 Accuracy: 91.3 |
| action-recognition-in-videos-on-something | MViT-B, 16x4 | Top-1 Accuracy: 66.2; Top-5 Accuracy: 90.2 |
| action-recognition-in-videos-on-something | MViT-B-24, 32x3 | GFLOPs: 236x3; Parameters: 53.2M; Top-1 Accuracy: 68.7; Top-5 Accuracy: 91.5 |
| action-recognition-on-ava-v2-2 | MViT-B, 64x3 (Kinetics-400 pretraining) | mAP: 27.3 |
| action-recognition-on-ava-v2-2 | MViT-B, 16x4 (Kinetics-600 pretraining) | mAP: 26.1 |
| action-recognition-on-ava-v2-2 | MViT-B-24, 32x3 (Kinetics-600 pretraining) | mAP: 28.7 |
| action-recognition-on-ava-v2-2 | MViT-B, 32x3 (Kinetics-400 pretraining) | mAP: 26.8 |
| action-recognition-on-ava-v2-2 | MViT-B, 16x4 (Kinetics-400 pretraining) | mAP: 24.5 |
| action-recognition-on-ava-v2-2 | MViT-B, 32x3 (Kinetics-600 pretraining) | mAP: 27.5 |
| image-classification-on-imagenet | MViT-B-24 | GFLOPs: 32.7; Number of params: 72.9M; Top-1 Accuracy: 84.8% |
| image-classification-on-imagenet | MViT-B-16 | GFLOPs: 7.8; Number of params: 37M; Top-1 Accuracy: 83.0% |
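
The video entries above use compact notations such as `32x3` and `170x3`. The sketch below reads them under two assumptions that this page does not state explicitly: a clip spec like `32x3` is taken to mean 32 sampled frames with a temporal stride of 3, and a GFLOPs value like `170x3` is taken to mean a per-view cost multiplied by the number of inference views. The helper names `parse_clip` and `total_gflops` are hypothetical and exist only for this illustration.

```python
import re
from typing import Tuple


def parse_clip(spec: str) -> Tuple[int, int, int]:
    """Parse a clip spec such as '32x3'.

    Assumed meaning: <frames>x<temporal stride>. Returns
    (frames, stride, window), where window = frames * stride is the
    approximate number of raw video frames the clip spans.
    """
    match = re.fullmatch(r"\s*(\d+)\s*x\s*(\d+)\s*", spec)
    if match is None:
        raise ValueError(f"unrecognized clip spec: {spec!r}")
    frames, stride = int(match.group(1)), int(match.group(2))
    return frames, stride, frames * stride


def total_gflops(spec: str) -> float:
    """Parse a cost spec such as '170x3'.

    Assumed meaning: <GFLOPs per view>x<number of views>. Returns the
    product, i.e. the total inference cost across all views.
    """
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*x\s*(\d+)\s*", spec)
    if match is None:
        raise ValueError(f"unrecognized cost spec: {spec!r}")
    return float(match.group(1)) * int(match.group(2))


if __name__ == "__main__":
    print(parse_clip("32x3"))
    print(total_gflops("170x3"))
```

If these readings hold, a `16x4` clip and a `32x3` clip differ both in how many frames the model sees and in how much of the video they cover, which is worth keeping in mind when comparing rows.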
