3 months ago

FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, Anurag Ranjan

Abstract

The recent amalgamation of transformer and convolutional designs has led to steady improvements in the accuracy and efficiency of models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains a state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network. We further apply train-time overparametrization and large kernel convolutions to boost accuracy, and empirically show that these choices have minimal effect on latency. We show that our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks (image classification, detection, segmentation, and 3D mesh regression), with significant improvement in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models. Code and models are available at https://github.com/apple/ml-fastvit.
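The idea of removing a skip-connection through structural reparameterization can be illustrated with a small sketch. It is not the RepMixer implementation from the paper's repository: it assumes a single channel, a 3x3 kernel with zero padding, and no normalization layer, and the function names (conv2d_same, fold_skip_into_kernel, train_time_branch) are hypothetical. Under those assumptions, a train-time block that computes x + conv(x) gives the same output as a single convolution whose centre tap has been increased by 1, so the separate skip branch is not needed at inference.

```python
def conv2d_same(x, k):
    """Zero-padded 'same' 2D cross-correlation of one channel with an odd square kernel."""
    h, w = len(x), len(x[0])
    r = len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:  # positions outside the input count as zero
                        acc += x[ii][jj] * k[di + r][dj + r]
            out[i][j] = acc
    return out


def fold_skip_into_kernel(k):
    """Return a copy of k whose convolution equals conv(x, k) + x (identity folded into centre tap)."""
    r = len(k) // 2
    folded = [row[:] for row in k]
    folded[r][r] += 1.0
    return folded


def train_time_branch(x, k):
    """Illustrative train-time form: explicit skip-connection around the convolution."""
    conv = conv2d_same(x, k)
    return [[x[i][j] + conv[i][j] for j in range(len(x[0]))] for i in range(len(x))]


# Toy input and kernel (arbitrary values chosen for illustration).
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k = [[0.1, 0.0, -0.1], [0.2, 0.5, -0.2], [0.1, 0.0, -0.1]]

y_train = train_time_branch(x, k)                    # two branches: conv + skip
y_infer = conv2d_same(x, fold_skip_into_kernel(k))   # one branch: reparameterized conv
max_diff = max(abs(a - b) for ra, rb in zip(y_train, y_infer) for a, b in zip(ra, rb))
print("max abs difference:", max_diff)
```

The two outputs agree up to floating-point rounding. A real block would also need to fold any normalization layer into the kernel and bias before merging, which this sketch leaves out.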

Benchmarks

Benchmark                              Methodology    Metrics
3d-hand-pose-estimation-on-freihand    FastViT-MA36   PA-F@15mm: 0.981; PA-F@5mm: 0.722; PA-MPJPE: 6.6; PA-MPVPE: 6.7
image-classification-on-imagenet       FastViT-SA24   Top 1 Accuracy: 82.6%
image-classification-on-imagenet       FastViT-MA36   Top 1 Accuracy: 84.9%
image-classification-on-imagenet       FastViT-SA12   Top 1 Accuracy: 80.6%
image-classification-on-imagenet       FastViT-S12    Top 1 Accuracy: 79.8%
image-classification-on-imagenet       FastViT-SA36   Top 1 Accuracy: 84.5%
image-classification-on-imagenet       FastViT-T12    Top 1 Accuracy: 79.1%
image-classification-on-imagenet       FastViT-T8     Top 1 Accuracy: 75.6%
semantic-segmentation-on-ade20k        FastViT-SA36   Mean IoU (class): 42.9
semantic-segmentation-on-ade20k        FastViT-SA12   Mean IoU (class): 38
semantic-segmentation-on-ade20k        FastViT-SA24   Mean IoU (class): 41
semantic-segmentation-on-ade20k        FastViT-MA36   Mean IoU (class): 44.6
