The effectiveness of MAE pre-pretraining for billion-scale pretraining

Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

Abstract

This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters), and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.7%), ImageNet-ReaL (91.1%), 1-shot ImageNet-1k (63.6%), and zero-shot transfer on Food-101 (96.2%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images, and our models are available publicly.
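The MAE technique used for the pre-pretraining stage hinges on random patch masking: only a small fraction of image patches (typically 25%) is shown to the encoder, and the rest must be reconstructed. A minimal stdlib sketch of that masking step, under the usual MAE defaults (the function name and structure are illustrative, not taken from the released code):

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=0):
    """Choose which patch indices stay visible, as in MAE's random masking.

    MAE keeps only (1 - mask_ratio) of the patches and feeds them to the
    encoder; a lightweight decoder then reconstructs the masked patches.
    """
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)  # uniform random permutation of patch indices
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked

# A ViT at 224x224 with 16x16 patches has (224 // 16) ** 2 = 196 patches,
# so 49 patches stay visible and 147 are masked at the default 75% ratio.
visible, masked = random_masking(196)
print(len(visible), len(masked))
```

The high mask ratio is what makes the pre-pretraining stage cheap: the encoder only ever processes a quarter of the tokens, which is how the recipe stays affordable at billion-image scale.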

Code Repositories

facebookresearch/maws (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| action-recognition-in-videos-on-something | MAWS (ViT-L) | Top-1 Accuracy: 74.4 |
| few-shot-image-classification-on-imagenet-1-1 | MAWS (ViT-6.5B) | Top 1 Accuracy: 63.6 |
| few-shot-image-classification-on-imagenet-1-1 | MAWS (ViT-2B) | Top 1 Accuracy: 62.1 |
| few-shot-image-classification-on-imagenet-1-1 | MAWS (ViT-H) | Top 1 Accuracy: 57.1 |
| few-shot-image-classification-on-imagenet-10 | MAWS (ViT-H) | Top 1 Accuracy: 82.5 |
| few-shot-image-classification-on-imagenet-10 | MAWS (ViT-2B) | Top 1 Accuracy: 83.7 |
| few-shot-image-classification-on-imagenet-10 | MAWS (ViT-6.5B) | Top 1 Accuracy: 84.6 |
| few-shot-image-classification-on-imagenet-5 | MAWS (ViT-H) | Top 1 Accuracy: 79.8 |
| few-shot-image-classification-on-imagenet-5 | MAWS (ViT-2B) | Top 1 Accuracy: 81.5 |
| few-shot-image-classification-on-imagenet-5 | MAWS (ViT-6.5B) | Top 1 Accuracy: 82.6 |
| few-shot-image-classification-on-inaturalist-1 | MAWS (ViT-2B) | Top 1 Accuracy: 35.5 |
| few-shot-image-classification-on-inaturalist-2 | MAWS (ViT-2B) | Top 1 Accuracy: 72.8 |
| few-shot-image-classification-on-inaturalist-3 | MAWS (ViT-2B) | Top 1 Accuracy: 80.3 |
| image-classification-on-imagenet | MAWS (ViT-6.5B) | Number of params: 6500M; Top 1 Accuracy: 90.1% |
| image-classification-on-imagenet | MAWS (ViT-L) | Top 1 Accuracy: 88.8% |
| image-classification-on-imagenet | MAWS (ViT-2B) | Number of params: 2000M; Top 1 Accuracy: 89.8% |
| image-classification-on-imagenet | MAWS (ViT-B) | Top 1 Accuracy: 86.8% |
| image-classification-on-imagenet | MAWS (ViT-H) | Number of params: 650M; Top 1 Accuracy: 89.5% |
| image-classification-on-imagenet-real | MAWS (ViT-6.5B) | Accuracy: 91.1% |
| image-classification-on-imagenet-real | MAWS (ViT-H) | Accuracy: 90.8% |
| image-classification-on-imagenet-real | MAWS (ViT-2B) | Accuracy: 90.9% |
| image-classification-on-imagenet-v2 | MAWS (ViT-6.5B) | Top 1 Accuracy: 84.0 |
| image-classification-on-imagenet-v2 | MAWS (ViT-2B) | Top 1 Accuracy: 83.0 |
| image-classification-on-inaturalist-2018 | MAWS (ViT-2B) | Top-1 Accuracy: 91.3% |
| image-classification-on-objectnet | MAWS (ViT-H) | Top-1 Accuracy: 72.6 |
| image-classification-on-objectnet | MAWS (ViT-2B) | Top-1 Accuracy: 75.8 |
| image-classification-on-objectnet | MAWS (ViT-6.5B) | Top-1 Accuracy: 77.9 |
| zero-shot-transfer-image-classification-on-1 | MAWS (ViT-2B) | Accuracy (Private): 82.1 |
| zero-shot-transfer-image-classification-on-1 | MAWS (ViT-H) | Accuracy (Private): 81.1 |
| zero-shot-transfer-image-classification-on-17 | MAWS (ViT-2B) | Top 1 Accuracy: 96.2 |
