Asymmetric Masked Distillation for Pre-Training Small Foundation Models
Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, Limin Wang

Abstract
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models, but large foundation models incur a high computational cost. This paper focuses on pre-training relatively small vision transformer models that can be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is an asymmetric masking strategy: the teacher model sees more context with a lower masking ratio, while the student model keeps a high masking ratio. We design customized multi-layer feature alignment between the teacher encoder and the student encoder to regularize the pre-training of the student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieves 84.6% classification accuracy on IN1K with a ViT-B model, and 73.3% classification accuracy on Something-Something V2 with a ViT-B model, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvements over the original masked autoencoding. The code and models are available at https://github.com/MCG-NJU/AMD.
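The asymmetric masking idea is simple enough to sketch in a few lines of PyTorch. The snippet below is a minimal, illustrative sketch only, not the official implementation (see the repository linked above): the stand-in transformer encoders, the 0.90/0.75 masking ratios, the single `proj` alignment head, and the superset sampling that gives the teacher the student's visible tokens plus extra context are all assumptions. The paper's multi-layer feature alignment is reduced to a single final-layer alignment here, and positional embeddings and the MAE decoder/reconstruction loss are omitted for brevity.

```python
# Hypothetical sketch of asymmetric masked distillation (AMD); NOT the
# official code. Shapes, ratios, and modules below are assumptions.
import torch
import torch.nn as nn

B, N, D = 2, 196, 384  # batch size, tokens per image, embedding dim (assumed)


def visible_indices(n, mask_ratio, extra_from=None, device=None):
    """Sample indices of tokens that stay visible after random masking.
    If `extra_from` is given, the result is a superset of it, so the
    teacher sees everything the student sees plus additional context."""
    keep = int(n * (1 - mask_ratio))
    if extra_from is None:
        return torch.randperm(n, device=device)[:keep]
    taken = set(extra_from.tolist())
    rest = torch.tensor([i for i in range(n) if i not in taken], device=device)
    extra = rest[torch.randperm(len(rest), device=device)[: keep - len(extra_from)]]
    return torch.cat([extra_from, extra])


# Stand-in encoders; the real models are ViTs (teacher larger than student).
teacher = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=4)
student = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=2)
proj = nn.Linear(D, D)  # maps student features into the teacher's space
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)  # teacher is frozen during distillation

tokens = torch.randn(B, N, D)  # patch embeddings for one batch

# Asymmetric masking: student keeps a high ratio, teacher a lower one.
s_vis = visible_indices(N, mask_ratio=0.90)                    # 19 tokens
t_vis = visible_indices(N, mask_ratio=0.75, extra_from=s_vis)  # 49 tokens

s_feat = student(tokens[:, s_vis])       # (B, 19, D)
with torch.no_grad():
    t_feat = teacher(tokens[:, t_vis])   # (B, 49, D)

# Align features on the tokens both encoders saw: by construction, the
# student's visible tokens occupy the first len(s_vis) teacher positions.
align_loss = nn.functional.mse_loss(proj(s_feat), t_feat[:, : len(s_vis)])
print(float(align_loss))
```

In the full objective this alignment term would be added to the student's standard masked-reconstruction loss; aligning only on the shared visible tokens is what lets the lower-masking-ratio teacher inject context the student never directly sees.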
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | AMD (ViT-B/16) | Acc@1: 82.2, Acc@5: 95.3, FLOPs (G) x views: 180x15, Params (M): 87 |
| action-classification-on-kinetics-400 | AMD (ViT-S/16) | Acc@1: 80.1, Acc@5: 94.5, FLOPs (G) x views: 57x15, Params (M): 22 |
| action-recognition-in-videos-on-hmdb-51 | AMD (ViT-B/16) | Average accuracy over 3 splits: 79.6 |
| action-recognition-in-videos-on-something | AMD (ViT-S/16) | Top-1 accuracy: 70.2, Top-5 accuracy: 92.5, GFLOPs x views: 57x6, Params (M): 22 |
| action-recognition-in-videos-on-something | AMD (ViT-B/16) | Top-1 accuracy: 73.3, Top-5 accuracy: 94.0, GFLOPs x views: 180x6, Params (M): 87 |
| action-recognition-in-videos-on-ucf101 | AMD (ViT-B/16) | 3-fold accuracy: 97.1 |
| action-recognition-on-ava-v2-2 | AMD (ViT-B/16) | mAP: 33.5 |
| image-classification-on-imagenet | AMD (ViT-B/16) | Top-1 accuracy: 84.6%, Params (M): 87 |
| image-classification-on-imagenet | AMD (ViT-S/16) | Top-1 accuracy: 82.1%, Params (M): 22 |