Asymmetric Masked Distillation for Pre-Training Small Foundation Models
Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, Limin Wang

Abstract
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models, but large foundation models incur a high computational cost. This paper focuses on pre-training relatively small vision transformer models that can be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is an asymmetric masking strategy: the teacher model sees more context with a lower masking ratio, while the student model keeps a high masking ratio. We design customized multi-layer feature alignment between the teacher encoder and the student encoder to regularize the pre-training of the student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieves 84.6% classification accuracy on IN1K with a ViT-B model, and 73.3% classification accuracy on Something-Something V2 with a ViT-B model, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvements over the original masked autoencoding. The code and models are available at https://github.com/MCG-NJU/AMD.
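The asymmetric masking idea is simple enough to sketch in a few lines of PyTorch. The snippet below is a minimal, illustrative sketch only, not the official implementation (see the repository linked above): the stand-in transformer encoders, the 0.90/0.75 masking ratios, the single `proj` alignment head, and the superset sampling that gives the teacher the student's visible tokens plus extra context are all assumptions. The paper's multi-layer feature alignment is reduced to a single final-layer alignment here, and positional embeddings and the MAE decoder/reconstruction loss are omitted for brevity.

```python
# Hypothetical sketch of asymmetric masked distillation (AMD); NOT the
# official code. Shapes, ratios, and modules below are assumptions.
import torch
import torch.nn as nn

B, N, D = 2, 196, 384  # batch size, tokens per image, embedding dim (assumed)


def visible_indices(n, mask_ratio, extra_from=None, device=None):
    """Sample indices of tokens that stay visible after random masking.
    If `extra_from` is given, the result is a superset of it, so the
    teacher sees everything the student sees plus additional context."""
    keep = int(n * (1 - mask_ratio))
    if extra_from is None:
        return torch.randperm(n, device=device)[:keep]
    taken = set(extra_from.tolist())
    rest = torch.tensor([i for i in range(n) if i not in taken], device=device)
    extra = rest[torch.randperm(len(rest), device=device)[: keep - len(extra_from)]]
    return torch.cat([extra_from, extra])


# Stand-in encoders; the real models are ViTs (teacher larger than student).
teacher = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=4)
student = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=2)
proj = nn.Linear(D, D)  # maps student features into the teacher's space
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)  # teacher is frozen during distillation

tokens = torch.randn(B, N, D)  # patch embeddings for one batch

# Asymmetric masking: student keeps a high ratio, teacher a lower one.
s_vis = visible_indices(N, mask_ratio=0.90)                    # 19 tokens
t_vis = visible_indices(N, mask_ratio=0.75, extra_from=s_vis)  # 49 tokens

s_feat = student(tokens[:, s_vis])       # (B, 19, D)
with torch.no_grad():
    t_feat = teacher(tokens[:, t_vis])   # (B, 49, D)

# Align features on the tokens both encoders saw: by construction, the
# student's visible tokens occupy the first len(s_vis) teacher positions.
align_loss = nn.functional.mse_loss(proj(s_feat), t_feat[:, : len(s_vis)])
print(float(align_loss))
```

In the full objective this alignment term would be added to the student's standard masked-reconstruction loss; aligning only on the shared visible tokens is what lets the lower-masking-ratio teacher inject context the student never directly sees.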
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | AMD (ViT-B/16) | Acc@1: 82.2, Acc@5: 95.3, FLOPs (G) x views: 180x15, Params (M): 87 |
| action-classification-on-kinetics-400 | AMD (ViT-S/16) | Acc@1: 80.1, Acc@5: 94.5, FLOPs (G) x views: 57x15, Params (M): 22 |
| action-recognition-in-videos-on-hmdb-51 | AMD (ViT-B/16) | Average accuracy over 3 splits: 79.6 |
| action-recognition-in-videos-on-something | AMD (ViT-S/16) | Top-1 accuracy: 70.2, Top-5 accuracy: 92.5, GFLOPs x views: 57x6, Params (M): 22 |
| action-recognition-in-videos-on-something | AMD (ViT-B/16) | Top-1 accuracy: 73.3, Top-5 accuracy: 94.0, GFLOPs x views: 180x6, Params (M): 87 |
| action-recognition-in-videos-on-ucf101 | AMD (ViT-B/16) | 3-fold accuracy: 97.1 |
| action-recognition-on-ava-v2-2 | AMD (ViT-B/16) | mAP: 33.5 |
| image-classification-on-imagenet | AMD (ViT-B/16) | Top-1 accuracy: 84.6%, Params (M): 87 |
| image-classification-on-imagenet | AMD (ViT-S/16) | Top-1 accuracy: 82.1%, Params (M): 22 |