Masked Image Residual Learning for Scaling Deeper Vision Transformers

Guoxi Huang Hongtao Fu Adrian G. Bors

Abstract

Deeper Vision Transformers (ViTs) are more challenging to train. We expose a degradation problem in deeper layers of ViT when using masked image modeling (MIM) for pre-training. To ease the training of deeper ViTs, we introduce a self-supervised learning framework called Masked Image Residual Learning (MIRL), which significantly alleviates the degradation problem, making scaling ViT along depth a promising direction for performance upgrade. We reformulate the pre-training objective for deeper layers of ViT as learning to recover the residual of the masked image. We provide extensive empirical evidence showing that deeper ViTs can be effectively optimized using MIRL and easily gain accuracy from increased depth. With the same level of computational complexity as ViT-Base and ViT-Large, we instantiate 4.5$\times$ and 2$\times$ deeper ViTs, dubbed ViT-S-54 and ViT-B-48. The deeper ViT-S-54, costing 3$\times$ less than ViT-Large, achieves performance on par with ViT-Large. ViT-B-48 achieves 86.2% top-1 accuracy on ImageNet. On one hand, deeper ViTs pre-trained with MIRL exhibit excellent generalization capabilities on downstream tasks, such as object detection and semantic segmentation. On the other hand, MIRL demonstrates high pre-training efficiency. With less pre-training time, MIRL yields competitive performance compared to other approaches.
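The reformulated objective can be illustrated with a toy sketch. It assumes, hypothetically, that the shallower layers produce a coarse reconstruction of the masked patches and that the deeper layers are trained to predict only what remains (the residual). The function names, array shapes, and the mean-squared-error loss are illustrative assumptions, not taken from the paper or the official repository.

```python
import numpy as np

def residual_target(patches, coarse_pred, mask):
    """Residual that the deeper layers would be asked to recover.

    patches:     (N, D) ground-truth pixel patches
    coarse_pred: (N, D) reconstruction from the shallower layers (assumed)
    mask:        (N,) boolean, True where the patch was masked out
    """
    residual = patches - coarse_pred
    return residual[mask]

def residual_loss(patches, coarse_pred, residual_pred, mask):
    """MSE between predicted and true residual, on masked patches only."""
    target = residual_target(patches, coarse_pred, mask)
    return float(np.mean((residual_pred[mask] - target) ** 2))

# Toy data: 8 patches of dimension 4, 5 of them masked.
rng = np.random.default_rng(0)
patches = rng.normal(size=(8, 4))
coarse = patches + 0.1 * rng.normal(size=(8, 4))  # imperfect coarse reconstruction
mask = np.array([True, False, True, True, False, True, False, True])

perfect_residual = patches - coarse               # an ideal deeper-layer prediction
zero_residual = np.zeros_like(patches)            # deeper layers contribute nothing

loss_perfect = residual_loss(patches, coarse, perfect_residual, mask)
loss_zero = residual_loss(patches, coarse, zero_residual, mask)
print(loss_perfect, loss_zero)
```

In this sketch the full reconstruction is the sum `coarse + residual_pred`, so a perfect residual prediction drives the loss to zero, while predicting nothing leaves exactly the error of the coarse reconstruction.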

Code Repositories

russellllaputa/MIRL (Official, Paddle)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| image-classification-on-imagenet | MIRL (ViT-S-54) | GFLOPs: 18.8; Number of params: 96M; Top 1 Accuracy: 84.8% |
| image-classification-on-imagenet | MIRL (ViT-B-48) | GFLOPs: 67.0; Number of params: 341M; Top 1 Accuracy: 86.2% |
| self-supervised-image-classification-on-1 | MIRL (ViT-B-48) | Number of params: 341M; Top 1 Accuracy: 86.2% |
| self-supervised-image-classification-on-1 | MIRL (ViT-S-54) | Number of params: 96M; Top 1 Accuracy: 84.8% |
