Improving Visual Representation Learning through Perceptual Understanding
Samyakh Tukra, Frederick Hoffman, Ken Chatfield

Abstract
We present an extension to masked autoencoders (MAE) which improves the representations learnt by the model by explicitly encouraging the learning of higher, scene-level features. We do this by: (i) introducing a perceptual similarity term between generated and real images, and (ii) incorporating several techniques from the adversarial training literature, including multi-scale training and adaptive discriminator augmentation. The combination of these results not only in better pixel reconstruction but also in representations which appear to better capture higher-level details within images. More consequentially, we show how our method, Perceptual MAE, leads to better performance when used for downstream tasks, outperforming previous methods. We achieve 78.1% top-1 accuracy when linear probing on ImageNet-1K and up to 88.1% when fine-tuning, with similar results for other downstream tasks, all without the use of additional pre-trained models or data.
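As a rough illustration of the core idea, the sketch below (PyTorch; all names are hypothetical, not the authors' API) combines the standard MAE masked-pixel reconstruction loss with a perceptual feature-matching term between reconstructed and real images. The paper's full method additionally uses adversarial training with multi-scale training and adaptive discriminator augmentation, and obtains perceptual features without extra pre-trained models; this sketch abstracts the feature source behind a generic `feature_net`.

```python
import torch
import torch.nn.functional as F


def perceptual_mae_loss(recon_img, real_img, mask, feature_net, lam=1.0):
    """Sketch: MAE pixel loss plus a perceptual similarity term.

    Illustrative only, not the authors' implementation. `recon_img` and
    `real_img` are (B, C, H, W) images (decoder output after un-patchifying),
    `mask` is a (B, 1, H, W) map of the patches hidden from the encoder, and
    `feature_net` is assumed to return a list of intermediate feature maps
    (in the paper these are obtained without additional pre-trained models,
    e.g. from a jointly trained discriminator rather than a fixed network).
    """
    # (1) Standard MAE objective: mean squared error on the masked pixels only.
    per_pixel = F.mse_loss(recon_img, real_img, reduction="none").mean(dim=1, keepdim=True)
    pixel_loss = (per_pixel * mask).sum() / mask.sum().clamp(min=1)

    # (2) Perceptual similarity: match intermediate features of the generated
    # and real images, encouraging agreement on higher, scene-level structure
    # rather than only on per-pixel values.
    perc_loss = recon_img.new_zeros(())
    for f_rec, f_real in zip(feature_net(recon_img), feature_net(real_img)):
        perc_loss = perc_loss + F.l1_loss(f_rec, f_real.detach())

    return pixel_loss + lam * perc_loss
```

In practice the weight `lam` and the choice of feature layers are hyperparameters, and the adversarial objectives described in the abstract would be added on top of this combined reconstruction loss.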
Benchmarks
| Benchmark | Methodology | Params | Top-1 Accuracy |
|---|---|---|---|
| Self-Supervised Image Classification on ImageNet (linear probing) | PercMAE (ViT-B) | 80M | 78.1% |
| Self-Supervised Image Classification on ImageNet (linear probing) | PercMAE (ViT-B, dVAE) | 80M | 79.8% |
| Self-Supervised Image Classification on ImageNet (fine-tuned) | PercMAE (ViT-L, dVAE) | 307M | 88.6% |
| Self-Supervised Image Classification on ImageNet (fine-tuned) | PercMAE (ViT-L) | 307M | 88.1% |