Improving Visual Representation Learning through Perceptual Understanding
Samyakh Tukra, Frederick Hoffman, Ken Chatfield

Abstract
We present an extension to masked autoencoders (MAE) which improves the representations learnt by the model by explicitly encouraging the learning of higher, scene-level features. We do this by: (i) introducing a perceptual similarity term between generated and real images, and (ii) incorporating several techniques from the adversarial training literature, including multi-scale training and adaptive discriminator augmentation. The combination of these results not only in better pixel reconstruction but also in representations which appear to better capture higher-level details within images. More consequentially, we show how our method, Perceptual MAE, leads to better performance when used for downstream tasks, outperforming previous methods. We achieve 78.1% top-1 accuracy when linear probing on ImageNet-1K and up to 88.1% when fine-tuning, with similar results for other downstream tasks, all without the use of additional pre-trained models or data.
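As a rough illustration of the core idea, the sketch below (PyTorch; all names are hypothetical, not the authors' API) combines the standard MAE masked-pixel reconstruction loss with a perceptual feature-matching term between reconstructed and real images. The paper's full method additionally uses adversarial training with multi-scale training and adaptive discriminator augmentation, and obtains perceptual features without extra pre-trained models; this sketch abstracts the feature source behind a generic `feature_net`.

```python
import torch
import torch.nn.functional as F


def perceptual_mae_loss(recon_img, real_img, mask, feature_net, lam=1.0):
    """Sketch: MAE pixel loss plus a perceptual similarity term.

    Illustrative only, not the authors' implementation. `recon_img` and
    `real_img` are (B, C, H, W) images (decoder output after un-patchifying),
    `mask` is a (B, 1, H, W) map of the patches hidden from the encoder, and
    `feature_net` is assumed to return a list of intermediate feature maps
    (in the paper these are obtained without additional pre-trained models,
    e.g. from a jointly trained discriminator rather than a fixed network).
    """
    # (1) Standard MAE objective: mean squared error on the masked pixels only.
    per_pixel = F.mse_loss(recon_img, real_img, reduction="none").mean(dim=1, keepdim=True)
    pixel_loss = (per_pixel * mask).sum() / mask.sum().clamp(min=1)

    # (2) Perceptual similarity: match intermediate features of the generated
    # and real images, encouraging agreement on higher, scene-level structure
    # rather than only on per-pixel values.
    perc_loss = recon_img.new_zeros(())
    for f_rec, f_real in zip(feature_net(recon_img), feature_net(real_img)):
        perc_loss = perc_loss + F.l1_loss(f_rec, f_real.detach())

    return pixel_loss + lam * perc_loss
```

In practice the weight `lam` and the choice of feature layers are hyperparameters, and the adversarial objectives described in the abstract would be added on top of this combined reconstruction loss.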
Benchmarks
| Benchmark | Methodology | Params | Top-1 Accuracy |
|---|---|---|---|
| Self-Supervised Image Classification on ImageNet (linear probing) | PercMAE (ViT-B) | 80M | 78.1% |
| Self-Supervised Image Classification on ImageNet (linear probing) | PercMAE (ViT-B, dVAE) | 80M | 79.8% |
| Self-Supervised Image Classification on ImageNet (fine-tuned) | PercMAE (ViT-L, dVAE) | 307M | 88.6% |
| Self-Supervised Image Classification on ImageNet (fine-tuned) | PercMAE (ViT-L) | 307M | 88.1% |