Mark Chen, Jeff Wu, Rewon Child, Ilya Sutskever, David Luan, Alec Radford, Heewoo Jun, Prafulla Dhariwal

Abstract
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
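To make the pretraining objective concrete, here is a minimal sketch of next-pixel prediction with a causal Transformer: the image is flattened into a 1D raster-order sequence and the model predicts each pixel token from the ones before it, with no 2D inductive bias. Everything here is a hypothetical stand-in, not the paper's released code: `TinyPixelGPT` and all sizes are toy choices, and the 512-entry vocabulary assumes a quantized color palette in place of raw RGB (the paper uses a reduced 9-bit palette).

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the real model is GPT-2 scale.
NUM_PIXEL_VALUES = 512   # assumed color vocabulary after palette quantization
SEQ_LEN = 32 * 32        # a 32x32 image flattened in raster order, no 2D prior

class TinyPixelGPT(nn.Module):
    def __init__(self, d_model=128, n_head=4, n_layer=2):
        super().__init__()
        self.embed = nn.Embedding(NUM_PIXEL_VALUES, d_model)
        self.pos = nn.Parameter(torch.zeros(SEQ_LEN, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_head, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.head = nn.Linear(d_model, NUM_PIXEL_VALUES)

    def forward(self, x):  # x: (batch, seq) of pixel token ids
        h = self.embed(x) + self.pos[: x.size(1)]
        # Causal mask so position t only attends to positions < t.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.blocks(h, mask=mask)
        return self.head(h)  # logits over the next pixel token

# Auto-regressive objective: predict pixel t from pixels < t.
model = TinyPixelGPT()
pixels = torch.randint(0, NUM_PIXEL_VALUES, (2, SEQ_LEN))  # stand-in batch
logits = model(pixels[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, NUM_PIXEL_VALUES), pixels[:, 1:].reshape(-1)
)
loss.backward()
```

Because the sequence order carries no explicit 2D structure, any spatial regularity the representations capture must be learned from the data itself.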
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Image Classification on STL-10 | iGPT-L | Accuracy: 97.1% |
| Image Classification on STL-10 | AMDIM-L | Accuracy: 94.2% |
| Self-Supervised Image Classification on ImageNet | iGPT-XL (64x64, 3072 features) | Params: 6800M; Top-1 Accuracy: 68.7% |
| Self-Supervised Image Classification on ImageNet | iGPT-L (48x48) | Params: 1400M; Top-1 Accuracy: 65.2% |
| Self-Supervised Image Classification on ImageNet | iGPT-XL (64x64, 15360 features) | Params: 6801M; Top-1 Accuracy: 72.0% |
| Self-Supervised Image Classification on ImageNet | iGPT-L (32x32) | Params: 1400M; Top-1 Accuracy: 60.3% |
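The self-supervised numbers above follow the linear-probe protocol: the pretrained model is frozen and only a linear classifier is fit on its features. A minimal sketch of that protocol with scikit-learn follows; the arrays are random stand-ins for features extracted from a pretrained model (the 3072 dimension mirrors the feature width in the table), not real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in "frozen" features and labels; in practice these would be
# activations from a fixed layer of the pretrained model.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1000, 3072))    # hypothetical frozen features
train_labels = rng.integers(0, 10, size=1000)  # hypothetical class labels
test_feats = rng.normal(size=(200, 3072))

# The probe itself: a single linear classifier, nothing else is trained.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_feats, train_labels)
preds = probe.predict(test_feats)
```

Because only the linear layer is trained, probe accuracy directly measures how linearly separable the classes are in the learned feature space.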