Mark Chen, Jeff Wu, Rewon Child, Ilya Sutskever, David Luan, Alec Radford, Heewoo Jun, Prafulla Dhariwal

Abstract
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
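To make the pretraining objective concrete, here is a minimal sketch of next-pixel prediction with a causal Transformer: the image is flattened into a 1D raster-order sequence and the model predicts each pixel token from the ones before it, with no 2D inductive bias. Everything here is a hypothetical stand-in, not the paper's released code: `TinyPixelGPT` and all sizes are toy choices, and the 512-entry vocabulary assumes a quantized color palette in place of raw RGB (the paper uses a reduced 9-bit palette).

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the real model is GPT-2 scale.
NUM_PIXEL_VALUES = 512   # assumed color vocabulary after palette quantization
SEQ_LEN = 32 * 32        # a 32x32 image flattened in raster order, no 2D prior

class TinyPixelGPT(nn.Module):
    def __init__(self, d_model=128, n_head=4, n_layer=2):
        super().__init__()
        self.embed = nn.Embedding(NUM_PIXEL_VALUES, d_model)
        self.pos = nn.Parameter(torch.zeros(SEQ_LEN, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_head, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.head = nn.Linear(d_model, NUM_PIXEL_VALUES)

    def forward(self, x):  # x: (batch, seq) of pixel token ids
        h = self.embed(x) + self.pos[: x.size(1)]
        # Causal mask so position t only attends to positions < t.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.blocks(h, mask=mask)
        return self.head(h)  # logits over the next pixel token

# Auto-regressive objective: predict pixel t from pixels < t.
model = TinyPixelGPT()
pixels = torch.randint(0, NUM_PIXEL_VALUES, (2, SEQ_LEN))  # stand-in batch
logits = model(pixels[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, NUM_PIXEL_VALUES), pixels[:, 1:].reshape(-1)
)
loss.backward()
```

Because the sequence order carries no explicit 2D structure, any spatial regularity the representations capture must be learned from the data itself.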
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Image Classification on STL-10 | iGPT-L | Accuracy: 97.1% |
| Image Classification on STL-10 | AMDIM-L | Accuracy: 94.2% |
| Self-Supervised Image Classification on ImageNet | iGPT-XL (64x64, 3072 features) | Params: 6800M; Top-1 Accuracy: 68.7% |
| Self-Supervised Image Classification on ImageNet | iGPT-L (48x48) | Params: 1400M; Top-1 Accuracy: 65.2% |
| Self-Supervised Image Classification on ImageNet | iGPT-XL (64x64, 15360 features) | Params: 6801M; Top-1 Accuracy: 72.0% |
| Self-Supervised Image Classification on ImageNet | iGPT-L (32x32) | Params: 1400M; Top-1 Accuracy: 60.3% |
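The self-supervised numbers above follow the linear-probe protocol: the pretrained model is frozen and only a linear classifier is fit on its features. A minimal sketch of that protocol with scikit-learn follows; the arrays are random stand-ins for features extracted from a pretrained model (the 3072 dimension mirrors the feature width in the table), not real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in "frozen" features and labels; in practice these would be
# activations from a fixed layer of the pretrained model.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1000, 3072))    # hypothetical frozen features
train_labels = rng.integers(0, 10, size=1000)  # hypothetical class labels
test_feats = rng.normal(size=(200, 3072))

# The probe itself: a single linear classifier, nothing else is trained.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_feats, train_labels)
preds = probe.predict(test_feats)
```

Because only the linear layer is trained, probe accuracy directly measures how linearly separable the classes are in the learned feature space.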