Abstract
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
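To make the objective described above more concrete, here is a minimal sketch of what a combined autoregressive loss over image patches and text tokens could look like. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss functions, so the use of mean squared error for patches, cross-entropy for text, the weighting factor `alpha`, and all function names are hypothetical. For simplicity, the two modalities are shifted and scored separately.

```python
import math


def patch_regression_loss(pred_patches, target_patches):
    """Mean squared error between predicted and target patch vectors (assumed loss)."""
    total, count = 0.0, 0
    for pred, target in zip(pred_patches, target_patches):
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    return total / count


def token_cross_entropy(logits, target_ids):
    """Average negative log-likelihood of the target tokens under softmax(logits)."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        m = max(row)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_norm - row[target]
    return total / len(target_ids)


def multimodal_autoregressive_loss(pred_patches, patches, text_logits, text_ids, alpha=1.0):
    """Hypothetical combined objective: next-patch regression plus next-token prediction."""
    # pred_patches[i] is treated as the decoder's guess for patches[i + 1].
    image_loss = patch_regression_loss(pred_patches[:-1], patches[1:])
    # text_logits[j] is treated as the score vector for text_ids[j + 1].
    text_loss = token_cross_entropy(text_logits[:-1], text_ids[1:])
    # alpha is an assumed weighting between the two terms.
    return text_loss + alpha * image_loss


# Toy usage: three 2-dimensional "patches" and a three-token caption over a 2-word vocabulary.
patches = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
pred_patches = [[1.0, 1.0], [2.0, 2.0], [9.0, 9.0]]  # last prediction has no target
text_ids = [0, 1, 0]
text_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # uniform predictions
loss = multimodal_autoregressive_loss(pred_patches, patches, text_logits, text_ids)
```

In this toy case the patch predictions match their shifted targets exactly, so the image term is zero and the total reduces to the cross-entropy of a uniform guess over two tokens.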
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| image-classification-on-imagenet | AIMv2-3B | Top 1 Accuracy: 88.5% |
| image-classification-on-imagenet | AIMv2-2B | Number of params: 2700M |
| image-classification-on-imagenet | AIMv2-3B (448 res) | Top 1 Accuracy: 89.5% |
| image-classification-on-imagenet | AIMv2-L | Number of params: 300M; Top 1 Accuracy: 86.6% |
| image-classification-on-imagenet | AIMv2-1B | Number of params: 1200M; Top 1 Accuracy: 88.1% |
| image-classification-on-imagenet | AIMv2-H | Number of params: 600M; Top 1 Accuracy: 87.5% |
| image-classification-on-inaturalist | AIMv2-1B | Top 1 Accuracy: 79.7% |
| image-classification-on-inaturalist | AIMv2-H | Top 1 Accuracy: 77.9% |
| image-classification-on-inaturalist | AIMv2-3B | Top 1 Accuracy: 81.5% |
| image-classification-on-inaturalist | AIMv2-L | Top 1 Accuracy: 76% |
| image-classification-on-inaturalist | AIMv2-3B (448 res) | Top 1 Accuracy: 85.9% |