VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Furu Wei

Abstract
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce Mixture-of-Modality-Experts (MoME) Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs. Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks, including VQA, NLVR2 and image-text retrieval. The code and pretrained models are available at https://aka.ms/vlmo.
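To make the architecture concrete, below is a minimal sketch of a single MoME Transformer block: a self-attention layer shared across modalities followed by a pool of modality-specific feed-forward experts, with tokens routed to the expert that matches the input modality. This is an illustrative assumption of the layout described in the abstract, not the official VLMo implementation; the class name `MoMEBlock`, the expert keys, and the layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class MoMEBlock(nn.Module):
    """Sketch of a Mixture-of-Modality-Experts (MoME) Transformer block.

    Illustrative only: names, sizes, and expert placement are assumptions,
    not the official VLMo code.
    """

    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Self-attention is shared by all modalities.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Pool of modality-specific feed-forward experts.
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, dim * mlp_ratio),
                nn.GELU(),
                nn.Linear(dim * mlp_ratio, dim),
            )
            for name in ("vision", "language", "vision_language")
        })

    def forward(self, x, modality):
        # Shared self-attention over the (image, text, or fused) token sequence.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Route tokens to the expert matching the input modality:
        # "vision" for image-only, "language" for text-only,
        # "vision_language" for a fused image-text sequence.
        x = x + self.experts[modality](self.norm2(x))
        return x


# The same block can serve in a dual encoder (separate image/text passes,
# useful for efficient retrieval) or a fusion encoder (joint pass, useful
# for classification tasks such as VQA and NLVR2).
block = MoMEBlock()
image_tokens = torch.randn(2, 197, 768)   # e.g. ViT patch tokens + [CLS]
text_tokens = torch.randn(2, 40, 768)     # e.g. subword embeddings
img_out = block(image_tokens, "vision")      # dual-encoder image pass
txt_out = block(text_tokens, "language")     # dual-encoder text pass
fused = block(torch.cat([image_tokens, text_tokens], dim=1), "vision_language")  # fusion pass
```

Because only the feed-forward experts are modality-specific while attention is shared, the same pretrained weights can be reused in either encoding mode at fine-tuning time, which is the modeling flexibility the abstract refers to.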
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Image Retrieval on PhotoChat | VLMo | R@1: 11.5, R@5: 30.0, R@10: 39.4, Sum(R@1,5,10): 83.2 |
| Text Retrieval on Image-Chat | VLMo | R@1: 46.8, R@5: 67.5, Sum(R@1,5): 114.3 |
| Visual Question Answering on VQA v2 test-dev | VLMo | Accuracy: 82.78 |
| Visual Question Answering on VQA v2 test-std | VLMo | Overall: 81.30, Yes/No: 94.68, Number: 67.26, Other: 72.87 |
| Visual Reasoning on NLVR2 dev | VLMo | Accuracy: 85.64 |
| Visual Reasoning on NLVR2 test | VLMo | Accuracy: 86.86 |