LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Hao Tan; Mohit Bansal

Abstract
Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language semantics, we pre-train the model on large amounts of image-and-sentence pairs via five diverse representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help in learning both intra-modality and cross-modality relationships. After fine-tuning from our pre-trained parameters, our model achieves state-of-the-art results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our pre-trained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR2, improving the previous best result by 22% absolute (from 54% to 76%). Lastly, we present detailed ablation studies showing that both our novel model components and pre-training strategies significantly contribute to our strong results, and we include several attention visualizations for the different encoders. Code and pre-trained models are publicly available at: https://github.com/airsplay/lxmert
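To make the three-encoder layout concrete, the PyTorch sketch below wires together a language encoder, an object relationship encoder, and a cross-modality encoder with bidirectional cross-attention, topped with an example answer head. This is not the released implementation (see the GitHub link above); the layer counts (9 language, 5 object, 5 cross-modality layers) and the 2048-d RoI features follow the paper, while module names, the simplified residual/normalization scheme, and the 3129-way answer vocabulary are illustrative assumptions.

```python
# Minimal sketch of an LXMERT-style dual-stream model (not the authors' code).
import torch
import torch.nn as nn


class CrossModalityLayer(nn.Module):
    """One cross-modality layer: each stream cross-attends to the other, then self-attends."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.lang_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.vis_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, lang, vis):
        # Language tokens attend over object features, and vice versa.
        lang = lang + self.lang_cross(lang, vis, vis)[0]
        vis = vis + self.vis_cross(vis, lang, lang)[0]
        # Per-stream self-attention + feed-forward afterwards.
        return self.lang_self(lang), self.vis_self(vis)


class LXMERTSketch(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, heads=12,
                 n_lang=9, n_vis=5, n_cross=5, n_answers=3129):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        # Object inputs: 2048-d detector RoI features plus 4-d box coordinates.
        self.obj_feat_proj = nn.Linear(2048, dim)
        self.obj_pos_proj = nn.Linear(4, dim)
        self.lang_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), n_lang)
        self.obj_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), n_vis)
        self.cross_layers = nn.ModuleList(
            CrossModalityLayer(dim, heads) for _ in range(n_cross))
        # Example QA head on the pooled cross-modal language output.
        self.answer_head = nn.Linear(dim, n_answers)

    def forward(self, token_ids, roi_feats, roi_boxes):
        lang = self.lang_encoder(self.word_emb(token_ids))
        vis = self.obj_encoder(self.obj_feat_proj(roi_feats) + self.obj_pos_proj(roi_boxes))
        for layer in self.cross_layers:
            lang, vis = layer(lang, vis)
        # Pool the first language position ([CLS]-style) for answer prediction.
        return self.answer_head(lang[:, 0])


# Smoke test with random inputs: a 12-token sentence and 36 detected objects per image.
model = LXMERTSketch()
logits = model(torch.randint(0, 30522, (2, 12)),
               torch.randn(2, 36, 2048), torch.rand(2, 36, 4))
print(logits.shape)  # torch.Size([2, 3129])
```

The same pooled output (or the per-token/per-object outputs) would carry the pre-training heads listed in the abstract, e.g. masked language modeling over language positions, feature regression and label classification over masked objects, and binary cross-modality matching.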
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| Visual Question Answering on A-OKVQA | LXMERT | DA VQA Score: 25.9; MC Accuracy: 41.6 |
| Visual Question Answering on GQA test-dev | LXMERT (Pre-train + scratch) | Accuracy: 60.0 |
| Visual Question Answering on GQA test-std | LXMERT | Accuracy: 60.3 |
| Visual Question Answering on GQA Test2019 | LXR955, Ensemble | Accuracy: 62.71; Binary: 79.79; Consistency: 93.1; Distribution: 6.42; Open: 47.64; Plausibility: 85.21; Validity: 96.36 |
| Visual Question Answering on GQA Test2019 | LXR955, Single Model | Accuracy: 60.33; Binary: 77.16; Consistency: 89.59; Distribution: 5.69; Open: 45.47; Plausibility: 84.53; Validity: 96.35 |
| Visual Question Answering on VizWiz 2018 | LXR955, No Ensemble | Number: 24.76; Other: 39.0; Overall: 55.4; Unanswerable: 82.26; Yes/No: 74.0 |
| Visual Question Answering on VQA v2 test-dev | LXMERT (Pre-train + scratch) | Accuracy: 69.9 |
| Visual Question Answering on VQA v2 test-std | LXMERT | Overall: 72.5 |
| Visual Reasoning on NLVR2 dev | LXMERT (Pre-train + scratch) | Accuracy: 74.9 |
| Visual Reasoning on NLVR2 test | LXMERT | Accuracy: 76.2 |