HyperAIHyperAI

Command Palette

Search for a command to run...

3 months ago

An Empirical Study of Training End-to-End Vision-and-Language Transformers

An Empirical Study of Training End-to-End Vision-and-Language Transformers

Abstract

Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and provide insights on how to train a performant VL transformer. METER achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based model by 1.04%, and outperforming the previous best fully transformer-based model by 1.6%. Notably, when further scaled up, our best VQA model achieves an accuracy of 80.54%. Code and pre-trained models are released at https://github.com/zdou0830/METER.

Code Repositories

iflytek/vle
pytorch
Mentioned in GitHub
claws-lab/multimodal-robustness-xmai
pytorch
Mentioned in GitHub
zdou0830/meter
Official
pytorch
Mentioned in GitHub

Benchmarks

BenchmarkMethodologyMetrics
cross-modal-retrieval-on-coco-2014METER
Image-to-text R@1: 76.16
Image-to-text R@10: 96.82
Image-to-text R@5: 93.16
Text-to-image R@1: 57.08
Text-to-image R@10: 90.07
Text-to-image R@5: 82.66

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing
Get Started

Hyper Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp