Luowei Zhou Hamid Palangi Lei Zhang Houdong Hu Jason J. Corso Jianfeng Gao

Abstract
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large number of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
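The abstract states that the two pre-training objectives differ only in the self-attention mask applied to the shared transformer. The sketch below illustrates that idea; it is a minimal illustration, not the released implementation, and the function name, input layout (image regions followed by caption tokens), and shapes are assumptions.

```python
import torch

def build_attention_mask(num_img: int, num_txt: int, seq2seq: bool) -> torch.Tensor:
    """Return an (L, L) mask where 1 means 'may attend to', with L = num_img + num_txt.

    The input sequence is assumed to be [image regions | caption tokens].
    """
    L = num_img + num_txt
    if not seq2seq:
        # Bidirectional objective: every position may attend to every other position.
        return torch.ones(L, L)
    mask = torch.zeros(L, L)
    # Image regions attend bidirectionally among themselves.
    mask[:num_img, :num_img] = 1
    # Caption tokens attend to all image regions...
    mask[num_img:, :num_img] = 1
    # ...and only to earlier caption tokens (causal), which supports generation.
    mask[num_img:, num_img:] = torch.tril(torch.ones(num_txt, num_txt))
    return mask

# Example: 3 image regions, 4 caption tokens.
bidir_mask = build_attention_mask(3, 4, seq2seq=False)  # all ones
s2s_mask = build_attention_mask(3, 4, seq2seq=True)     # causal over the text block
```

Either mask is passed to the same shared transformer; masked tokens are then predicted conditioned only on the positions the chosen mask allows, which is how one network serves both understanding- and generation-style pre-training.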
Code Repositories

https://github.com/LuoweiZhou/VLP
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| Image Captioning on COCO Captions (test) | Unified VLP | BLEU-4: 36.5, CIDEr: 116.9, METEOR: 28.4, SPICE: 21.2 |
| Image Captioning on Flickr30k Captions (test) | Unified VLP | BLEU-4: 30.1, CIDEr: 67.4, METEOR: 23.0, SPICE: 17.0 |
| Visual Question Answering on VQA v2 (test-std) | Unified VLP | Overall: 70.7 |