Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Abstract

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is available at https://github.com/microsoft/FIBER.
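To make the "fusion in the backbone" idea above concrete, the PyTorch sketch below augments a standard transformer block with a gated cross-attention sub-layer that attends to features from the other modality, instead of fusing modalities in a separate stack on top of the backbones. This is a minimal illustration only: the FusionBlock class name, the zero-initialized gate, the layer layout, and all dimensions are assumptions for the sketch, not FIBER's actual modules (refer to microsoft/FIBER for the real implementation).

```python
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    """Illustrative sketch of fusion inside a backbone block:
    self-attention within one modality, plus a gated cross-attention
    sub-layer that reads the other modality's token features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Assumed detail: a learnable gate initialized to zero, so the block
        # starts out behaving like an unmodified uni-modal layer.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Standard self-attention within this modality's backbone.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention into the other modality, scaled by the gate.
        h = self.norm2(x)
        x = x + self.gate * self.cross_attn(h, other, other, need_weights=False)[0]
        # Feed-forward sub-layer.
        return x + self.mlp(self.norm3(x))


# Toy usage: fuse image patch tokens and text tokens inside the backbones.
img_tokens = torch.randn(2, 196, 768)  # (batch, image patches, dim)
txt_tokens = torch.randn(2, 32, 768)   # (batch, text tokens, dim)
img_block = FusionBlock(768)
txt_block = FusionBlock(768)
img_tokens = img_block(img_tokens, txt_tokens)
txt_tokens = txt_block(txt_tokens, img_tokens)
```

Because the cross-attention is inserted into the existing backbone layers rather than appended as extra fusion layers, the sketch reflects the memory argument in the abstract: no separate fusion stack has to be kept alongside the two backbones.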

Code Repositories

microsoft/fiber (official PyTorch implementation)

Benchmarks

Benchmark: described-object-detection-on-description
Methodology: FIBER-B
Metrics:
  Intra-scenario ABS mAP: 26.0
  Intra-scenario FULL mAP: 22.7
  Intra-scenario PRES mAP: 21.5

Benchmark: object-detection-on-coco-o
Methodology: FIBER-B (Swin-B)
Metrics:
  Average mAP: 33.7
  Effective Robustness: 11.43

Benchmark: phrase-grounding-on-flickr30k-entities-dev
Methodology: FIBER-B
Metrics:
  R@1: 87.1
  R@5: 96.1
  R@10: 97.4

Benchmark: phrase-grounding-on-flickr30k-entities-test
Methodology: FIBER-B
Metrics:
  R@1: 87.4
  R@5: 96.4
  R@10: 97.6
