Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Abstract

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is available at https://github.com/microsoft/FIBER.
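To make the "fusion in the backbone" idea above concrete, the PyTorch sketch below augments a standard transformer block with a gated cross-attention sub-layer that attends to features from the other modality, instead of fusing modalities in a separate stack on top of the backbones. This is a minimal illustration only: the FusionBlock class name, the zero-initialized gate, the layer layout, and all dimensions are assumptions for the sketch, not FIBER's actual modules (refer to microsoft/FIBER for the real implementation).

```python
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    """Illustrative sketch of fusion inside a backbone block:
    self-attention within one modality, plus a gated cross-attention
    sub-layer that reads the other modality's token features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Assumed detail: a learnable gate initialized to zero, so the block
        # starts out behaving like an unmodified uni-modal layer.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Standard self-attention within this modality's backbone.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention into the other modality, scaled by the gate.
        h = self.norm2(x)
        x = x + self.gate * self.cross_attn(h, other, other, need_weights=False)[0]
        # Feed-forward sub-layer.
        return x + self.mlp(self.norm3(x))


# Toy usage: fuse image patch tokens and text tokens inside the backbones.
img_tokens = torch.randn(2, 196, 768)  # (batch, image patches, dim)
txt_tokens = torch.randn(2, 32, 768)   # (batch, text tokens, dim)
img_block = FusionBlock(768)
txt_block = FusionBlock(768)
img_tokens = img_block(img_tokens, txt_tokens)
txt_tokens = txt_block(txt_tokens, img_tokens)
```

Because the cross-attention is inserted into the existing backbone layers rather than appended as extra fusion layers, the sketch reflects the memory argument in the abstract: no separate fusion stack has to be kept alongside the two backbones.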

Code Repositories

microsoft/fiber (official PyTorch implementation)

Benchmarks

Benchmark: described-object-detection-on-description
Methodology: FIBER-B
Metrics:
  Intra-scenario ABS mAP: 26.0
  Intra-scenario FULL mAP: 22.7
  Intra-scenario PRES mAP: 21.5

Benchmark: object-detection-on-coco-o
Methodology: FIBER-B (Swin-B)
Metrics:
  Average mAP: 33.7
  Effective Robustness: 11.43

Benchmark: phrase-grounding-on-flickr30k-entities-dev
Methodology: FIBER-B
Metrics:
  R@1: 87.1
  R@5: 96.1
  R@10: 97.4

Benchmark: phrase-grounding-on-flickr30k-entities-test
Methodology: FIBER-B
Metrics:
  R@1: 87.4
  R@5: 96.4
  R@10: 97.6
