5 months ago

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

Wenhui Wang; Hangbo Bao; Li Dong; Johan Bjorck; Zhiliang Peng; Qiang Liu; Kriti Aggarwal; Owais Khan Mohammed; Saksham Singhal; Subhojit Som; Furu Wei

Abstract

A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
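The unified masking objective described in the abstract can be illustrated with a minimal sketch. It treats image tokens, text tokens, and their concatenation (an image-text pair) as plain token sequences and applies the same mask-then-recover corruption to each. The `[MASK]` symbol, the `img_*` token names, the 40% ratio, and the `mask_tokens` helper are all hypothetical choices made for illustration; they are not taken from the BEiT-3 code, and the real model predicts the masked tokens with a Multiway Transformer rather than just recording them.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not the model's actual vocabulary entry


def mask_tokens(tokens, ratio, rng):
    """Mask a random subset of positions.

    Returns (corrupted, targets), where targets maps each masked
    position to the original token the model would have to recover.
    """
    n_mask = max(1, round(len(tokens) * ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = corrupted[p]
        corrupted[p] = MASK
    return corrupted, targets


rng = random.Random(0)

# Made-up discrete visual tokens standing in for "Imglish" and a short English caption.
image_tokens = ["img_17", "img_4", "img_250", "img_91", "img_33", "img_8"]
text_tokens = "a dog runs on the beach".split()
pair_tokens = image_tokens + text_tokens  # image-text pair as one "parallel sentence"

# The same routine handles all three input types.
for name, seq in [("image", image_tokens), ("text", text_tokens), ("pair", pair_tokens)]:
    corrupted, targets = mask_tokens(seq, ratio=0.4, rng=rng)
    print(name, corrupted, targets)
```

The point of the sketch is only that one corruption routine serves all three input types; masking ratios and masking patterns per modality are left as free parameters here.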

Code Repositories

lyan62/data-curation (PyTorch; mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
cross-modal-retrieval-on-coco-2014 | BEiT-3 | Image-to-text R@1: 84.8, R@5: 96.5, R@10: 98.3; Text-to-image R@1: 67.2, R@5: 87.7, R@10: 92.8
cross-modal-retrieval-on-flickr30k | BEiT-3 | Image-to-text R@1: 98.0, R@5: 100.0, R@10: 100.0; Text-to-image R@1: 90.3, R@5: 98.7, R@10: 99.5
instance-segmentation-on-coco | BEiT-3 | mask AP: 54.8
object-detection-on-coco | BEiT-3 | box mAP: 63.7
semantic-segmentation-on-ade20k | BEiT-3 | Params (M): 1900; Validation mIoU: 62.8
semantic-segmentation-on-ade20k-val | BEiT-3 | mIoU: 62.8
visual-question-answering-on-vqa-v2-test-dev | BEiT-3 | Accuracy: 84.19
visual-question-answering-on-vqa-v2-test-std | BEiT-3 | overall: 84.03
visual-reasoning-on-nlvr2-dev | BEiT-3 | Accuracy: 91.51
visual-reasoning-on-nlvr2-test | BEiT-3 | Accuracy: 92.58
zero-shot-cross-modal-retrieval-on-flickr30k | BEiT-3 | Image-to-text R@1: 94.9, R@5: 99.9, R@10: 100.0; Text-to-image R@1: 81.5, R@5: 95.6, R@10: 97.8
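
For readers unfamiliar with the retrieval metric reported above, here is a minimal sketch of how Recall@K (R@K) is commonly computed: a query counts as a hit if at least one correct item appears among its top K ranked candidates, and the score is the percentage of queries that hit. The `recall_at_k` function and the toy candidate ids are illustrative assumptions, not the evaluation code behind these benchmark numbers.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Percentage of queries with at least one relevant item in the top k.

    ranked_ids:   one list per query, candidate ids sorted best-first
    relevant_ids: one set per query, the ids counted as correct
    """
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(candidate in relevant for candidate in ranked[:k])
    )
    return 100.0 * hits / len(ranked_ids)


# Toy rankings for four queries (made-up ids).
ranked = [
    ["c1", "c7", "c3"],  # correct item at rank 1
    ["c9", "c2", "c4"],  # correct item at rank 2
    ["c5", "c6", "c8"],  # correct item at rank 3
    ["c0", "c3", "c1"],  # correct item never retrieved
]
relevant = [{"c1"}, {"c2"}, {"c8"}, {"c9"}]

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(ranked, relevant, k):.1f}")
# prints R@1: 25.0, R@2: 50.0, R@3: 75.0
```

Because the top-K list only grows with K, R@K can never decrease as K increases, so R@1 ≤ R@5 ≤ R@10 holds for any ranking.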
