5 months ago

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Li Junnan ; Li Dongxu ; Xiong Caiming ; Hoi Steven

Abstract

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https://github.com/salesforce/BLIP.
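
The caption bootstrapping described in the abstract can be illustrated with a minimal Python sketch. This is not the released BLIP code: the function `bootstrap_captions`, the `captioner` and `is_match` callables, and the toy data are all hypothetical stand-ins, and running the filter over both the web text and the synthetic caption is an assumption of this sketch rather than something the abstract spells out.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image id, caption text); images are plain ids in this toy


def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],
    is_match: Callable[[str, str], bool],
) -> List[Pair]:
    """Hypothetical sketch: add a synthetic caption per image, keep only matching texts."""
    kept: List[Pair] = []
    for image, web_text in web_pairs:
        synthetic = captioner(image)  # captioner proposes a new caption
        for text in (web_text, synthetic):
            if is_match(image, text):  # filter drops texts judged noisy
                kept.append((image, text))
    return kept


# Toy stand-ins (assumptions): a lookup-table captioner and a keyword-based filter.
toy_captions = {"img1": "a red sneaker on a table", "img2": "a dog running on sand"}
toy_tags = {"img1": {"sneaker"}, "img2": {"dog"}}


def toy_captioner(image: str) -> str:
    return toy_captions[image]


def toy_filter(image: str, text: str) -> bool:
    return any(word in toy_tags[image] for word in text.split())


web_pairs = [("img1", "buy cheap shoes now"), ("img2", "a dog on a beach")]
cleaned = bootstrap_captions(web_pairs, toy_captioner, toy_filter)
```

With these toy inputs, the noisy web text for `img1` is dropped while its synthetic caption is kept, and both texts for `img2` survive. In a real setup the two callables would be learned models rather than lookups.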

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| image-captioning-on-nocaps-val-in-domain | BLIP_ViT-L | CIDEr: 114.9; Pre-train (#images): 129M; SPICE: 15.2 |
| image-captioning-on-nocaps-val-in-domain | BLIP_CapFilt-L | CIDEr: 111.8; Pre-train (#images): 129M; SPICE: 14.9 |
| image-captioning-on-nocaps-val-near-domain | BLIP_ViT-L | CIDEr: 112.1; Pre-train (#images): 129M; SPICE: 14.9 |
| image-captioning-on-nocaps-val-near-domain | BLIP_CapFilt-L | CIDEr: 108.6; Pre-train (#images): 129M; SPICE: 14.8 |
| image-captioning-on-nocaps-val-out-domain | BLIP_CapFilt-L | CIDEr: 111.5; Pre-train (#images): 129M; SPICE: 14.2 |
| image-captioning-on-nocaps-val-out-domain | BLIP_ViT-L | CIDEr: 115.3; Pre-train (#images): 129M; SPICE: 14.4 |
| image-captioning-on-nocaps-val-overall | BLIP_CapFilt-L | CIDEr: 109.6; Pre-train (#images): 129M; SPICE: 14.7 |
| image-captioning-on-nocaps-val-overall | BLIP_ViT-L | CIDEr: 113.2; Pre-train (#images): 129M; SPICE: 14.8 |
| image-text-matching-on-commercialadsdataset | BLIP | ADD(S) AUC: 83.51 |
| open-vocabulary-attribute-detection-on-ovad-1 | BLIP | mean average precision: 24.3 |
| visual-reasoning-on-nlvr2-test | BLIP-129M | Accuracy: 83.09 |
