BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li; Dongxu Li; Silvio Savarese; Steven Hoi

Abstract

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
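
The frozen-encoder/frozen-LLM design described above can be exercised end to end with the Hugging Face transformers integration listed under Code Repositories below. The following is a minimal sketch, assuming the Salesforce/blip2-opt-2.7b checkpoint (frozen image encoder bridged to a frozen OPT 2.7B decoder via the Q-Former) and a publicly reachable example image; it is an illustration rather than the authors' evaluation pipeline.

```python
# Minimal sketch: zero-shot captioning and instructed generation with BLIP-2
# via Hugging Face transformers. The checkpoint name is an assumption; any
# Salesforce/blip2-* checkpoint with a matching processor should behave the same.
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32  # fp16 assumes a GPU

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Unconditional caption: only the image passes through the frozen encoder + Q-Former.
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip())

# Instructed generation: a text prompt is placed after the query embeddings
# and the frozen language model decodes the continuation.
prompt = "Question: how many cats are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
answer_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0].strip())
```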

Code Repositories

salesforce/lavis (Official, PyTorch; see the usage sketch after this list)
yukw777/videoblip (PyTorch)
rabiulcste/vqazero (PyTorch)
albertotestoni/ndq_visual_objects (PyTorch)
jiwanchung/vlis (PyTorch)
gregor-ge/mblip (PyTorch)
baaivision/eva (PyTorch)
thudm/visualglm-6b (PyTorch)
huggingface/transformers (PyTorch)
linzhiqiu/clip-flant5 (PyTorch)
junshutang/Make-It-3D (PyTorch)
kdr/videorag-mrr2024
alibaba/graphtranslator (PyTorch)
facebookresearch/multimodal (PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
generative-visual-question-answering-on-pmc | BLIP-2 | BLEU-1: 7.6
image-captioning-on-coco-captions | BLIP-2 ViT-G FlanT5 XL (zero-shot) | BLEU-4: 42.4, CIDEr: 144.5
image-captioning-on-coco-captions | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4: 43.5, CIDEr: 145.2
image-captioning-on-coco-captions | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4: 43.7, CIDEr: 145.8
image-captioning-on-nocaps-val-in-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 123.7, SPICE: 15.8, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-in-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 123.7, SPICE: 16.3, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-in-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 123.0, SPICE: 15.8, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-near-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 119.2, SPICE: 15.3, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-near-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 120.2, SPICE: 15.9, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-near-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 117.8, SPICE: 15.4, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-out-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 124.8, SPICE: 15.1, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-out-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 124.4, SPICE: 14.8, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-out-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 123.4, SPICE: 15.1, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-overall | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 121.6, SPICE: 15.8, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-overall | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 121.0, SPICE: 15.3, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-overall | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 119.7, SPICE: 15.4, Pre-train (#images): 1.1B
image-retrieval-on-coco | BLIP-2 ViT-G (fine-tuned) | Recall@1: 68.3, Recall@5: 87.7, Recall@10: 92.6
image-retrieval-on-coco | BLIP-2 ViT-L (fine-tuned) | Recall@1: 66.3, Recall@5: 86.5, Recall@10: 91.8
image-retrieval-on-flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1: 88.6, Recall@5: 97.6, Recall@10: 98.9
image-retrieval-on-flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1: 89.7, Recall@5: 98.1, Recall@10: 98.9
image-to-text-retrieval-on-coco | BLIP-2 (ViT-L, fine-tuned) | Recall@1: 83.5, Recall@5: 96.0, Recall@10: 98.0
image-to-text-retrieval-on-coco | BLIP-2 (ViT-G, fine-tuned) | Recall@1: 85.4, Recall@5: 97.0, Recall@10: 98.5
image-to-text-retrieval-on-flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1: 96.9, Recall@5: 100, Recall@10: 100
image-to-text-retrieval-on-flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1: 97.6, Recall@5: 100, Recall@10: 100
open-vocabulary-attribute-detection-on-ovad-1 | BLIP-2 (pretrained) | Mean average precision: 25.5
visual-instruction-following-on-llava-bench | BLIP-2 | Avg score: 38.1
visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 34.6
visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 44.7
visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 44.4
visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 36.4
visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 33.9
visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 44.2
visual-question-answering-on-mm-vet | BLIP-2-12B | GPT-4 score: 22.4±0.2, Params: 12B
visual-question-answering-on-ok-vqa | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 39.4
visual-question-answering-on-ok-vqa | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 45.9
visual-question-answering-on-ok-vqa | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 31.7
visual-question-answering-on-ok-vqa | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 40.7
visual-question-answering-on-ok-vqa | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 30.2
visual-question-answering-on-ok-vqa | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 36.4
visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 52.3
visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 62.3
visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 49.7
visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 52.6
visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 65.0
visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 63.0
visual-question-answering-on-vqa-v2-test-dev-1 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy: 82.30
visual-question-answering-on-vqa-v2-test-dev-1 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy: 81.74
visual-question-answering-on-vqa-v2-test-dev-1 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy: 81.66
visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 65.2
visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 54.3
visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 63.1
visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 50.1
visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 62.6
visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 53.5
visual-question-answering-on-vqa-v2-val-1 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy: 81.55
visual-question-answering-on-vqa-v2-val-1 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy: 82.19
visual-question-answering-on-vqa-v2-val-1 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy: 81.59
visual-question-answering-vqa-on-core-mm | BLIP-2-OPT2.7B | Abductive: 18.96, Analogical: 7.5, Deductive: 2.76, Overall score: 19.31, Params: 3B
visual-question-answering-vqa-on-infoseek | BLIP-2 | Accuracy: 14.6
visual-question-answering-vqa-on-pmc-vqa | BLIP-2 | Accuracy: 24.3
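
For the retrieval benchmarks above, BLIP-2 scores image-text pairs with its first-stage Q-Former representations rather than the frozen LLM. The sketch below shows how an image-text contrastive (ITC) similarity can be computed with the LAVIS feature-extraction interface; the model name "blip2_feature_extractor", the type "pretrain", the output attribute names, and "demo.jpg" are assumptions to be checked against the installed LAVIS version.

```python
# Rough sketch: image-text contrastive similarity from the first-stage Q-Former.
# The feature extractor is assumed to return 32 projected query embeddings per
# image and a projected [CLS] embedding per text; BLIP-2's ITC score is the
# maximum dot product over the query embeddings.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)

raw_image = Image.open("demo.jpg").convert("RGB")            # hypothetical local image
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a photo of a cat sleeping on a sofa")

image_feats = model.extract_features({"image": image}, mode="image")
text_feats = model.extract_features({"text_input": [text]}, mode="text")

img_proj = image_feats.image_embeds_proj[0]      # (32, 256): one vector per learned query
txt_proj = text_feats.text_embeds_proj[0, 0]     # (256,): projected text [CLS] embedding

similarity = (img_proj @ txt_proj).max().item()  # max over queries, as in BLIP-2's ITC
print(f"ITC similarity: {similarity:.3f}")
```

Ranking candidate captions (or candidate images) by this score is the zero-shot analogue of the retrieval rows above; the fine-tuned numbers additionally re-rank top candidates with the image-text matching head.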
