How Much Can CLIP Benefit Vision-and-Language Tasks?

Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer

Abstract

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world. However, it has been observed that large-scale pretraining usually can result in better generalization performance, e.g., CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown a strong zero-shot capability on various vision tasks. To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks. We release our code at https://github.com/clip-vil/CLIP-ViL.
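The zero-shot capability mentioned in the abstract comes from comparing an image embedding against text embeddings of candidate captions. The following is a minimal sketch of that scoring step only, not the authors' code. The toy three-dimensional embeddings, the caption strings, and the helper names `cosine` and `zero_shot_probs` are all hypothetical; a real CLIP model would produce the embeddings with its image and text encoders, and the temperature value here is an assumed constant.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_emb, text_embs, temperature=0.07):
    """Softmax over temperature-scaled image-text cosine similarities."""
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy embeddings standing in for encoder outputs.
image_emb = [0.9, 0.1, 0.0]
text_embs = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
}

probs = zero_shot_probs(image_emb, list(text_embs.values()))
best = max(zip(text_embs, probs), key=lambda pair: pair[1])[0]
print(best, probs)
```

The caption whose embedding is most aligned with the image embedding receives the highest probability, so no task-specific training is needed for this step. The paper's two scenarios go further by using CLIP as the visual encoder inside V&L models, which this sketch does not attempt to reproduce.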

Code Repositories

clip-vil/CLIP-ViL (official; pytorch; mentioned in GitHub)
facebookresearch/reliable_vqa (pytorch; mentioned in GitHub)
jianjieluo/openai-clip-feature (pytorch; mentioned in GitHub)
gchhablani/multilingual-vqa (jax; mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
vision-and-language-navigation-on-rxr | CLEAR-CLIP | ndtw: 53.69
visual-entailment-on-snli-ve-val | CLIP-ViL | Accuracy: 80.20