ClipCap: CLIP Prefix for Image Captioning

Ron Mokady, Amir Hertz, Amit H. Bermano

Abstract

Image captioning is a fundamental task in vision-language understanding, where the model predicts an informative textual caption for a given input image. In this paper, we present a simple approach to this task. We use a CLIP encoding as a prefix to the caption, by employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features that were trained with textual context, making it well suited for vision-language perception. Our key idea is that, together with a pre-trained language model (GPT-2), we obtain a broad understanding of both visual and textual data. Hence, our approach requires only rather quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with fewer trainable parameters. Through quantitative evaluation, we demonstrate that our model achieves results comparable to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while being simpler, faster, and lighter. Our code is available at https://github.com/rmokady/CLIP_prefix_caption.
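To make the prefix mechanism concrete, here is a minimal PyTorch sketch of the idea, assuming Hugging Face transformers. The `PrefixMapper` class, its dimensions (a 512-d CLIP ViT-B/32 image embedding, GPT-2's 768-d hidden size, a prefix of 10 embeddings), and the random stand-in for the CLIP encoding are illustrative assumptions, not the official implementation.

```python
# Minimal sketch of the ClipCap prefix idea (assumptions: PyTorch +
# Hugging Face transformers; PrefixMapper is a hypothetical name, and the
# CLIP embedding is faked with random values for self-containment).
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class PrefixMapper(nn.Module):
    """Maps a CLIP image embedding to a sequence of GPT-2 prefix embeddings.
    The paper's simplest variant is an MLP; a transformer mapper is also used."""
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_embedding):              # (batch, clip_dim)
        prefix = self.mlp(clip_embedding)           # (batch, prefix_len * gpt_dim)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
mapper = PrefixMapper()

# One training step (sketch): concatenate the prefix embeddings with the
# caption's token embeddings and train with the usual LM loss.
clip_embedding = torch.randn(1, 512)      # stand-in for a real CLIP encoding
caption_ids = tokenizer("A dog on the beach.", return_tensors="pt").input_ids
token_embeds = gpt2.transformer.wte(caption_ids)          # (1, T, 768)
prefix_embeds = mapper(clip_embedding)                    # (1, 10, 768)
inputs = torch.cat([prefix_embeds, token_embeds], dim=1)  # (1, 10+T, 768)

# Loss is computed only over caption tokens; prefix positions get -100,
# which the Hugging Face loss ignores.
labels = torch.cat(
    [torch.full((1, mapper.prefix_len), -100, dtype=torch.long), caption_ids],
    dim=1,
)
out = gpt2(inputs_embeds=inputs, labels=labels)
print(out.loss)
```

In the lightest configuration described in the abstract, only `mapper` would receive gradients while CLIP and `gpt2` stay frozen; fine-tuning GPT-2 together with the mapping network corresponds to the "MLP + GPT2 tuning" rows in the benchmark table below.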

Code Repositories

rmokady/clip_prefix_caption (official, PyTorch)
sithu31296/image-captioning (PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| image-captioning-on-coco-captions | ClipCap (Transformer) | BLEU-4: 33.53, CIDEr: 113.08, METEOR: 27.45, SPICE: 21.05 |
| image-captioning-on-coco-captions | ClipCap (MLP + GPT2 tuning) | BLEU-4: 32.15, CIDEr: 108.35, METEOR: 27.1, SPICE: 20.12 |
| image-captioning-on-conceptual-captions | ClipCap (Transformer) | CIDEr: 71.82, ROUGE-L: 25.12, SPICE: 16.07 |
| image-captioning-on-conceptual-captions | ClipCap (MLP + GPT2 tuning) | CIDEr: 87.26, ROUGE-L: 26.71, SPICE: 18.5 |
| image-captioning-on-nocaps-entire | ClipCap (Transformer) | CIDEr: 65.83, SPICE: 10.86 |
| image-captioning-on-nocaps-entire | ClipCap (MLP + GPT2 tuning) | CIDEr: 65.7, SPICE: 11.1 |
| image-captioning-on-nocaps-in-domain | ClipCap (Transformer) | CIDEr: 84.85, SPICE: 12.14 |
| image-captioning-on-nocaps-in-domain | ClipCap (MLP + GPT2 tuning) | CIDEr: 79.73, SPICE: 12.2 |
| image-captioning-on-nocaps-near-domain | ClipCap (Transformer) | CIDEr: 66.82, SPICE: 10.92 |
| image-captioning-on-nocaps-near-domain | ClipCap (MLP + GPT2 tuning) | CIDEr: 67.69, SPICE: 11.26 |
| image-captioning-on-nocaps-out-of-domain | ClipCap (Transformer) | CIDEr: 49.14, SPICE: 9.57 |
| image-captioning-on-nocaps-out-of-domain | ClipCap (MLP + GPT2 tuning) | CIDEr: 49.35, SPICE: 9.7 |
