Ron Mokady, Amir Hertz, Amit H. Bermano

Abstract
Image captioning is a fundamental task in vision-language understanding, where the model predicts an informative textual caption for a given input image. In this paper, we present a simple approach to address this task. We use the CLIP encoding as a prefix to the caption, employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features that were trained with textual context, making it well suited for vision-language perception. Our key idea is that, together with a pre-trained language model (GPT2), we obtain a broad understanding of both visual and textual data. Hence, our approach requires only rather quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with fewer trainable parameters. Through quantitative evaluation, we demonstrate that our model achieves results comparable to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while being simpler, faster, and lighter. Our code is available at https://github.com/rmokady/CLIP_prefix_caption.
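To make the prefix idea concrete, below is a minimal sketch of the ClipCap training step: a small mapping network turns one CLIP image embedding into a short sequence of prefix embeddings, which are concatenated with the caption's GPT2 token embeddings and supervised with the usual language-modeling loss. The prefix length (10), the MLP hidden size, and the use of CLIP ViT-B/32's 512-dimensional embeddings are assumptions for illustration; the official repository differs in details (e.g., it also offers a transformer mapping network).

```python
# Hedged sketch of the ClipCap mapping network + GPT2 prefix training step.
# Assumed sizes: PREFIX_LEN=10, CLIP ViT-B/32 embeddings (512-d), GPT-2 small (768-d).
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

PREFIX_LEN = 10   # number of prefix "tokens" produced from the CLIP embedding (assumed)
CLIP_DIM = 512    # CLIP ViT-B/32 image-embedding size
GPT2_DIM = 768    # GPT-2 (small) hidden size


class ClipPrefixMapper(nn.Module):
    """Maps a single CLIP image embedding to a sequence of GPT-2 prefix embeddings."""

    def __init__(self):
        super().__init__()
        hidden = (CLIP_DIM + PREFIX_LEN * GPT2_DIM) // 2
        self.mlp = nn.Sequential(
            nn.Linear(CLIP_DIM, hidden),
            nn.Tanh(),
            nn.Linear(hidden, PREFIX_LEN * GPT2_DIM),
        )

    def forward(self, clip_embed):                      # (batch, CLIP_DIM)
        prefix = self.mlp(clip_embed)                   # (batch, PREFIX_LEN * GPT2_DIM)
        return prefix.view(-1, PREFIX_LEN, GPT2_DIM)    # (batch, PREFIX_LEN, GPT2_DIM)


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
mapper = ClipPrefixMapper()

# Dummy stand-ins for a CLIP-encoded image and its ground-truth caption.
clip_embed = torch.randn(1, CLIP_DIM)
caption_ids = tokenizer("A dog plays in the park.", return_tensors="pt").input_ids

# Concatenate the mapped prefix with the caption token embeddings and run GPT-2.
prefix_embeds = mapper(clip_embed)                       # (1, PREFIX_LEN, GPT2_DIM)
caption_embeds = gpt2.transformer.wte(caption_ids)       # (1, T, GPT2_DIM)
inputs_embeds = torch.cat([prefix_embeds, caption_embeds], dim=1)

# Supervise only the caption tokens; -100 labels on prefix positions are ignored by the loss.
labels = torch.cat(
    [torch.full((1, PREFIX_LEN), -100, dtype=torch.long), caption_ids], dim=1
)
loss = gpt2(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()  # updates the mapper (and optionally GPT-2); CLIP itself stays frozen
```

In the lightest configuration described in the abstract, only `mapper` receives gradient updates, while both CLIP and GPT2 remain frozen.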
Code Repositories
rmokady/CLIP_prefix_caption (official) — https://github.com/rmokady/CLIP_prefix_caption
Benchmarks
COCO Captions

| Model | BLEU-4 | METEOR | CIDEr | SPICE |
|---|---|---|---|---|
| ClipCap (Transformer) | 33.53 | 27.45 | 113.08 | 21.05 |
| ClipCap (MLP + GPT2 tuning) | 32.15 | 27.1 | 108.35 | 20.12 |

Conceptual Captions

| Model | ROUGE-L | CIDEr | SPICE |
|---|---|---|---|
| ClipCap (Transformer) | 25.12 | 71.82 | 16.07 |
| ClipCap (MLP + GPT2 tuning) | 26.71 | 87.26 | 18.5 |

nocaps

| Split | Model | CIDEr | SPICE |
|---|---|---|---|
| Entire | ClipCap (Transformer) | 65.83 | 10.86 |
| Entire | ClipCap (MLP + GPT2 tuning) | 65.7 | 11.1 |
| In-domain | ClipCap (Transformer) | 84.85 | 12.14 |
| In-domain | ClipCap (MLP + GPT2 tuning) | 79.73 | 12.2 |
| Near-domain | ClipCap (Transformer) | 66.82 | 10.92 |
| Near-domain | ClipCap (MLP + GPT2 tuning) | 67.69 | 11.26 |
| Out-of-domain | ClipCap (Transformer) | 49.14 | 9.57 |
| Out-of-domain | ClipCap (MLP + GPT2 tuning) | 49.35 | 9.7 |