Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid

Abstract
Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from the pre-training dataset. Hence, a key ingredient to their success has been the use of large-scale curated pre-training data aiming at expanding the set of concepts that they can memorize during the pre-training stage. In this work, we explore an alternative to encoding fine-grained knowledge directly into the model's parameters: we instead train the model to retrieve this knowledge from an external memory. Specifically, we propose to equip existing vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time, which greatly improves their zero-shot predictions. Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP. Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example +10.9 on Stanford Cars, +10.2 on CUB-2011 and +7.3 on the recent OVEN benchmark, where we even outperform the fine-tuned models on unseen classes.
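The sketch below illustrates the general idea described in the abstract: a frozen encoder produces a query embedding, the k nearest entries are retrieved from an external memory, and a single-layer fusion transformer refines the query with the retrieved cross-modal embeddings. All names, dimensions, and the retrieval/fusion details are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of retrieval-enhanced embedding refinement in the spirit of RECO.
# The module names, dimensions, and memory layout are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RetrievalFusion(nn.Module):
    """Single-layer transformer that fuses a query embedding with k retrieved embeddings."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.fusion = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True
        )

    def forward(self, query: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # query: (B, D) frozen-CLIP embedding; retrieved: (B, K, D) memory embeddings
        tokens = torch.cat([query.unsqueeze(1), retrieved], dim=1)  # (B, 1+K, D)
        fused = self.fusion(tokens)[:, 0]  # keep the refined query token
        return F.normalize(fused, dim=-1)


def refine_with_memory(query_emb, memory_keys, memory_values, fusion, k=10):
    """Retrieve the k nearest memory entries and fuse their cross-modal values with the query."""
    # query_emb: (B, D) L2-normalised embeddings from the frozen encoder
    # memory_keys: (N, D) same-modality keys; memory_values: (N, D) cross-modal values
    sims = query_emb @ memory_keys.T          # cosine similarity (embeddings are normalised)
    topk = sims.topk(k, dim=-1).indices       # (B, K) indices of nearest neighbours
    retrieved = memory_values[topk]           # (B, K, D) retrieved cross-modal embeddings
    return fusion(query_emb, retrieved)
```

In such a setup, only the light-weight fusion layer would be trained while CLIP stays frozen, and at inference the refined image embedding is matched against (possibly likewise refined) class-name text embeddings for zero-shot prediction.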
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| fine-grained-image-recognition-on-oven | RECO | Accuracy: 12.6 |