CoLLM: A Large Language Model for Composed Image Retrieval

Chuong Huynh, Jinyu Yang, Ashish Tawari, Mubarak Shah, Son Tran, Raffay Hamid, Trishul Chilimbi, Abhinav Shrivastava

Abstract

Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of desired modifications, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic triplets or leveraging vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. However, these methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks and settings. MTCIR yields competitive results, with up to 15% performance improvement. Our refined benchmarks provide more reliable evaluation metrics for CIR models, contributing to the advancement of this important field.
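To make the retrieval setup concrete, the sketch below shows what a composed-retrieval step could look like at inference time: a single joint embedding of the reference image and modification text (in CoLLM, produced by an LLM) is matched against precomputed target-image embeddings. All names here are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def retrieve(query_embedding: np.ndarray, gallery: np.ndarray, k: int = 50) -> np.ndarray:
    """Rank gallery images by cosine similarity to the fused (image + text) query.

    query_embedding: (d,)  joint embedding of reference image + modification text
    gallery:         (n, d) precomputed target-image embeddings
    Returns indices of the top-k gallery images.
    """
    # Normalize both sides so a dot product equals cosine similarity.
    q = query_embedding / np.linalg.norm(query_embedding)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                      # (n,) similarity of every gallery image
    return np.argsort(-scores)[:k]      # indices of the k most similar images
```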

Code Repositories

hmchuong/CoLLM (official)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| zero-shot-composed-image-retrieval-zs-cir-on | CoLLM (Pretrained - BLIP-L/16) | mAP@5: 19.7, mAP@10: 20.4, mAP@50: 23.1 |
| zero-shot-composed-image-retrieval-zs-cir-on | CoLLM (Pretrained - CLIP-L/14) | mAP@5: 20.3, mAP@10: 20.8, mAP@50: 23.4 |
| zero-shot-composed-image-retrieval-zs-cir-on-1 | CoLLM (Finetuned - BLIP-L/16) | R@1: 45.8, R@10: 84.7, R@50: 95.8 |
| zero-shot-composed-image-retrieval-zs-cir-on-1 | CoLLM (Pretrained - BLIP-L/16) | R@1: 35.0, R@10: 78.6, R@50: 94.2 |
| zero-shot-composed-image-retrieval-zs-cir-on-1 | CoLLM (Pretrained - CLIP-L/14) | R@1: 29.7, R@10: 72.8, R@50: 91.5 |
| zero-shot-composed-image-retrieval-zs-cir-on-2 | CoLLM (Pretrained - CLIP-L/14) | R@10: 30.1, R@50: 49.5, (R@10+R@50)/2: 39.8 |
| zero-shot-composed-image-retrieval-zs-cir-on-2 | CoLLM (Pretrained - BLIP-L/16) | R@10: 34.6, R@50: 56.0, (R@10+R@50)/2: 45.3 |
| zero-shot-composed-image-retrieval-zs-cir-on-2 | CoLLM (Finetuned - BLIP-L/16) | R@10: 39.1, R@50: 60.7, (R@10+R@50)/2: 49.9 |
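For reference, these are standard retrieval measures: Recall@k is the fraction of queries whose target image appears among the top k results, mAP@k is mean average precision over the top k (used on benchmarks where a query can have several correct targets), and (R@10+R@50)/2 is simply their arithmetic mean. A minimal sketch of the per-query computations, with illustrative function names not taken from the CoLLM codebase:

```python
import numpy as np

def recall_at_k(ranked: np.ndarray, target: int, k: int) -> float:
    """1.0 if the single ground-truth target is ranked within the top k, else 0.0."""
    return float(target in ranked[:k])

def average_precision_at_k(ranked: np.ndarray, targets: set, k: int) -> float:
    """AP@k for queries that may have multiple correct targets."""
    hits, score = 0, 0.0
    for rank, idx in enumerate(ranked[:k], start=1):
        if idx in targets:
            hits += 1
            score += hits / rank          # precision at each hit position
    return score / min(len(targets), k) if targets else 0.0

# Benchmark-level numbers (e.g. R@10, mAP@5, or the (R@10+R@50)/2 average above)
# are means of these per-query values over the whole evaluation set.
```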
