5 months ago

Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval

Junyang Chen; Hanjiang Lai

Abstract

Zero-shot composed image retrieval (ZS-CIR), which takes a textual modification and a reference image as a query to retrieve a target image without triplet labeling, has gained increasing attention in data mining. Current ZS-CIR research mainly relies on the generalization ability of pre-trained vision-language models, e.g., CLIP. However, there is a substantial discrepancy between pre-trained vision-language models and the CIR task: the vision-language models focus on learning similarities, whereas CIR aims to learn the modifications of the image guided by text. In this paper, we introduce a novel unlabeled and pre-trained masked tuning approach, which reduces the gap between the pre-trained vision-language model and the downstream CIR task. First, to reduce the gap, we reformulate the contrastive learning of the vision-language model as the CIR task, where we randomly mask input image patches to generate a $\langle$masked image, text, image$\rangle$ triplet from an image-text pair. Then, we propose a simple but novel pre-trained masked tuning method, which uses the text and the masked image to learn the modifications of the original image. With such a simple design, the proposed masked tuning can learn to better capture fine-grained text-guided modifications. Extensive experimental results demonstrate the significant superiority of our approach over the baseline models on four ZS-CIR datasets, including FashionIQ, CIRR, CIRCO, and GeneCIS. Our code is available at https://github.com/Chen-Junyang-cn/PLI
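
The triplet construction described in the abstract can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation (see the official repository for that): the function names `mask_patches`, `make_triplet`, and `info_nce`, the patch size, the mask ratio, the zero fill value, and the temperature of 0.07 are all assumptions chosen for the example. The abstract also does not say how the masked image and the text are fused into a query, so the loss below simply takes already-computed, L2-normalised query and target vectors.

```python
import math
import random


def mask_patches(image, patch_size, mask_ratio, rng, fill=0.0):
    """Return a copy of `image` (H x W nested list) with random patches set to `fill`.

    The image is split into non-overlapping patch_size x patch_size patches and
    round(mask_ratio * number_of_patches) of them are masked.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    patches = [(r, c)
               for r in range(height // patch_size)
               for c in range(width // patch_size)]
    masked = rng.sample(patches, round(mask_ratio * len(patches)))
    out = [row[:] for row in image]  # copy so the original stays intact
    for r, c in masked:
        for y in range(r * patch_size, (r + 1) * patch_size):
            for x in range(c * patch_size, (c + 1) * patch_size):
                out[y][x] = fill
    return out


def make_triplet(image, text, patch_size=2, mask_ratio=0.5, seed=0):
    """Turn an image-text pair into a (masked image, text, image) triplet."""
    rng = random.Random(seed)
    return mask_patches(image, patch_size, mask_ratio, rng), text, image


def info_nce(queries, targets, temperature=0.07):
    """Mean contrastive (InfoNCE) loss where queries[i] should match targets[i].

    Vectors are assumed to be L2-normalised; the temperature is an assumption.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [sum(a * b for a, b in zip(query, target)) / temperature
                  for target in targets]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_denominator - logits[i]
    return total / len(queries)
```

In this sketch the original image plays the role of the retrieval target, and the masked image together with the text plays the role of the composed query, which mirrors the query format used at inference time. In practice the nested lists would be replaced by tensors and the vectors would come from the vision-language encoders.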

Code Repositories

Chen-Junyang-cn/PLI (official; PyTorch; mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| zero-shot-composed-image-retrieval-zs-cir-on | MTCIR (BLIP B/16) | mAP@10: 8.03 |
| zero-shot-composed-image-retrieval-zs-cir-on | MTCIR (CLIP L/14) | mAP@10: 11.63 |
| zero-shot-composed-image-retrieval-zs-cir-on-1 | MTCIR (CLIP L/14) | R@5: 54.58 |
| zero-shot-composed-image-retrieval-zs-cir-on-1 | MTCIR (BLIP B/16) | R@5: 58.87 |
| zero-shot-composed-image-retrieval-zs-cir-on-2 | MTCIR (CLIP L/14) | (Recall@10+Recall@50)/2: 46.42 |
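
The last metric listed above, (Recall@10+Recall@50)/2, is the mean of two recall values. As a rough sketch of how such a number could be computed, assuming one ground-truth target per query and a ranked list of candidate IDs per query (the helper names `recall_at_k` and `average_recall` are hypothetical, not taken from any benchmark toolkit):

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Percentage of queries whose target appears in the top-k ranked results."""
    hits = sum(1 for ranked, target in zip(ranked_ids, target_ids)
               if target in ranked[:k])
    return 100.0 * hits / len(target_ids)


def average_recall(ranked_ids, target_ids, ks=(10, 50)):
    """Mean of Recall@k over the given cut-offs, e.g. (Recall@10 + Recall@50) / 2."""
    return sum(recall_at_k(ranked_ids, target_ids, k) for k in ks) / len(ks)
```

Calling `average_recall(rankings, targets)` with the default cut-offs gives the averaged figure in the same percentage units as the table.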
