5 months ago

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

Ziyang Wang; Yi-Lin Sung; Feng Cheng; Gedas Bertasius; Mohit Bansal

Abstract

The canonical approach to video-text retrieval leverages either a coarse-grained or a fine-grained alignment between visual and textual information. However, retrieving the correct video for a text query is often challenging, as it requires reasoning about both high-level (scene) and low-level (object) visual clues and how they relate to the query. To this end, we propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA. Specifically, our model captures cross-modal similarity information at different granularity levels. To alleviate the effect of irrelevant visual clues, we also apply an Interactive Similarity Aggregation (ISA) module, which weighs the importance of different visual features while aggregating the cross-modal similarity into a single similarity score for each granularity. Finally, we apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them, alleviating over- and under-representation issues at different levels. By jointly considering the cross-modal similarity at different granularities, UCoFiA effectively unifies multi-grained alignments. Empirically, UCoFiA outperforms previous state-of-the-art CLIP-based methods on multiple video-text retrieval benchmarks, achieving 2.4%, 1.4% and 1.3% improvements in text-to-video retrieval R@1 on MSR-VTT, Activity-Net, and DiDeMo, respectively. Our code is publicly available at https://github.com/Ziyang412/UCoFiA.
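The normalize-then-sum step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `sinkhorn_knopp` and `unify`, the exponentiation of raw similarities, the `temperature` parameter, and the fixed iteration count are all assumptions made for illustration, and the paper's exact formulation may differ. The sketch alternately rescales the rows and columns of each level's text-video similarity matrix so that no level dominates the sum.

```python
import math


def sinkhorn_knopp(sim, n_iters=50, temperature=1.0):
    """Alternately rescale rows and columns of exp(sim / temperature).

    `sim` is a list of rows (texts) by columns (videos). The exponentiation
    is an assumption made here so that every entry is strictly positive.
    """
    mat = [[math.exp(s / temperature) for s in row] for row in sim]
    n_rows, n_cols = len(mat), len(mat[0])
    for _ in range(n_iters):
        # Row step: make each row sum to 1.
        for i in range(n_rows):
            total = sum(mat[i])
            mat[i] = [v / total for v in mat[i]]
        # Column step: make each column sum to 1.
        for j in range(n_cols):
            total = sum(mat[i][j] for i in range(n_rows))
            for i in range(n_rows):
                mat[i][j] /= total
    return mat


def unify(levels, n_iters=50):
    """Normalize each granularity's similarity matrix, then sum element-wise."""
    normalized = [sinkhorn_knopp(sim, n_iters) for sim in levels]
    n_rows, n_cols = len(levels[0]), len(levels[0][0])
    return [
        [sum(m[i][j] for m in normalized) for j in range(n_cols)]
        for i in range(n_rows)
    ]
```

For a square matrix with positive entries, the iteration converges toward a matrix whose rows and columns each sum to 1, which puts the coarse-grained and fine-grained scores on a comparable scale before they are added.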

Code Repositories

ziyang412/ucofia (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
video-retrieval-on-msr-vtt | UCoFiA | text-to-video R@1: 49.4; text-to-video R@5: 72.1; text-to-video R@10: 83.5
video-retrieval-on-msr-vtt-1ka | UCoFiA | text-to-video R@1: 49.4; text-to-video R@5: 72.1; text-to-video R@10: 83.5; video-to-text R@1: 47.1; video-to-text R@5: 74.3; video-to-text R@10: 83.0
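The R@K figures above are recall-at-K scores. As a rough illustration of how such a number is computed from a similarity matrix, here is a minimal sketch. The function name `recall_at_k` is hypothetical, and the sketch assumes that `sim[i][j]` scores text query `i` against video `j` and that the correct video for query `i` sits at index `i`; the benchmark's actual evaluation code may handle ties and multiple captions differently.

```python
def recall_at_k(sim, k):
    """Percentage of text queries whose matching video ranks in the top k.

    Assumes sim[i][j] scores text i against video j, and that the correct
    video for text i is video i (a one-to-one pairing).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank is the number of videos scoring strictly higher than the match.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)
```

Video-to-text recall follows the same pattern with the matrix transposed, so that each row holds one video's scores against all texts.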
