Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval

Hao Li; Jingkuan Song; Lianli Gao; Xiaosu Zhu; Heng Tao Shen

Abstract

Cross-modal retrieval methods build similarity relations between the vision and language modalities by jointly learning a common representation space. However, their predictions are often unreliable because of aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework that provides trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity. Concretely, we first construct a set of diverse learnable prototypes for each modality to represent the entire semantic subspace. Then, Dempster-Shafer Theory and Subjective Logic Theory are used to build an evidential theoretical framework by associating evidence with Dirichlet distribution parameters. The PAU model yields accurate uncertainty estimates and reliable predictions for cross-modal retrieval. Extensive experiments on four major benchmark datasets, MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrate the effectiveness of our method. The code is available at https://github.com/leolee99/PAU.
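The link between evidence, Dirichlet parameters, and uncertainty mentioned in the abstract can be illustrated with the standard Subjective Logic mapping. The sketch below is not taken from the PAU code: the function name `subjective_opinion` is hypothetical, and it assumes that non-negative evidence over K prototypes has already been computed (the abstract does not say how PAU derives it). Under that assumption, the Dirichlet parameters are alpha_k = e_k + 1, the belief masses are b_k = e_k / S, and the uncertainty mass is u = K / S, where S is the sum of all alpha_k.

```python
def subjective_opinion(evidence):
    """Map non-negative evidence over K classes to belief masses and uncertainty.

    Standard Subjective Logic mapping (an illustration, not the PAU implementation):
      alpha_k = e_k + 1,  S = sum(alpha),  b_k = e_k / S,  u = K / S
    so that sum(b) + u == 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]        # Dirichlet concentration parameters
    strength = sum(alpha)                      # Dirichlet strength S
    belief = [e / strength for e in evidence]  # belief mass per class
    uncertainty = k / strength                 # mass left unassigned to any class
    return belief, uncertainty


# Strong evidence for one prototype gives low uncertainty;
# no evidence at all gives maximal uncertainty (u = 1).
confident_belief, confident_u = subjective_opinion([9.0, 0.0, 0.0])
vacuous_belief, vacuous_u = subjective_opinion([0.0, 0.0, 0.0])
print(confident_belief, confident_u)
print(vacuous_belief, vacuous_u)
```

In this formulation, a sample whose evidence is spread thinly or is absent receives a large uncertainty mass, which is the kind of signal an uncertainty-aware retrieval model could use to flag ambiguous inputs.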

Code Repositories

leolee99/pau (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Direction | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
|---|---|---|---|---|---|---|---|
| video-retrieval-on-didemo | PAU | text-to-video | 48.6 | 76.0 | 84.5 | 2.0 | 12.9 |
| video-retrieval-on-didemo | PAU | video-to-text | 48.1 | 74.2 | 85.7 | 2.0 | 9.8 |
| video-retrieval-on-msr-vtt-1ka | PAU | text-to-video | 48.5 | 72.7 | 82.5 | 2.0 | 14.0 |
| video-retrieval-on-msr-vtt-1ka | PAU | video-to-text | 48.3 | 73.0 | 83.2 | 2.0 | 9.7 |
| video-retrieval-on-msvd | PAU | text-to-video | 47.3 | 77.4 | 85.5 | 2.0 | 9.6 |
| video-retrieval-on-msvd | PAU | video-to-text | 68.9 | 93.1 | 97.1 | 1.0 | 2.4 |
