Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation

Zhuohang Dang, Minnan Luo, Chengyou Jia, Guang Dai, Xiaojun Chang, Jingdong Wang

Abstract

Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to build in practice. Recently, to alleviate expensive data collection, co-occurring pairs from the Internet are automatically harvested for training. However, this inevitably includes mismatched pairs, i.e., noisy correspondences, undermining supervision reliability and degrading performance. Current methods leverage deep neural networks' memorization effect to address noisy correspondences, but they overconfidently focus on similarity-guided training with hard negatives and suffer from self-reinforcing errors. In light of the above, we introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM). Specifically, by viewing sample matching as a classification task within the batch, we generate classification logits for the given sample. Instead of a single similarity score, we refine sample filtration through energy uncertainty and estimate the model's sensitivity to selected clean samples using swapped classification entropy, in view of the overall prediction distribution. Additionally, we propose cross-modal biased complementary learning to leverage negative matches overlooked in hard-negative training, further improving model optimization stability and curbing self-reinforcing errors. Extensive experiments on challenging benchmarks affirm the efficacy and efficiency of SREM.
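The abstract names three ingredients: in-batch classification logits for each pair, energy-based uncertainty for sample filtration, and swapped classification entropy as a sensitivity estimate. The minimal sketch below shows how such quantities could be computed from image and text embeddings; the function names, the temperature value, and the median-energy threshold are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def in_batch_logits(img_emb, txt_emb, temperature=0.05):
    """Treat matching inside a batch as classification: row i scores image i
    against every caption in the batch (temperature is an assumed hyperparameter)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    return img_emb @ txt_emb.t() / temperature

def energy_uncertainty(logits):
    """Free energy of the in-batch prediction distribution; lower energy means a
    more confident prediction, so it can serve as a filtration signal."""
    return -torch.logsumexp(logits, dim=-1)

def swapped_entropy(logits_i2t, logits_t2i):
    """Average prediction entropy over both retrieval directions, used here as a
    proxy for the model's sensitivity to a selected sample."""
    def entropy(logits):
        p = F.softmax(logits, dim=-1)
        return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)
    return 0.5 * (entropy(logits_i2t) + entropy(logits_t2i))

# Illustrative usage on random embeddings: keep pairs whose energy is below
# the batch median (the threshold choice is an assumption for this sketch).
img_emb, txt_emb = torch.randn(8, 256), torch.randn(8, 256)
logits_i2t = in_batch_logits(img_emb, txt_emb)
energy = energy_uncertainty(logits_i2t)
clean_mask = energy < energy.median()
sensitivity = swapped_entropy(logits_i2t, logits_i2t.t())
```

In this reading, the energy and entropy scores replace a single similarity value when deciding which pairs are treated as clean during training.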

Benchmarks

Benchmark: cross-modal-retrieval-with-noisy-1 | Methodology: SREM
  Image-to-text: R@1 40.9, R@5 67.5, R@10 77.1
  Text-to-image: R@1 41.5, R@5 68.2, R@10 77.0
  R-Sum: 372.2

Benchmark: cross-modal-retrieval-with-noisy-2 | Methodology: SREM
  Image-to-text: R@1 79.5, R@5 94.2, R@10 97.9
  Text-to-image: R@1 61.2, R@5 84.8, R@10 90.2
  R-Sum: 507.8

Benchmark: cross-modal-retrieval-with-noisy-3 | Methodology: SREM
  Image-to-text: R@1 78.5, R@5 96.8, R@10 98.8
  Text-to-image: R@1 63.8, R@5 90.4, R@10 95.8
  R-Sum: 524.1
