Ying Zhang, Huchuan Lu

Abstract
The key challenge in image-text matching is accurately measuring the similarity between visual and textual inputs. Despite the great progress made by associating deep cross-modal embeddings with a bi-directional ranking loss, devising strategies for mining useful triplets and selecting appropriate margins remains difficult in real applications. In this paper, we propose a cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC) loss for learning discriminative image-text embeddings. The CMPM loss minimizes the KL divergence between the projection compatibility distributions and the normalized matching distributions defined over all the positive and negative samples in a mini-batch. The CMPC loss categorizes the vector projection of representations from one modality onto the other with an improved norm-softmax loss, further enhancing the feature compactness of each class. Extensive analysis and experiments on multiple datasets demonstrate the superiority of the proposed approach.
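To make the CMPM objective in the abstract concrete, the following is a minimal PyTorch sketch: it softmax-normalizes the scalar projections between the two modalities to obtain compatibility distributions, builds the normalized matching distribution from identity labels within the mini-batch, and takes the KL divergence between the two. The function name `cmpm_loss`, the use of identity labels to define matches, and the symmetric image-to-text / text-to-image formulation are assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def cmpm_loss(image_embs, text_embs, labels, eps=1e-8):
    """Sketch of a cross-modal projection matching (CMPM) loss.

    image_embs, text_embs: (n, d) mini-batch embeddings from the two branches.
    labels: (n,) identity labels; pairs sharing a label are treated as matches.
    """
    # Normalized matching distribution q: uniform over the matched pairs in each row.
    match = (labels.unsqueeze(1) == labels.unsqueeze(0)).float()     # (n, n)
    q = match / match.sum(dim=1, keepdim=True)

    # Image-to-text: scalar projection of each image embedding onto the
    # unit-norm text embeddings, turned into a compatibility distribution.
    text_dir = F.normalize(text_embs, p=2, dim=1)
    p_i2t = F.softmax(image_embs @ text_dir.t(), dim=1)              # (n, n)
    loss_i2t = (p_i2t * (torch.log(p_i2t + eps) - torch.log(q + eps))).sum(dim=1).mean()

    # Symmetric text-to-image direction.
    image_dir = F.normalize(image_embs, p=2, dim=1)
    p_t2i = F.softmax(text_embs @ image_dir.t(), dim=1)
    loss_t2i = (p_t2i * (torch.log(p_t2i + eps) - torch.log(q + eps))).sum(dim=1).mean()

    return loss_i2t + loss_t2i
```

The CMPC counterpart would, analogously, take the vector projection of each embedding onto its matched cross-modal direction and classify it with a weight-normalized (norm-softmax) classifier to tighten each identity's feature cluster.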
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| cross-modal-retrieval-on-flickr30k | CMPL (ResNet) | Image-to-text: R@1 49.6, R@5 76.8, R@10 86.1; Text-to-image: R@1 37.3, R@5 65.7, R@10 75.5 |
| nlp-based-person-retrival-on-cuhk-pedes | CMPM+CMPC | R@1 49.37, R@5 -, R@10 79.27 |