Cao Min, Bai Yang, Zeng Ziyin, Ye Mang, Zhang Min

Abstract
Text-based Person Search (TBPS) aims to retrieve person images using natural language descriptions. Recently, Contrastive Language-Image Pre-training (CLIP), a universal large cross-modal vision-language pre-training model, has performed remarkably across various cross-modal downstream tasks due to its powerful cross-modal semantic learning capacity. TBPS, as a fine-grained cross-modal retrieval task, is likewise seeing a rise in research on CLIP-based TBPS. To explore the potential of the vision-language pre-training model for downstream TBPS tasks, this paper makes the first attempt to conduct a comprehensive empirical study of CLIP for TBPS and thus contributes a straightforward, incremental, yet strong TBPS-CLIP baseline to the TBPS community. We revisit critical design considerations under CLIP, including data augmentation and loss function. With these designs and practical training tricks, the model attains satisfactory performance without any sophisticated modules. We also conduct probing experiments on TBPS-CLIP in model generalization and model compression, demonstrating its effectiveness from various aspects. This work is expected to provide empirical insights and highlight future CLIP-based TBPS research.
Benchmarks
| Benchmark | Methodology | R@1 | R@5 | R@10 | mAP |
|---|---|---|---|---|---|
| nlp-based-person-retrival-on-cuhk-pedes | TBPS-CLIP (ViT-B/16) | 73.54 | 88.19 | 92.35 | 65.38 |
| text-based-person-retrieval-on-icfg-pedes | TBPS-CLIP (ViT-B/16) | 65.05 | 80.34 | 85.47 | 39.83 |
| text-based-person-retrieval-on-rstpreid-1 | TBPS-CLIP (ViT-B/16) | 61.95 | 83.55 | 88.75 | 48.26 |