Alex Jinpeng Wang; Pan Zhou; Mike Zheng Shou; Shuicheng Yan

Abstract
Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into $N\times N$ blocks and identifies the objects in each block using the object detectors widely adopted in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP, encouraging the model to predict the objects in a given block or regress the block of a given object, e.g. filling ``P'' or ``O'' in a PTP ``The block [P] has a [O]''. This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for the ViLT \cite{vilt} baseline, and COCO Captioning (+5.3 in CIDEr) for the SOTA BLIP \cite{blip} baseline. Moreover, PTP achieves results comparable to object-detector-based methods with much faster inference, since PTP discards the object detector at inference time while the latter cannot. Our code and pre-trained weights will be released at \url{https://github.com/sail-sg/ptp}.
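To make the prompt construction concrete, below is a minimal sketch (not the authors' released code) of how detector outputs could be mapped onto an $N\times N$ grid and turned into fill-in-the-blank prompts of the form described in the abstract. The function names, the use of object centers for block assignment, and the exact prompt template are illustrative assumptions.

```python
from typing import List, Tuple

def assign_block(cx: float, cy: float, width: int, height: int, n: int = 3) -> int:
    """Map an object's center (cx, cy) in pixel coordinates to a block index
    in an n x n grid laid over the image (row-major, 0-indexed)."""
    col = min(int(cx / width * n), n - 1)
    row = min(int(cy / height * n), n - 1)
    return row * n + col

def build_ptp_prompts(
    detections: List[Tuple[str, float, float]],  # (object label, center x, center y)
    width: int,
    height: int,
    n: int = 3,
) -> List[str]:
    """Turn detector outputs into position-guided text prompts of the form
    'The block P has a O', which the VLP model completes by predicting
    either the block index P or the object label O."""
    prompts = []
    for label, cx, cy in detections:
        block = assign_block(cx, cy, width, height, n)
        prompts.append(f"The block {block} has a {label}.")
    return prompts

# Example: two detected objects on a 600x400 image with a 3x3 grid.
dets = [("dog", 100.0, 350.0), ("frisbee", 450.0, 80.0)]
print(build_ptp_prompts(dets, width=600, height=400, n=3))
# -> ['The block 6 has a dog.', 'The block 2 has a frisbee.']
```

During pre-training, either the block index or the object label in such a prompt would be masked so the model learns to recover it; at inference time the detector and prompts are no longer needed.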
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Cross-Modal Retrieval on COCO 2014 | PTP-BLIP (14M) | Image-to-text R@1: 81.5, R@5: 95.9, R@10: 97.9; Text-to-image R@1: 64.9, R@5: 87.4, R@10: 92.2 |
| Image Captioning on COCO Captions | PTP-BLIP (14M) | BLEU-4: 40.1, CIDEr: 135.0, METEOR: 30.4, SPICE: 23.7 |
| Zero-Shot Cross-Modal Retrieval on COCO 2014 | PTP-BLIP | Image-to-text R@1: 69.7, R@5: 90.0, R@10: 94.7; Text-to-image R@1: 49.5, R@5: 75.9, R@10: 84.2 |
| Zero-Shot Cross-Modal Retrieval on Flickr30K | PTP-BLIP (14M) | Image-to-text R@1: 87.1, R@5: 98.4, R@10: 99.3; Text-to-image R@1: 73.1, R@5: 91.0, R@10: 94.8 |