Guo Zixian, Dong Bowen, Ji Zhilong, Bai Jinfeng, Guo Yiwen, Zuo Wangmeng

Abstract
Prompt tuning has been employed as an efficient way to adapt large vision-language pre-trained models (e.g., CLIP) to various downstream tasks in data-limited or label-limited settings. Nonetheless, visual data (e.g., images) is by default a prerequisite for learning prompts in existing methods. In this work, we advocate that the effectiveness of image-text contrastive learning in aligning the two modalities (for training CLIP) further makes it feasible to treat texts as images for prompt tuning, and we introduce TaI prompting. In contrast to visual data, text descriptions are easy to collect, and their class labels can be directly derived. In particular, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning. Moreover, with TaI, double-grained prompt tuning (TaI-DPT) is further presented to extract both coarse-grained and fine-grained embeddings for enhancing multi-label recognition performance. Experimental results show that our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks, e.g., MS-COCO, VOC2007, and NUS-WIDE, while it can be combined with existing methods of prompting from images to further improve recognition performance. Code is released at https://github.com/guozix/TaI-DPT.
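To make the core idea concrete, below is a minimal PyTorch sketch (not the authors' released implementation) of CoOp-style learnable prompts trained with text-as-image supervision: wild captions stand in for images, and multi-label targets are derived directly from class-name occurrence in the captions. The `classnames`, `captions`, and the binary cross-entropy loss are illustrative assumptions; the paper derives labels more carefully and uses a ranking-style loss, and the fine-grained (double-grained) branch is omitted here.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50", device=device)
model = model.float()
for p in model.parameters():          # CLIP stays frozen; only the context vectors are learned
    p.requires_grad_(False)

classnames = ["dog", "person", "bicycle"]   # hypothetical label set
n_ctx = 16                                  # number of learnable context tokens

# Tokenize "X X ... X classname." templates so we know where the class tokens sit.
prompt_prefix = " ".join(["X"] * n_ctx)
prompts = [prompt_prefix + " " + name + "." for name in classnames]
tokenized = torch.cat([clip.tokenize(p) for p in prompts]).to(device)   # (C, 77)

with torch.no_grad():
    embedding = model.token_embedding(tokenized)                        # (C, 77, dim)

# Learnable context vectors shared across classes (CoOp-style).
ctx_dim = model.ln_final.weight.shape[0]
ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim, device=device) * 0.02)

prefix = embedding[:, :1, :]           # SOS token embedding
suffix = embedding[:, 1 + n_ctx:, :]   # class tokens, ".", EOS, padding

def encode_prompts():
    """Run the learnable prompts through CLIP's frozen text transformer."""
    x = torch.cat([prefix, ctx.unsqueeze(0).expand(len(classnames), -1, -1), suffix], dim=1)
    x = x + model.positional_embedding
    x = x.permute(1, 0, 2)             # NLD -> LND
    x = model.transformer(x)
    x = x.permute(1, 0, 2)
    x = model.ln_final(x)
    eos = tokenized.argmax(dim=-1)     # EOS has the highest token id
    x = x[torch.arange(x.shape[0]), eos] @ model.text_projection
    return x / x.norm(dim=-1, keepdim=True)

# Text-as-Image step: captions replace images; labels come from class-name matching.
captions = ["a dog chasing a person riding a bicycle", "a person walking a dog"]
targets = torch.tensor(
    [[float(name in c) for name in classnames] for c in captions], device=device)

with torch.no_grad():
    cap_feat = model.encode_text(clip.tokenize(captions).to(device))
    cap_feat = cap_feat / cap_feat.norm(dim=-1, keepdim=True)

optimizer = torch.optim.SGD([ctx], lr=2e-3)
logits = model.logit_scale.exp() * cap_feat @ encode_prompts().t()      # (B, C)
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)  # simplified surrogate loss
loss.backward()
optimizer.step()
```

At test time the same learned prompts score image embeddings from CLIP's visual encoder instead of caption embeddings, which is what makes treating texts as images feasible: the contrastively trained encoders place the two modalities in a shared embedding space.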
Code Repositories
https://github.com/guozix/TaI-DPT
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| multi-label-image-recognition-with-partial | DualCoOp+TaI-DPT | Average mAP: 83.6 |
| multi-label-image-recognition-with-partial-1 | DualCoOp+TaI-DPT | Average mAP: 94.8 |