Menglin Jia Luming Tang Bor-Chun Chen Claire Cardie Serge Belongie Bharath Hariharan Ser-Nam Lim

Abstract
The current modus operandi in adapting pre-trained models involves updating all the backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small number of trainable parameters (less than 1% of the model's parameters) in the input space while keeping the model backbone frozen. Via extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains compared to other parameter-efficient tuning protocols. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training data scales, while reducing per-task storage cost.
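The core mechanism the abstract describes, prepending a small set of trainable prompt vectors to the frozen backbone's input sequence, can be sketched in a few lines. This is a minimal NumPy illustration of the VPT-Shallow variant, not the authors' implementation; the dimensions assume ViT-B/16 on 224x224 images, and the prompt count (50) is an arbitrary choice for the example:

```python
import numpy as np

def prepend_visual_prompts(cls_token, prompts, patch_embeddings):
    """Sketch of VPT-Shallow: insert learnable prompt tokens between the
    CLS token and the patch embeddings. Only `prompts` (plus a task head)
    would be trained; the Transformer backbone stays frozen."""
    # cls_token: (1, dim), prompts: (num_prompts, dim),
    # patch_embeddings: (num_patches, dim)
    return np.concatenate([cls_token, prompts, patch_embeddings], axis=0)

rng = np.random.default_rng(0)
dim, num_patches, num_prompts = 768, 196, 50  # ViT-B/16 at 224x224 (assumption)

cls = rng.standard_normal((1, dim))
patches = rng.standard_normal((num_patches, dim))
prompts = np.zeros((num_prompts, dim))  # the only new trainable parameters

tokens = prepend_visual_prompts(cls, prompts, patches)
print(tokens.shape)  # (247, 768): 1 CLS + 50 prompts + 196 patches
```

With these example numbers, the prompts add 50 × 768 ≈ 38K parameters, well under 1% of ViT-B's ~86M, which is the storage argument the abstract makes. VPT-Deep extends this idea by inserting a fresh set of prompts at the input of every Transformer layer rather than only the first.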
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| long-tail-learning-on-cifar-100-lt-r-10 | VPT | Error Rate: 10.4 |
| long-tail-learning-on-cifar-100-lt-r-100 | VPT | Error Rate: 19.0 |
| long-tail-learning-on-cifar-100-lt-r-50 | VPT | Error Rate: 15.2 |
| prompt-engineering-on-imagenet-21k | VPT | Accuracy: 24.8 |
| visual-prompt-tuning-on-fgvc | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 72.02 |
| visual-prompt-tuning-on-fgvc | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 57.84 |
| visual-prompt-tuning-on-fgvc | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 83.12 |
| visual-prompt-tuning-on-fgvc | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 79.26 |
| visual-prompt-tuning-on-vtab-1k-natural-7 | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 67.34 |
| visual-prompt-tuning-on-vtab-1k-natural-7 | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 39.96 |
| visual-prompt-tuning-on-vtab-1k-natural-7 | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 70.27 |
| visual-prompt-tuning-on-vtab-1k-natural-7 | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 36.02 |
| visual-prompt-tuning-on-vtab-1k-specialized-4 | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 60.61 |
| visual-prompt-tuning-on-vtab-1k-specialized-4 | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 69.65 |
| visual-prompt-tuning-on-vtab-1k-specialized-4 | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 83.04 |
| visual-prompt-tuning-on-vtab-1k-specialized-4 | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 82.26 |
| visual-prompt-tuning-on-vtab-1k-structured-8 | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 37.55 |
| visual-prompt-tuning-on-vtab-1k-structured-8 | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 42.38 |
| visual-prompt-tuning-on-vtab-1k-structured-8 | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 26.57 |
| visual-prompt-tuning-on-vtab-1k-structured-8 | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 27.50 |