Xiaofeng Mao, Yuefeng Chen, Xiaojun Jia, Rong Zhang, Hui Xue, Zhao Li

Abstract
Contrastive Language-Image Pre-trained (CLIP) models can classify an image as belonging to "[CLASS]" in a zero-shot manner by measuring the similarity between the image and the prompt sentence "a [CONTEXT] of [CLASS]". Because of the exhaustive text cues in "[CONTEXT]", CLIP is aware of different contexts, e.g. background, style, and viewpoint, and exhibits unprecedented robustness against a wide range of distribution shifts. However, recent work finds that further fine-tuning of CLIP models improves accuracy but sacrifices robustness on downstream tasks. We conduct an empirical investigation showing that fine-tuning corrupts the context-aware ability of pre-trained CLIP features. To solve this problem, we propose Context-Aware Robust Fine-tuning (CAR-FT), which regularizes the model during fine-tuning so that it continues to capture context information. Specifically, we use zero-shot prompt weights to obtain the context distribution contained in the image. By minimizing the Kullback-Leibler Divergence (KLD) between the context distributions induced by the original and fine-tuned CLIP models, CAR-FT lets downstream tasks inherit the context-aware ability of CLIP and achieves higher accuracy both In-Distribution (ID) and Out-Of-Distribution (OOD). Experimental results show that CAR-FT achieves superior robustness on five OOD test datasets of ImageNet while also bringing accuracy gains on nine downstream tasks. Additionally, CAR-FT surpasses previous Domain Generalization (DG) methods, reaching 78.5% average accuracy on the DomainBed benchmark and setting a new state of the art.
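The regularizer described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that image features and zero-shot context prompt weights are already available as arrays, and the function names, the temperature, the KL direction (original model as reference), and the weighting factor `alpha` are all hypothetical choices made for illustration only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def context_distribution(image_features, context_weights, temperature=0.01):
    """Distribution over contexts for each image.

    image_features:  (batch, dim) features from an image encoder.
    context_weights: (num_contexts, dim) zero-shot weights built from context prompts.
    temperature:     assumed softmax temperature (not taken from the source).
    """
    logits = l2_normalize(image_features) @ l2_normalize(context_weights).T
    return softmax(logits / temperature)


def context_kl(p_ref, p_ft, eps=1e-12):
    """Batch mean of KL(p_ref || p_ft); the direction is an assumption."""
    per_sample = np.sum(p_ref * (np.log(p_ref + eps) - np.log(p_ft + eps)), axis=-1)
    return float(np.mean(per_sample))


def car_ft_style_loss(task_loss, feats_original, feats_finetuned,
                      context_weights, alpha=1.0):
    """Task loss plus a context-preserving KL penalty (alpha is hypothetical)."""
    p_ref = context_distribution(feats_original, context_weights)
    p_ft = context_distribution(feats_finetuned, context_weights)
    return task_loss + alpha * context_kl(p_ref, p_ft)
```

In this sketch the penalty is zero when the fine-tuned encoder produces the same context distribution as the original encoder, and it grows as the two distributions drift apart. In a real training loop the features of the original model would be computed without gradients, and an autodiff framework would replace NumPy.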
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| domain-generalization-on-domainnet | CAR-FT (CLIP, ViT-B/16) | Average Accuracy: 62.5 |
| domain-generalization-on-imagenet-a | CAR-FT (CLIP, ViT-L/14@336px) | Top-1 accuracy %: 81.5 |
| domain-generalization-on-imagenet-r | CAR-FT (CLIP, ViT-L/14@336px) | Top-1 Error Rate: 10.3 |
| domain-generalization-on-imagenet-sketch | CAR-FT (CLIP, ViT-L/14@336px) | Top-1 accuracy: 65.5 |
| domain-generalization-on-office-home | CAR-FT (CLIP, ViT-B/16) | Average Accuracy: 85.7 |
| domain-generalization-on-pacs-2 | CAR-FT (CLIP, ViT-B/16) | Average Accuracy: 96.8 |
| domain-generalization-on-terraincognita | CAR-FT (CLIP, ViT-B/16) | Average Accuracy: 61.9 |
| domain-generalization-on-vlcs | CAR-FT (CLIP, ViT-B/16) | Average Accuracy: 85.5 |