TagAlign (trained with image-text pairs) | 37.6 | TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification | |
CLIP Surgery (original CLIP without any fine-tuning) | 29.3 | A Closer Look at the Explainability of Contrastive Language-Image Pre-training | |