METER (finetuned, Flickr30k) | 14.75 | 20.75 | 43.50 | Equivariant Similarity for Vision-Language Foundation Models | |
GPT-4V (CoT, pick between two options) | 58.75 | 68.75 | 75.25 | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | - |
CoCa ViT-L14 (finetuned, COCO) | 8.25 | 11.50 | 28.25 | What You See is What You Read? Improving Text-Image Alignment Evaluation | |
Diffusion Classifier (zero-shot) | - | - | 34.00 | Your Diffusion Model is Secretly a Zero-Shot Classifier | |