
Visual Reasoning On Winoground

Evaluation Metrics

Group Score
Image Score
Text Score
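
This page lists the three metrics without defining them. The sketch below follows the definitions given in the Winoground paper cited in the results table. Each example pairs two captions (c0, c1) with two images (i0, i1), and a model assigns a similarity score to every caption-image combination. The dictionary keys (`c0_i0` and so on) and the function names are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

# Similarity scores for one Winoground example. Key "c0_i1" holds the model's
# score for caption 0 paired with image 1. (Key names are illustrative only.)
Scores = Dict[str, float]


def text_correct(s: Scores) -> bool:
    # For each image, the matching caption must outscore the other caption.
    return s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]


def image_correct(s: Scores) -> bool:
    # For each caption, the matching image must outscore the other image.
    return s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]


def group_correct(s: Scores) -> bool:
    # An example counts for the group score only if both checks pass.
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: List[Scores]) -> Dict[str, float]:
    # Report each metric as a percentage of examples, as on the leaderboard.
    if not examples:
        raise ValueError("need at least one example")
    n = len(examples)
    return {
        "group": 100.0 * sum(group_correct(s) for s in examples) / n,
        "image": 100.0 * sum(image_correct(s) for s in examples) / n,
        "text": 100.0 * sum(text_correct(s) for s in examples) / n,
    }
```

Because the group check requires both of the other checks to pass, a model's Group Score can never exceed its Image Score or its Text Score.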

Evaluation Results

Performance of each model on this benchmark

| Model Name | Group Score | Image Score | Text Score | Paper Title | Repository |
|---|---|---|---|---|---|
| ViLBERT base | 4.75 | 7.25 | 23.75 | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | |
| METER (finetuned, Flickr30k) | 14.75 | 20.75 | 43.5 | Equivariant Similarity for Vision-Language Foundation Models | |
| BLIP (ITM) | 13.3 | 15.8 | 35.8 | Revisiting the Role of Language Priors in Vision-Language Models | |
| BLIP2 (SGVL) | 23.3 | 28.5 | 42.8 | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | - |
| Gemini + CoCoT | 27.75 | 32.5 | 40 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| GPT-4V (CoT, pick b/w two options) | 58.75 | 68.75 | 75.25 | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | - |
| OpenFlamingo + CoCoT | 41.5 | 55.25 | 58.25 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| COCA ViT-L14 (f.t on COCO) | 8.25 | 11.50 | 28.25 | What You See is What You Read? Improving Text-Image Alignment Evaluation | |
| OFA large (ITM) | 7.25 | 10.25 | 30.75 | Simple Token-Level Confidence Improves Caption Correctness | - |
| VSE++ (COCO, VGG) | 3.50 | 5.50 | 18.75 | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | |
| METER | 12.00 | 15.75 | 39.25 | Equivariant Similarity for Vision-Language Foundation Models | |
| OpenFlamingo | 33.25 | 41.25 | 39 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| LDM-CLIP (SelfEval) | - | 7.25 | 22.75 | SelfEval: Leveraging the discriminative nature of generative models for evaluation | - |
| CLIP (ViT-L/14) | - | 8.0 | 30.25 | SelfEval: Leveraging the discriminative nature of generative models for evaluation | - |
| BLIP 129M (CapFilt/L) | 12.2 | 15.2 | 34.7 | Measuring Progress in Fine-grained Vision-and-Language Understanding | |
| KeyComp* (GPT-4) | 18.2 | 28.7 | 43.5 | Prompting Large Vision-Language Models for Compositional Reasoning | |
| LLaVA-7B (GPTScore) | 10.50 | 17.00 | 25.50 | An Examination of the Compositionality of Large Generative Vision-Language Models | |
| TIFA | 11.30 | 12.50 | 19.00 | What You See is What You Read? Improving Text-Image Alignment Evaluation | |
| Diffusion Classifier (zero-shot) | - | - | 34.00 | Your Diffusion Model is Secretly a Zero-Shot Classifier | |
| KeyComp* (GPT-3.5) | 17.4 | 27.8 | 42.7 | Prompting Large Vision-Language Models for Compositional Reasoning | |