Visual Reasoning On Winoground

Metrics

Group Score

Image Score

Text Score

Results

Performance results of various models on this benchmark

				Paper Title	Repository
GPT-4V (CoT, pick b/w two options)	58.75	68.75	75.25	The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task	-
GPT-4V (pick b/w two options)	39.25	46.25	69.25	The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task	-
MMICL + CoCoT	50.75	52.5	64.25	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
GPT-4V + CoCoT	44.5	49.5	58.5	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
OpenFlamingo + CoCoT	41.5	55.25	58.25	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
GPT-4V	37.75	42.5	54.5	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
FIBER (EqSim)	27.5	32.00	51.5	Equivariant Similarity for Vision-Language Foundation Models
FIBER (finetuned, Flickr30k)	23.00	26.50	51.25	Equivariant Similarity for Vision-Language Foundation Models
MMICL + CCoT	47.5	48	51	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
OpenFlamingo + DDCoT	39	47.25	47.5	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
VQ2	30.5	42.2	47	What You See is What You Read? Improving Text-Image Alignment Evaluation
MMICL + DDCoT	36.75	45	46.75	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
X-VLM 16M	21.2	24.5	46.7	Measuring Progress in Fine-grained Vision-and-Language Understanding
PaLI (ft SNLI-VE + Synthetic Data)	28.75	38	46.5	What You See is What You Read? Improving Text-Image Alignment Evaluation
FIBER	22.25	25.75	46.25	Equivariant Similarity for Vision-Language Foundation Models
MMICL (FLAN-T5-XXL)	43.00	44.99	45.50	MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
METER (EqSim)	18.75	22.75	45.0	Equivariant Similarity for Vision-Language Foundation Models
PaLI (ft SNLI-VE)	28.70	41.50	45.00	What You See is What You Read? Improving Text-Image Alignment Evaluation
Gemini + DDCoT	23.75	25	45	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
X-VLM 4M	21.5	26.7	44.0	Measuring Progress in Fine-grained Vision-and-Language Understanding

0 of 113 row(s) selected.

Command Palette

Visual Reasoning On Winoground

Metrics

Results