
Visual Question Answering (VQA) on Core-MM

Metrics

- Abductive: score on the benchmark's abductive-reasoning questions
- Analogical: score on the analogical-reasoning questions
- Deductive: score on the deductive-reasoning questions
- Overall score: aggregate score across the three reasoning categories
- Params: model parameter count, in billions

Results

Performance of various models on the Core-MM benchmark

| Model Name | Abductive | Analogical | Deductive | Overall score | Params | Paper Title | Repository |
|---|---|---|---|---|---|---|---|
| MiniGPT-v2 | 13.28 | 5.69 | 11.02 | 10.43 | 8B | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | - |
| BLIP-2-OPT2.7B | 18.96 | 7.5 | 2.76 | 19.31 | 3B | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | - |
| GPT-4V | 77.88 | 69.86 | 74.86 | 74.44 | - | GPT-4 Technical Report | - |
| SPHINX v2 | 49.85 | 20.69 | 42.17 | 39.48 | 16B | SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | - |
| InstructBLIP | 37.76 | 20.56 | 27.56 | 28.02 | 8B | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | - |
| Emu | 36.57 | 18.19 | 28.9 | 28.24 | 14B | Emu: Generative Pretraining in Multimodality | - |
| Otter | 33.64 | 13.33 | 22.49 | 22.69 | 7B | Otter: A Multi-Modal Model with In-Context Instruction Tuning | - |
| CogVLM-Chat | 47.88 | 28.75 | 36.75 | 37.16 | 17B | CogVLM: Visual Expert for Pretrained Language Models | - |
| mPLUG-Owl2 | 20.6 | 7.64 | 23.43 | 20.05 | 7B | mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | - |
| OpenFlamingo-v2 | 5.3 | 1.11 | 8.88 | 6.82 | 9B | OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | - |
| LLaVA-1.5 | 47.91 | 24.31 | 30.94 | 32.62 | 13B | Improved Baselines with Visual Instruction Tuning | - |
| Qwen-VL-Chat | 44.39 | 30.42 | 37.55 | 37.39 | 16B | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | - |
| LLaMA-Adapter V2 | 46.12 | 22.08 | 28.7 | 30.46 | 7B | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | - |
| InternLM-XComposer-VL | 35.97 | 18.61 | 26.77 | 26.84 | 9B | InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | - |
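For working with these results programmatically, the sketch below re-ranks the rows by their reported overall score. It is a minimal illustration, not an official HyperAI or Core-MM tool; the `Entry` record type and all helper names are hypothetical, and the numbers are copied verbatim from the table above.

```python
# Minimal sketch (illustrative only): rank the leaderboard rows above
# by reported overall score. The Entry record is hypothetical; the
# values are copied verbatim from the table. params_b is the parameter
# count in billions (None where the table shows "-").
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    model: str
    abductive: float
    analogical: float
    deductive: float
    overall: float
    params_b: Optional[float]

ENTRIES = [
    Entry("MiniGPT-v2", 13.28, 5.69, 11.02, 10.43, 8),
    Entry("BLIP-2-OPT2.7B", 18.96, 7.50, 2.76, 19.31, 3),
    Entry("GPT-4V", 77.88, 69.86, 74.86, 74.44, None),
    Entry("SPHINX v2", 49.85, 20.69, 42.17, 39.48, 16),
    Entry("InstructBLIP", 37.76, 20.56, 27.56, 28.02, 8),
    Entry("Emu", 36.57, 18.19, 28.90, 28.24, 14),
    Entry("Otter", 33.64, 13.33, 22.49, 22.69, 7),
    Entry("CogVLM-Chat", 47.88, 28.75, 36.75, 37.16, 17),
    Entry("mPLUG-Owl2", 20.60, 7.64, 23.43, 20.05, 7),
    Entry("OpenFlamingo-v2", 5.30, 1.11, 8.88, 6.82, 9),
    Entry("LLaVA-1.5", 47.91, 24.31, 30.94, 32.62, 13),
    Entry("Qwen-VL-Chat", 44.39, 30.42, 37.55, 37.39, 16),
    Entry("LLaMA-Adapter V2", 46.12, 22.08, 28.70, 30.46, 7),
    Entry("InternLM-XComposer-VL", 35.97, 18.61, 26.77, 26.84, 9),
]

# Rank by reported overall score, best first, and show each model's
# gap to the top-ranked entry.
ranked = sorted(ENTRIES, key=lambda e: e.overall, reverse=True)
best = ranked[0].overall
for rank, e in enumerate(ranked, start=1):
    print(f"{rank:2d}. {e.model:<22} overall={e.overall:6.2f}  gap={best - e.overall:6.2f}")
```

Sorting this way puts GPT-4V first by a wide margin: its overall score of 74.44 is nearly double that of the best-scoring open model in the table, SPHINX v2 at 39.48.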