Visual Question Answering (VQA) on VLM2-Bench
Evaluation Metrics
- Average Score on VLM2-bench (9 subtasks)
- GC-mat (general cue: matching)
- GC-trk (general cue: tracking)
- OC-cnt (object-centric cue: counting)
- OC-cpr (object-centric cue: comparison)
- OC-grp (object-centric cue: grouping)
- PC-VID (person-centric cue: video identity describing)
- PC-cnt (person-centric cue: counting)
- PC-cpr (person-centric cue: comparison)
- PC-grp (person-centric cue: grouping)
Evaluation Results
Performance of each model on this benchmark.
Model | Average Score on VLM2-bench (9 subtasks) | GC-mat | GC-trk | OC-cnt | OC-cpr | OC-grp | PC-VID | PC-cnt | PC-cpr | PC-grp | Paper Title | Repository |
---|---|---|---|---|---|---|---|---|---|---|---|---|
mPLUG-Owl3-7B | 37.85 | 17.37 | 18.26 | 62.97 | 49.17 | 31.00 | 13.50 | 58.86 | 63.50 | 26.00 | mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | - |
LLaVA-OneVision-7B | 39.35 | 16.60 | 13.70 | 56.17 | 47.22 | 27.50 | 47.25 | 46.67 | 62.00 | 37.00 | LLaVA-OneVision: Easy Visual Task Transfer | - |
InternVL2.5-26B | 45.59 | 30.50 | 30.59 | 51.48 | 43.33 | 52.50 | 21.75 | 59.70 | 59.50 | 61.00 | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | - |
LLaVA-Video-7B | 43.32 | 18.53 | 12.79 | 62.47 | 54.72 | 28.50 | 59.00 | 66.91 | 62.00 | 25.00 | Video Instruction Tuning With Synthetic Data | - |
LongVA-7B | 22.59 | 14.29 | 19.18 | 42.53 | 26.67 | 18.50 | 3.75 | 38.90 | 21.50 | 18.00 | Long Context Transfer from Language to Vision | - |
InternVL2.5-8B | 41.23 | 21.24 | 26.03 | 55.23 | 53.33 | 46.50 | 5.25 | 60.00 | 51.50 | 52.00 | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | - |
Qwen2-VL-7B | 42.37 | 27.80 | 19.18 | 45.99 | 68.06 | 35.00 | 16.25 | 58.59 | 61.50 | 49.00 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | - |
Qwen2.5-VL-7B | 54.82 | 35.91 | 43.38 | 41.72 | 71.39 | 47.50 | 46.50 | 57.98 | 80.00 | 69.00 | Qwen2.5-VL Technical Report | - |
GPT-4o | 60.36 | 37.45 | 39.27 | 80.62 | 74.17 | 57.50 | 66.75 | 90.50 | 50.00 | 47.00 | GPT-4o System Card | - |
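The Average Score column is consistent with the unweighted arithmetic mean of the nine subtask scores (e.g., averaging mPLUG-Owl3-7B's nine subtask values gives 37.85). The short Python sketch below checks this relation for one row of the table; the variable names are illustrative and not taken from the official VLM2-Bench evaluation code.

```python
# Minimal sketch: verify that the reported Average Score equals the
# unweighted mean of the nine VLM2-bench subtask scores. Names here are
# illustrative, not from the official benchmark code.

SUBTASKS = ["GC-mat", "GC-trk", "OC-cnt", "OC-cpr", "OC-grp",
            "PC-VID", "PC-cnt", "PC-cpr", "PC-grp"]

# Per-subtask scores for mPLUG-Owl3-7B, copied from the table above.
scores = {
    "GC-mat": 17.37, "GC-trk": 18.26,
    "OC-cnt": 62.97, "OC-cpr": 49.17, "OC-grp": 31.00,
    "PC-VID": 13.50, "PC-cnt": 58.86, "PC-cpr": 63.50, "PC-grp": 26.00,
}

average = sum(scores[t] for t in SUBTASKS) / len(SUBTASKS)
print(f"{average:.2f}")  # 37.85, matching the table's Average Score
```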