
Visual Question Answering on MM-Vet

Evaluation Metric

GPT-4 score

Evaluation Results

Performance of each model on this benchmark

Comparison Table
Model Name | GPT-4 score
list-items-one-by-one-a-new-data-source-and | 37.2
provision-programmatically-scaling-vision | 40.4
mmctagent-multi-modal-critical-thinking-agent | 74.24
deepseek-vl-towards-real-world-vision | 41.5
volcano-mitigating-multimodal-hallucination | 38.0
lova3-learning-to-visual-question-answering | 35.2
gamified-crowd-sourcing-of-high-quality-data | 52.43
mplug-owl2-revolutionizing-multi-modal-large | 36.3±0.1
llava-plus-learning-to-use-tools-for-creating | 27.5±0.3
mixture-of-subspaces-in-low-rank-adaptation | 35.2
janusflow-harmonizing-autoregression-and | 30.9
convllava-hierarchical-backbones-as-visual | 45.9
mini-gemini-mining-the-potential-of-multi | 53.0
infimm-hd-a-leap-forward-in-high-resolution | 38.9
a-stitch-in-time-saves-nine-small-vlm-is-a | 52.10
what-if-we-recaption-billions-of-web-images | 37.8
Model 17 | 64.4
feast-your-eyes-mixture-of-resolution | 35.5
collavo-crayon-large-language-and-vision | 40.3
silkie-preference-distillation-for-large | 49.9
janus-pro-unified-multimodal-understanding | 39.8
expanding-performance-boundaries-of-open | 72.3
calibrated-self-rewarding-vision-language | 33.9
mm-instruct-generated-visual-instructions-for | 37.1
densefusion-1m-merging-vision-experts-for | 37.8
inf-llava-dual-perspective-perception-for | 34.5
llavolta-efficient-multi-modal-models-via | 30.7
mm1-5-methods-analysis-insights-from | 52.0
minigpt-4-enhancing-vision-language | 24.4±0.4
gpt-4-technical-report-1 | 67.6±0.1
enhancing-visual-language-modality-alignment | 31.6
Model 32 | 78.1±0.2
generative-multimodal-models-are-in-context | 48.5
cogvlm-visual-expert-for-pretrained-language | 63.9
mousi-poly-visual-expert-vision-language | 38.4
llava-ph-efficient-multi-modal-assistant-with | 28.9
mimic-it-multi-modal-in-context-instruction | 24.7±0.3
deciphering-cross-modal-alignment-in-large | 32.2
Model 39 | 61.8
expanding-performance-boundaries-of-open | 68.8
mixture-of-subspaces-in-low-rank-adaptation | 35.2
cross-modal-safety-mechanism-transfer-in | 25.6
strengthening-multimodal-large-language-model | 41.4
a-stitch-in-time-saves-nine-small-vlm-is-a | 63.20
openflamingo-an-open-source-framework-for | 21.8±0.1
tokenpacker-efficient-visual-projector-for | 29.6
camml-context-aware-multimodal-learner-for | 36.4
mminstruct-a-high-quality-multi-modal | 34.4
dynamic-llava-efficient-multimodal-large | 37.3
vila-on-pre-training-for-visual-language | 45.7
uni-moe-scaling-unified-multimodal-llms-with | 32.8
beyond-embeddings-the-promise-of-visual-table | 39.8
expanding-performance-boundaries-of-open | 65.0
mm1-methods-analysis-insights-from-multimodal | 42.1
mini-gemini-mining-the-potential-of-multi | 60.8
visionzip-longer-is-better-but-not-necessary | 32.9
video-llava-learning-united-visual-1 | 32.0
sphinx-x-scaling-data-and-parameters-for-a | 47.9
visionzip-longer-is-better-but-not-necessary | 30.2
emu3-next-token-prediction-is-all-you-need | 37.2
mg-llava-towards-multi-granularity-visual | 48.5
taco-learning-multi-modal-action-models-with | 45.2
llava-plus-learning-to-use-tools-for-creating | 35.0±0.0
aligngpt-multi-modal-large-language-models | 35.6
meteor-mamba-based-traversal-of-rationale-for | 57.3
lyra-an-efficient-and-speech-centric | 63.5
mammoth-vl-eliciting-multimodal-reasoning | 60.6
volcano-mitigating-multimodal-hallucination | 32.0
mm1-5-methods-analysis-insights-from | 41.0
mmdu-a-multi-turn-multi-image-dialog | 38.8
Model 71 | 81.2±0.4
chain-of-spot-interactive-reasoning-improves | 37.6
aligned-vector-quantization-for-edge-cloud | 30.7
textit-v-guided-visual-search-as-a-core | 27.7
merlin-empowering-multimodal-llms-with | 34.9
gamified-crowd-sourcing-of-high-quality-data | 51.789
aligning-large-multi-modal-model-with-robust | 31.7±0.1
Model 78 | 54.7
mm1-5-methods-analysis-insights-from | 39.8
mm1-5-methods-analysis-insights-from | 42.2
vlfeedback-a-large-scale-ai-feedback-dataset | 44.2
points-improving-your-vision-language-model | 50.0
an-empirical-study-of-scaling-instruct-tuned | 36.4
multi-modal-auto-regressive-modeling-via | 44.0
visionzip-longer-is-better-but-not-necessary | 32.6
mmfuser-multimodal-multi-layer-feature-fuser | 36.6
img-diff-contrastive-data-synthesis-for | 44.1
deepstack-deeply-stacking-visual-tokens-is | 39.3
looking-beyond-text-reducing-language-bias-in | 39.90
visionzip-longer-is-better-but-not-necessary | 32.6
onellm-one-framework-to-align-all-modalities | 29.1
internlm-xcomposer2-mastering-free-form-text | 51.2
minigpt-4-enhancing-vision-language | 22.1±0.1
xmodel-vlm-a-simple-baseline-for-multimodal | 21.8
sq-llava-self-questioning-for-large-vision | 35.5
focusllava-a-coarse-to-fine-approach-for | 41.3
mminstruct-a-high-quality-multi-modal | 37.9
a-stitch-in-time-saves-nine-small-vlm-is-a | 65.60
h2ovl-mississippi-vision-language-models | 44.7
towards-semantic-equivalence-of-tokenization | 48.7
gpt-4-technical-report-1 | 69.3±0.1
image-of-thought-prompting-for-visual | 72.2
mm1-5-methods-analysis-insights-from | 37.4
gpt-4-technical-report-1 | 68.6±0.1
Model 105 | 45.3
mplug-owl3-towards-long-image-sequence | 40.1
tokenpacker-efficient-visual-projector-for | 34.1
dreamllm-synergistic-multimodal-comprehension | 35.9
visionzip-longer-is-better-but-not-necessary | 31.7
strengthening-multimodal-large-language-model | 36.8
llava-onevision-easy-visual-task-transfer | 57.5
dynamic-mixture-of-experts-an-auto-tuning | 33.6
internlm-xcomposer2-4khd-a-pioneering-large | 54.9
illume-illuminating-your-llms-to-see-draw-and | 37.0
aligngpt-multi-modal-large-language-models | 30.8
hallucination-augmented-contrastive-learning | 30.4
maven-an-effective-multi-granularity-hybrid | 30.4
vlfeedback-a-large-scale-ai-feedback-dataset | 50.7
omnifusion-technical-report | 39.40
claude-3-5-sonnet-model-card-addendum | 74.2±0.2
lyra-an-efficient-and-speech-centric | 71.4
video-lavit-unified-video-language-pre | 33.2
cumo-scaling-multimodal-llm-with-co-upcycled | 51.0
robocodex-multimodal-code-generation-for | 31.0
dragonfly-multi-resolution-zoom-supercharges | 35.9
qwen-vl-a-frontier-large-vision-language | 66.6±0.5
mm1-methods-analysis-insights-from-multimodal | 43.7
teamlora-boosting-low-rank-adaptation-with | 31.2
vlfeedback-a-large-scale-ai-feedback-dataset | 49.9
improved-baselines-with-visual-instruction | 31.1±0.2
mm-react-prompting-chatgpt-for-multimodal | 27.9±0.1
baichuan-omni-technical-report | 65.4
calibrated-self-rewarding-vision-language | 37.8
janus-pro-unified-multimodal-understanding | 50.0
mmar-towards-lossless-multi-modal-auto | 18.49
cogagent-a-visual-language-model-for-gui | 52.8
gemini-a-family-of-highly-capable-multimodal-1 | 64.3±0.4
qwen2-vl-enhancing-vision-language-model-s | 49.5
stablellava-enhanced-visual-instruction | 36.1
imp-highly-capable-large-multimodal-models | 44.6
cogvlm-visual-expert-for-pretrained-language | 52.8
flashsloth-lightning-multimodal-large | 41.9
list-items-one-by-one-a-new-data-source-and | 35.9
sharegpt4v-improving-large-multi-modal-models | 43.1
llava-onevision-easy-visual-task-transfer | 63.7
mini-gemini-mining-the-potential-of-multi | 59.3
Model 147 | 57.4
rethinking-visual-prompting-for-multimodal | 35.1
deciphering-cross-modal-alignment-in-large | 42.9
qwen2-vl-enhancing-vision-language-model-s | 74.0
ferret-v2-an-improved-baseline-for-referring | 35.7
moai-mixture-of-all-intelligence-for-large | 43.7
explore-the-limits-of-omni-modal-pretraining | 31.4
openflamingo-an-open-source-framework-for | 24.8±0.2
gpt-4-technical-report-1 | 67.7±0.3
tinyllava-a-framework-of-small-scale-large | 32.0
expanding-performance-boundaries-of-open | 60.8
how-far-are-we-to-gpt-4v-closing-the-gap-to | 48.9
densefusion-1m-merging-vision-experts-for | 37.5
otterhd-a-high-resolution-multi-modality | 26.3
cogvlm2-visual-language-models-for-image-and | 71.1
internlm-xcomposer-2-5-a-versatile-large | 51.7
taco-learning-multi-modal-action-models-with | 45.7
gpt-4-technical-report-1 | 60.2±0.3
improved-baselines-with-visual-instruction | 36.3±0.2
h2ovl-mississippi-vision-language-models | 30.0
enhancing-multimodal-large-language-models | 38.9
vlfeedback-a-large-scale-ai-feedback-dataset | 44.1
lyra-an-efficient-and-speech-centric | 51.2
sea-supervised-embedding-alignment-for-token | 48.8
generative-pretraining-in-multimodality | 36.3±0.3
how-far-are-we-to-gpt-4v-closing-the-gap-to | 62.8
mammoth-vl-eliciting-multimodal-reasoning | 62.3
infmllm-a-unified-framework-for-visual | 33.4
flashsloth-lightning-multimodal-large | 49.0
vila-2-vila-augmented-vila | 50.0
blip-2-bootstrapping-language-image-pre | 22.4±0.2
looking-beyond-text-reducing-language-bias-in | 35.20
expanding-performance-boundaries-of-open | 48.8
janus-decoupling-visual-encoding-for-unified | 34.3
enhancing-large-vision-language-models-with | 45.0
mm1-methods-analysis-insights-from-multimodal | 48.7
qwen-vl-a-frontier-large-vision-language | 61.1±0.2
vary-scaling-up-the-vision-vocabulary-for | 36.2
imp-highly-capable-large-multimodal-models | 43.3
enhancing-large-vision-language-models-with | 32.6
a-comprehensive-overhaul-of-multimodal | 32.1
mimic-it-multi-modal-in-context-instruction | 24.6±0.2
vl-mamba-exploring-state-space-models-for | 32.6
self-supervised-visual-preference-alignment | 41.0
sphinx-the-joint-mixing-of-weights-tasks-and | 40.2
dynamic-llava-efficient-multimodal-large | 32.2
llava-onevision-easy-visual-task-transfer | 29.1
mmar-towards-lossless-multi-modal-auto | 27.80
cogvlm2-visual-language-models-for-image-and | 58.0
self-supervised-visual-preference-alignment | 37.2
gemini-1-5-unlocking-multimodal-understanding | 65.8±0.1
visual-agents-as-fast-and-slow-thinkers | 31.0
g-mod-exploring-mixture-of-depth-adaptation | 34.0
moe-llava-mixture-of-experts-for-large-vision | 35.9
imp-highly-capable-large-multimodal-models | 33.5
qwen2-vl-enhancing-vision-language-model-s | 62.0
improving-multi-modal-large-language-model | 34.8
expanding-performance-boundaries-of-open | 60.6
coco-is-all-you-need-for-visual-instruction | 37.5
provision-programmatically-scaling-vision | 38.5
visionzip-longer-is-better-but-not-necessary | 31.7
mm1-5-methods-analysis-insights-from | 43.7
robomamba-multimodal-state-space-model-for | 29.7
the-all-seeing-project-v2-towards-general | 41.3
crome-cross-modal-adapters-for-efficient | 55.1
expanding-performance-boundaries-of-open | 62.8
small-language-model-meets-with-reinforced | 29.0
sq-llava-self-questioning-for-large-vision | 39.7
llama-adapter-v2-parameter-efficient-visual | 31.4±0.1
mm-instruct-generated-visual-instructions-for | 32.9
phantom-of-latent-for-large-language-and | 70.8
Model 218 | 58.1±0.1
mm-react-prompting-chatgpt-for-multimodal | 44.6±0.2
gamified-crowd-sourcing-of-high-quality-data | 64.954
sharegpt4v-improving-large-multi-modal-models | 37.6
beyond-embeddings-the-promise-of-visual-table | 31.8
hyperllava-dynamic-visual-and-language-expert | 31.0
trol-traversal-of-layers-for-large-language | 54.7
gemini-1-5-unlocking-multimodal-understanding | 76.9±0.1
linvt-empower-your-image-level-large-language | 23.5
to-see-is-to-believe-prompting-gpt-4v-for | 40.2
taco-learning-multi-modal-action-models-with | 50.9
textbind-multi-turn-interleaved-multimodal | 19.4