Video Based Generative Performance

Metrics

Consistency

Contextual Understanding

Correctness of Information

Detail Orientation

Temporal Understanding

mean
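
The page does not define how the mean is computed. The sketch below assumes it is the unweighted arithmetic mean of the five metric scores, rounded to two decimals. The function name `mean_score` is hypothetical and used only for illustration.

```python
from typing import Sequence


def mean_score(scores: Sequence[float]) -> float:
    """Unweighted arithmetic mean of the per-metric scores, rounded to 2 decimals.

    Assumption: the benchmark's "mean" column is computed this way;
    the page itself does not say so.
    """
    if not scores:
        raise ValueError("at least one score is required")
    return round(sum(scores) / len(scores), 2)


# Scores in the order: Consistency, Contextual Understanding,
# Correctness of Information, Detail Orientation, Temporal Understanding.
example = [2.84, 3.72, 3.40, 2.91, 2.65]
print(f"{mean_score(example):.2f}")  # 3.10
```

Under this assumption, the scores 2.84, 3.72, 3.40, 2.91, and 2.65 give 3.10, which matches the first row of the comparison table.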

Results

Performance of various models on this benchmark. A dash (-) marks a score that was not reported.

Comparison Table

Model Name	Consistency	Contextual Understanding	Correctness of Information	Detail Orientation	Temporal Understanding	mean
mvbench-a-comprehensive-multi-modal-video	2.84	3.72	3.40	2.91	2.65	3.10
one-for-all-video-conversation-is-feasible	2.20	2.89	2.16	2.46	2.13	2.46
ts-llava-constructing-visual-tokens-through	-	-	-	-	-	3.38
llama-vid-an-image-is-worth-2-tokens-in-large	2.51	3.53	2.96	3.00	2.46	2.89
llama-vid-an-image-is-worth-2-tokens-in-large	2.63	3.60	3.07	3.05	2.58	2.99
llama-adapter-v2-parameter-efficient-visual	2.15	2.30	2.03	2.32	1.98	2.16
one-for-all-video-conversation-is-feasible	2.46	3.27	2.68	2.69	2.34	2.69
tuning-large-multimodal-models-for-videos	3.32	4.00	3.63	3.25	3.23	3.49
mvbench-a-comprehensive-multi-modal-video	2.81	3.51	3.02	2.88	2.66	2.98
cat-enhancing-multimodal-large-language-model	2.89	3.49	3.08	2.95	2.81	3.07
ppllava-varied-video-sequence-understanding	3.20	3.88	3.32	3.20	3.00	3.32
pllava-parameter-free-llava-extension-from-1	3.25	3.90	3.60	3.20	2.67	3.32
videogpt-integrating-image-and-video-encoders	3.39	3.74	3.27	3.18	2.83	3.28
chat-univi-unified-visual-representation	2.81	3.46	2.89	2.91	2.39	2.99
lita-language-instructed-temporal	3.19	3.43	2.94	2.98	2.68	3.04
ppllava-varied-video-sequence-understanding	3.81	4.21	3.85	3.56	3.21	3.73
videochat-chat-centric-video-understanding	2.24	2.53	2.23	2.50	1.94	2.29
slowfast-llava-a-strong-training-free	-	-	-	-	-	3.32
st-llm-large-language-models-are-effective-1	2.81	3.74	3.23	3.05	2.93	3.15
video-llama-an-instruction-tuned-audio-visual	1.79	2.16	1.96	2.18	1.82	1.98
vtimellm-empower-llm-to-grasp-video-moments	2.47	3.40	2.78	3.10	2.49	2.85
video-chatgpt-towards-detailed-video	2.37	2.62	2.40	2.52	1.98	2.38
an-image-grid-can-be-worth-a-video-zero-shot	3.13	3.61	3.40	2.80	2.89	3.17
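
To work with the table programmatically, a minimal sketch along the following lines may help. It assumes the rows are tab-separated as shown above and that a dash means a missing score. The helper names (`parse_score`, `parse_table`, `rank_by_mean`) are hypothetical, and only three rows from the table are included as sample input.

```python
from typing import Dict, List, Optional, Union

Row = Dict[str, Union[str, Optional[float]]]

# Three rows copied from the comparison table, tab-separated.
SAMPLE = (
    "Model Name\tConsistency\tContextual Understanding\t"
    "Correctness of Information\tDetail Orientation\t"
    "Temporal Understanding\tmean\n"
    "video-llama-an-instruction-tuned-audio-visual\t1.79\t2.16\t1.96\t2.18\t1.82\t1.98\n"
    "ts-llava-constructing-visual-tokens-through\t-\t-\t-\t-\t-\t3.38\n"
    "ppllava-varied-video-sequence-understanding\t3.81\t4.21\t3.85\t3.56\t3.21\t3.73\n"
)


def parse_score(cell: str) -> Optional[float]:
    """Convert a table cell to a float; a dash is treated as missing (assumption)."""
    cell = cell.strip()
    return None if cell == "-" else float(cell)


def parse_table(text: str) -> List[Row]:
    """Parse the tab-separated table into one dict per model row."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    rows: List[Row] = []
    for line in lines[1:]:
        cells = line.split("\t")
        row: Row = {header[0]: cells[0]}
        for name, cell in zip(header[1:], cells[1:]):
            row[name] = parse_score(cell)
        rows.append(row)
    return rows


def rank_by_mean(rows: List[Row]) -> List[Row]:
    """Sort rows by the reported mean, highest first."""
    return sorted(rows, key=lambda r: r["mean"], reverse=True)


ranked = rank_by_mean(parse_table(SAMPLE))
for row in ranked:
    print(row["Model Name"], row["mean"])
```

Sorting on the reported mean rather than recomputing it keeps rows that list only a mean in the ranking.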