HyperAI

Video Based Generative Performance

Metrics

Consistency
Contextual Understanding
Correctness of Information
Detail Orientation
Temporal Understanding
mean

Results

Performance results of various models on this benchmark

Comparison Table
| Model Name | Consistency | Contextual Understanding | Correctness of Information | Detail Orientation | Temporal Understanding | mean |
|---|---|---|---|---|---|---|
| mvbench-a-comprehensive-multi-modal-video | 2.84 | 3.72 | 3.40 | 2.91 | 2.65 | 3.10 |
| one-for-all-video-conversation-is-feasible | 2.2 | 2.89 | 2.16 | 2.46 | 2.13 | 2.46 |
| ts-llava-constructing-visual-tokens-through | - | - | - | - | - | 3.38 |
| llama-vid-an-image-is-worth-2-tokens-in-large | 2.51 | 3.53 | 2.96 | 3.00 | 2.46 | 2.89 |
| llama-vid-an-image-is-worth-2-tokens-in-large | 2.63 | 3.60 | 3.07 | 3.05 | 2.58 | 2.99 |
| llama-adapter-v2-parameter-efficient-visual | 2.15 | 2.30 | 2.03 | 2.32 | 1.98 | 2.16 |
| one-for-all-video-conversation-is-feasible | 2.46 | 3.27 | 2.68 | 2.69 | 2.34 | 2.69 |
| tuning-large-multimodal-models-for-videos | 3.32 | 4 | 3.63 | 3.25 | 3.23 | 3.49 |
| mvbench-a-comprehensive-multi-modal-video | 2.81 | 3.51 | 3.02 | 2.88 | 2.66 | 2.98 |
| cat-enhancing-multimodal-large-language-model | 2.89 | 3.49 | 3.08 | 2.95 | 2.81 | 3.07 |
| ppllava-varied-video-sequence-understanding | 3.20 | 3.88 | 3.32 | 3.20 | 3.0 | 3.32 |
| pllava-parameter-free-llava-extension-from-1 | 3.25 | 3.90 | 3.60 | 3.20 | 2.67 | 3.32 |
| videogpt-integrating-image-and-video-encoders | 3.39 | 3.74 | 3.27 | 3.18 | 2.83 | 3.28 |
| chat-univi-unified-visual-representation | 2.81 | 3.46 | 2.89 | 2.91 | 2.39 | 2.99 |
| lita-language-instructed-temporal | 3.19 | 3.43 | 2.94 | 2.98 | 2.68 | 3.04 |
| ppllava-varied-video-sequence-understanding | 3.81 | 4.21 | 3.85 | 3.56 | 3.21 | 3.73 |
| videochat-chat-centric-video-understanding | 2.24 | 2.53 | 2.23 | 2.50 | 1.94 | 2.29 |
| slowfast-llava-a-strong-training-free | - | - | - | - | - | 3.32 |
| st-llm-large-language-models-are-effective-1 | 2.81 | 3.74 | 3.23 | 3.05 | 2.93 | 3.15 |
| video-llama-an-instruction-tuned-audio-visual | 1.79 | 2.16 | 1.96 | 2.18 | 1.82 | 1.98 |
| vtimellm-empower-llm-to-grasp-video-moments | 2.47 | 3.40 | 2.78 | 3.10 | 2.49 | 2.85 |
| video-chatgpt-towards-detailed-video | 2.37 | 2.62 | 2.4 | 2.52 | 1.98 | 2.38 |
| an-image-grid-can-be-worth-a-video-zero-shot | 3.13 | 3.61 | 3.40 | 2.80 | 2.89 | 3.17 |
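
The page does not state how the mean column is derived. The sketch below assumes it is the unweighted average of the five per-metric scores, rounded to two decimals. Most rows in the comparison table are consistent with that reading, but a few listed means differ slightly, so treat this as an illustration rather than the leaderboard's confirmed formula. The names `METRICS` and `mean_score` are hypothetical.

```python
from statistics import mean

# Metric names in the order used by the comparison table.
METRICS = (
    "Consistency",
    "Contextual Understanding",
    "Correctness of Information",
    "Detail Orientation",
    "Temporal Understanding",
)


def mean_score(scores):
    """Return the unweighted average of the five metric scores.

    Assumption: the leaderboard's mean column is a plain arithmetic
    mean rounded to two decimals; the page does not confirm this.
    """
    if len(scores) != len(METRICS):
        raise ValueError(f"expected {len(METRICS)} scores, got {len(scores)}")
    return round(mean(scores), 2)


# Scores taken from the highest-ranked ppllava row in the table.
row = [3.81, 4.21, 3.85, 3.56, 3.21]
print(mean_score(row))  # 3.73, matching the listed mean for that row
```

Rows that report only a mean (shown with `-` in the other columns) cannot be checked this way, since their per-metric scores are not listed.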