Video Based Generative Performance
Metrics
Consistency
Contextual Understanding
Correctness of Information
Detail Orientation
Temporal Understanding
mean
Results
Performance results of various models on this benchmark
Comparison Table
Model Name | Consistency | Contextual Understanding | Correctness of Information | Detail Orientation | Temporal Understanding | mean |
---|---|---|---|---|---|---|
mvbench-a-comprehensive-multi-modal-video | 2.84 | 3.72 | 3.40 | 2.91 | 2.65 | 3.10 |
one-for-all-video-conversation-is-feasible | 2.20 | 2.89 | 2.16 | 2.46 | 2.13 | 2.46 |
ts-llava-constructing-visual-tokens-through | - | - | - | - | - | 3.38 |
llama-vid-an-image-is-worth-2-tokens-in-large | 2.51 | 3.53 | 2.96 | 3.00 | 2.46 | 2.89 |
llama-vid-an-image-is-worth-2-tokens-in-large | 2.63 | 3.60 | 3.07 | 3.05 | 2.58 | 2.99 |
llama-adapter-v2-parameter-efficient-visual | 2.15 | 2.30 | 2.03 | 2.32 | 1.98 | 2.16 |
one-for-all-video-conversation-is-feasible | 2.46 | 3.27 | 2.68 | 2.69 | 2.34 | 2.69 |
tuning-large-multimodal-models-for-videos | 3.32 | 4.00 | 3.63 | 3.25 | 3.23 | 3.49 |
mvbench-a-comprehensive-multi-modal-video | 2.81 | 3.51 | 3.02 | 2.88 | 2.66 | 2.98 |
cat-enhancing-multimodal-large-language-model | 2.89 | 3.49 | 3.08 | 2.95 | 2.81 | 3.07 |
ppllava-varied-video-sequence-understanding | 3.20 | 3.88 | 3.32 | 3.20 | 3.00 | 3.32 |
pllava-parameter-free-llava-extension-from-1 | 3.25 | 3.90 | 3.60 | 3.20 | 2.67 | 3.32 |
videogpt-integrating-image-and-video-encoders | 3.39 | 3.74 | 3.27 | 3.18 | 2.83 | 3.28 |
chat-univi-unified-visual-representation | 2.81 | 3.46 | 2.89 | 2.91 | 2.39 | 2.99 |
lita-language-instructed-temporal | 3.19 | 3.43 | 2.94 | 2.98 | 2.68 | 3.04 |
ppllava-varied-video-sequence-understanding | 3.81 | 4.21 | 3.85 | 3.56 | 3.21 | 3.73 |
videochat-chat-centric-video-understanding | 2.24 | 2.53 | 2.23 | 2.50 | 1.94 | 2.29 |
slowfast-llava-a-strong-training-free | - | - | - | - | - | 3.32 |
st-llm-large-language-models-are-effective-1 | 2.81 | 3.74 | 3.23 | 3.05 | 2.93 | 3.15 |
video-llama-an-instruction-tuned-audio-visual | 1.79 | 2.16 | 1.96 | 2.18 | 1.82 | 1.98 |
vtimellm-empower-llm-to-grasp-video-moments | 2.47 | 3.40 | 2.78 | 3.10 | 2.49 | 2.85 |
video-chatgpt-towards-detailed-video | 2.37 | 2.62 | 2.40 | 2.52 | 1.98 | 2.38 |
an-image-grid-can-be-worth-a-video-zero-shot | 3.13 | 3.61 | 3.40 | 2.80 | 2.89 | 3.17 |
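
The mean column appears to be the unweighted average of the five per-dimension scores, rounded to two decimals. That is an assumption inferred from the rows themselves, not something the page states. The sketch below parses one table row and recomputes the mean so it can be compared with the reported value. The helper names (`parse_row`, `recomputed_mean`, `mean_matches`) and the tolerance are hypothetical and chosen for illustration only.

```python
from statistics import mean

# Column order of the comparison table (the five score columns, then "mean").
METRICS = [
    "Consistency",
    "Contextual Understanding",
    "Correctness of Information",
    "Detail Orientation",
    "Temporal Understanding",
]


def parse_row(line):
    """Split one pipe-delimited table row into (name, scores, reported_mean).

    A "-" cell (score not reported) becomes None.
    """
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    name, *values = cells
    numbers = [None if v == "-" else float(v) for v in values]
    return name, numbers[: len(METRICS)], numbers[len(METRICS)]


def recomputed_mean(scores):
    """Unweighted average of the five scores, or None if any is missing.

    Assumption: the leaderboard mean is a plain arithmetic mean.
    """
    if any(s is None for s in scores):
        return None
    return round(mean(scores), 2)


def mean_matches(line, tol=0.015):
    """True/False if the reported mean agrees with the recomputed one,
    None when the row lacks per-dimension scores."""
    _, scores, reported = parse_row(line)
    value = recomputed_mean(scores)
    if value is None or reported is None:
        return None
    return abs(value - reported) <= tol


if __name__ == "__main__":
    row = "tuning-large-multimodal-models-for-videos | 3.32 | 4 | 3.63 | 3.25 | 3.23 | 3.49 |"
    name, scores, reported = parse_row(row)
    print(name, recomputed_mean(scores), reported, mean_matches(row))
```

Rows that report only a mean (all five score cells are `-`) cannot be checked this way. A row where the check returns `False` may use a different aggregation or contain a transcription difference; the sketch only flags the discrepancy and does not decide which value is correct.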