HyperAI超神经

Video Question Answering on MVBench

Evaluation Metric

Avg.

Evaluation Results

Performance of each model on this benchmark

| Model | Avg. | Paper Title | Repository |
| --- | --- | --- | --- |
| ST-LLM | 54.9 | ST-LLM: Large Language Models Are Effective Temporal Learners | |
| Tarsier (34B) | 67.6 | Tarsier: Recipes for Training and Evaluating Large Video Description Models | |
| PPLLaVA (7B) | 59.2 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | - |
| MiniGPT4 | 18.8 | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | |
| VideoChat | 35.5 | VideoChat: Chat-Centric Video Understanding | |
| Oryx (34B) | 64.7 | Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | |
| InstructBLIP | 32.5 | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | |
| LongVU (7B) | 66.9 | LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | |
| VideoLLaMA | 34.1 | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | |
| LinVT-Qwen2-VL (7B) | 69.3 | LinVT: Empower Your Image-level Large Language Model to Understand Videos | |
| HawkEye | 47.55 | HawkEye: Training Video-Text LLMs for Grounding Text in Videos | |
| Video-ChatGPT | 32.7 | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | |
| SPHINX-Plus | 39.7 | SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | |
| VideoLLaMA2 (72B) | 62.0 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | |
| VideoChat2 | 51.9 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| mPLUG-Owl3 (7B) | 59.5 | mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | |
| PLLaVA | 58.1 | PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | |
| InternVideo2 | 67.2 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | |
| VideoGPT+ | 58.7 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | |
| LLaVA | 36.0 | Visual Instruction Tuning | |