Video Based Generative Performance
Evaluation Metrics
Consistency
Contextual Understanding
Correctness of Information
Detail Orientation
Temporal Understanding
mean
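These five axes follow the GPT-assisted, video-based generative performance protocol popularized by Video-ChatGPT, in which each model response is scored on a 1-5 scale per axis. For most rows in the table below, the mean column matches the unweighted average of the five axis scores. The following minimal Python sketch illustrates that computation; the overall_score helper is hypothetical (not part of the leaderboard tooling), and a few rows deviate slightly from this average, so treat it as an approximation.

```python
from statistics import mean

def overall_score(consistency, contextual, correctness, detail, temporal):
    """Assumed aggregation: unweighted average of the five axis scores,
    rounded to two decimals (e.g. VLM-RLAIF: (3.32 + 4 + 3.63 + 3.25 + 3.23) / 5 ≈ 3.49)."""
    return round(mean([consistency, contextual, correctness, detail, temporal]), 2)

# Example using the VideoChat2_HD_mistral row from the table below:
print(overall_score(2.84, 3.72, 3.40, 2.91, 2.65))  # -> 3.1
```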
Evaluation Results
Results of each model on this benchmark
| Model Name | Consistency | Contextual Understanding | Correctness of Information | Detail Orientation | Temporal Understanding | mean | Paper Title | Repository |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VideoChat2_HD_mistral | 2.84 | 3.72 | 3.40 | 2.91 | 2.65 | 3.10 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| BT-Adapter (zero-shot) | 2.2 | 2.89 | 2.16 | 2.46 | 2.13 | 2.46 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | |
| TS-LLaVA-34B | - | - | - | - | - | 3.38 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | |
| LLaMA-VID-7B (2 Token) | 2.51 | 3.53 | 2.96 | 3.00 | 2.46 | 2.89 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | |
| LLaMA-VID-13B (2 Token) | 2.63 | 3.60 | 3.07 | 3.05 | 2.58 | 2.99 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | |
| LLaMA Adapter | 2.15 | 2.30 | 2.03 | 2.32 | 1.98 | 2.16 | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | |
| BT-Adapter | 2.46 | 3.27 | 2.68 | 2.69 | 2.34 | 2.69 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | |
| VLM-RLAIF | 3.32 | 4 | 3.63 | 3.25 | 3.23 | 3.49 | Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback | - |
| VideoChat2 | 2.81 | 3.51 | 3.02 | 2.88 | 2.66 | 2.98 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| CAT-7B | 2.89 | 3.49 | 3.08 | 2.95 | 2.81 | 3.07 | CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | |
| PPLLaVA-7B | 3.20 | 3.88 | 3.32 | 3.20 | 3.0 | 3.32 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | - |
| PLLaVA-34B | 3.25 | 3.90 | 3.60 | 3.20 | 2.67 | 3.32 | PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | |
| VideoGPT+ | 3.39 | 3.74 | 3.27 | 3.18 | 2.83 | 3.28 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | |
| Chat-UniVi | 2.81 | 3.46 | 2.89 | 2.91 | 2.39 | 2.99 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | |
| LITA-13B | 3.19 | 3.43 | 2.94 | 2.98 | 2.68 | 3.04 | LITA: Language Instructed Temporal-Localization Assistant | |
| PPLLaVA-7B-dpo | 3.81 | 4.21 | 3.85 | 3.56 | 3.21 | 3.73 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | - |
| Video Chat | 2.24 | 2.53 | 2.23 | 2.50 | 1.94 | 2.29 | VideoChat: Chat-Centric Video Understanding | |
| SlowFast-LLaVA-34B | - | - | - | - | - | 3.32 | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | |
| ST-LLM-7B | 2.81 | 3.74 | 3.23 | 3.05 | 2.93 | 3.15 | ST-LLM: Large Language Models Are Effective Temporal Learners | |
| Video LLaMA | 1.79 | 2.16 | 1.96 | 2.18 | 1.82 | 1.98 | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | |