Video Question Answering on ActivityNet-QA
Evaluation metric: Accuracy

Benchmark results: the performance of each model on this benchmark. ActivityNet-QA consists of 58,000 human-annotated question-answer pairs over 5,800 complex web videos derived from the ActivityNet dataset.
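Accuracy here is the percentage of questions answered correctly. Below is a minimal sketch of an exact-match version of the metric, assuming predictions and ground-truth answers arrive as parallel lists of strings; the function name and normalization are illustrative assumptions, not the benchmark's official scorer. Note also that several of the LLM-based entries in the table report GPT-assisted answer matching rather than strict string equality.

```python
# Exact-match accuracy over parallel answer lists, in percent.
# A sketch only: names and the lowercase/strip normalization are
# illustrative assumptions, not the official evaluation script.
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    assert len(predictions) == len(references), "lists must be parallel"
    correct = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)

# 3 of 4 answers match, so this prints 75.0
print(exact_match_accuracy(
    ["a dog", "red", "two", "yes"],
    ["a dog", "red", "three", "yes"],
))
```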
| Model Name | Accuracy (%) | Paper Title | Repository |
|---|---|---|---|
| LocVLM-Vid-B+ | 38.2 | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | |
| E-MN | 27.1 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | |
| VindLU | 44.7 | VindLU: A Recipe for Effective Video-and-Language Pretraining | |
| Video-LLaVA | 45.3 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | |
| E-VQA | 25.1 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | |
| VALOR | 48.6 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| E-SA | 31.8 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | |
| BT-Adapter (zero-shot) | 46.1 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | |
| Mirasol3B | 51.13 | Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | - |
| Chat-UniVi-13B | 46.4 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | |
| MA-LMM | 49.8 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | |
| FrozenBiLM+ | 44.8 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | |
| MovieChat | 45.7 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | |
| Video-ChatGPT | 35.2 | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | |
| LLaMA-VID-7B (2 Token) | 47.4 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | |
| VAST | 50.4 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| TESTA (ViT-B/16) | 45 | TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | |
| GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | 61.2 | Composing Ensembles of Pre-trained Models via Iterative Consensus | - |
| LocVLM-Vid-B | 37.4 | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | |
| VideoCoCa | 56.1 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - |
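Since the leaderboard is plain tabular data, it can be re-ranked or filtered in a few lines; a minimal sketch over a handful of rows transcribed from the table above (data abridged, and the record layout is an illustrative choice):

```python
# Re-rank a few rows from the table above by reported accuracy (%).
rows = [
    ("GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot)", 61.2),
    ("VideoCoCa", 56.1),
    ("Mirasol3B", 51.13),
    ("VAST", 50.4),
    ("E-VQA", 25.1),
]

for rank, (model, acc) in enumerate(
        sorted(rows, key=lambda row: row[1], reverse=True), start=1):
    print(f"{rank}. {model}: {acc}%")
```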
(20 of the leaderboard's 36 entries are shown above.)