HyperAI超神经
Video Question Answering on MSRVTT-QA
Evaluation metric
Accuracy
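The page does not say how accuracy is scored. As a rough sketch, assuming the usual open-ended QA setup in which a prediction counts as correct only when it exactly matches the reference answer, the metric could be computed as below. The function name `qa_accuracy`, the case and whitespace normalization, and the sample answers are illustrative assumptions, not taken from the benchmark's official evaluation code.

```python
def qa_accuracy(predictions, references):
    """Return top-1 exact-match accuracy as a percentage.

    Assumption: an answer is correct when it equals the reference
    after trimming whitespace and lowercasing. Official evaluation
    scripts may normalize answers differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")

    correct = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Hypothetical answers for four questions; two of them match.
example_predictions = ["dog", "running", "Two", "guitar"]
example_references = ["dog", "walking", "two", "piano"]
example_score = qa_accuracy(example_predictions, example_references)
print(example_score)  # 50.0
```

Scores on the leaderboard are reported on the same 0 to 100 scale.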
Evaluation results
Performance of each model on this benchmark
| Model | Accuracy | Paper Title | Repository |
| --- | --- | --- | --- |
| FrozenBiLM | 47.0 | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | |
| mPLUG-2 | 48.0 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | |
| FrozenBiLM (0-shot) | 16.7 | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | |
| VIOLETv2 | 44.5 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | |
| Singularity-temporal | 43.9 | Revealing Single Frame Bias for Video-and-Language Learning | |
| HBI | 46.2 | Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | |
| VALOR | 49.2 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| Singularity | 43.5 | Revealing Single Frame Bias for Video-and-Language Learning | |
| VAST | 50.1 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| Mirasol3B | 50.42 | Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | - |
| VindLU | 44.6 | VindLU: A Recipe for Effective Video-and-Language Pretraining | |
| COSA | 49.2 | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | |
| MA-LMM | 48.5 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | |
| EMCL-Net | 45.8 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | |