# Video Captioning on MSR-VTT
## Evaluation Metrics

BLEU-4, CIDEr, METEOR, ROUGE-L
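All four metrics compare a generated caption against the human reference captions: BLEU-4 measures 4-gram precision, METEOR aligns unigrams with stemming and synonym matching, ROUGE-L scores the longest common subsequence, and CIDEr measures consensus via TF-IDF-weighted n-gram similarity. Leaderboard numbers are conventionally reported scaled by 100. The sketch below shows one common way to compute these scores, assuming the `pycocoevalcap` package (`pip install pycocoevalcap`; its METEOR scorer additionally needs a Java runtime). The video IDs and captions are illustrative placeholders, not MSR-VTT data.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

# Placeholder data: video id -> list of tokenized, lowercased captions.
# MSR-VTT supplies 20 human references per clip; two are shown here.
gts = {
    "video0": ["a man is singing on a stage", "a person performs a song live"],
    "video1": ["a cat is playing with a ball", "a kitten chases a small ball"],
}
# One generated caption per video, wrapped in a single-element list.
res = {
    "video0": ["a man sings on stage"],
    "video1": ["a cat plays with a ball"],
}

scorers = [
    (Bleu(4), "BLEU-4"),   # returns BLEU-1 through BLEU-4
    (Meteor(), "METEOR"),  # needs a Java runtime on PATH
    (Rouge(), "ROUGE-L"),
    (Cider(), "CIDEr"),    # IDF statistics come from the references, so
                           # scores are only meaningful over a full test set
]
for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)
    if name == "BLEU-4":
        score = score[3]   # keep only the 4-gram score
    print(f"{name}: {100 * score:.1f}")  # scaled by 100, as on the leaderboard
```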
## Evaluation Results

Performance of each model on this benchmark ("-" marks a metric the paper does not report):

| Model | BLEU-4 | CIDEr | METEOR | ROUGE-L | Paper |
|---|---|---|---|---|---|
| mPLUG-2 | 57.8 | 80.0 | 34.9 | 70.1 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| VAST | 56.7 | 78.0 | - | - | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| GIT2 | 54.8 | 75.9 | 33.1 | 68.2 | GIT: A Generative Image-to-text Transformer for Vision and Language |
| VLAB | 54.6 | 74.9 | 33.4 | 68.3 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending |
| COSA | 53.7 | 74.7 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model |
| VALOR | 54.4 | 74.0 | 32.9 | 68.0 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| MaMMUT | - | 73.6 | - | - | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks |
| VideoCoCa | 53.8 | 73.2 | - | 68.0 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners |
| RTQ | 49.6 | 69.3 | - | 66.1 | RTQ: Rethinking Video-language Understanding Based on Image-text Model |
| HowToCaption | 49.8 | 65.3 | 32.2 | 66.3 | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| HiTeA | 49.2 | 65.1 | 30.7 | 65.0 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| Vid2Seq | - | 64.6 | 30.8 | - | Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning |
| TextKG | 46.6 | 60.8 | 30.5 | 64.8 | Text with Knowledge Graph Augmented Transformer for Video Captioning |
| IcoCap (ViT-B/16) | 47.0 | 60.2 | 31.1 | 64.9 | IcoCap: Improving Video Captioning by Compounding Images |
| MV-GPT | 48.9 | 60.0 | 38.7 | 64.0 | End-to-end Generative Pretraining for Multimodal Video Captioning |
| IcoCap (ViT-B/32) | 46.1 | 59.1 | 30.3 | 64.3 | IcoCap: Improving Video Captioning by Compounding Images |
| CLIP-DCD | 48.2 | 58.7 | 31.3 | 64.8 | CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter |
| VIOLETv2 | - | 58.0 | - | - | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling |
| CoCap (ViT/L14) | 44.4 | 57.2 | 30.3 | 63.4 | Accurate and Fast Compressed Video Captioning |
| VASTA (Vatex-backbone) | 44.21 | 56.08 | 30.24 | 62.9 | Diverse Video Captioning by Adaptive Spatio-temporal Attention |
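To analyze the leaderboard programmatically, one might load the rows into pandas and rank them by CIDEr, the customary headline metric on this benchmark. This is a minimal sketch with a few rows transcribed from the table above; `None` stands in for the unreported ("-") entries.

```python
import pandas as pd

# A few rows transcribed from the leaderboard above; None marks
# metrics the source table leaves blank ("-").
rows = [
    ("mPLUG-2",   57.8, 80.0, 34.9, 70.1),
    ("VAST",      56.7, 78.0, None, None),
    ("GIT2",      54.8, 75.9, 33.1, 68.2),
    ("VLAB",      54.6, 74.9, 33.4, 68.3),
    ("VideoCoCa", 53.8, 73.2, None, 68.0),
]
cols = ["Model", "BLEU-4", "CIDEr", "METEOR", "ROUGE-L"]
df = pd.DataFrame(rows, columns=cols)

# Rank by CIDEr, highest first.
print(df.sort_values("CIDEr", ascending=False).to_string(index=False))
```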