HyperAI超神经

Video Captioning on MSVD 1

Evaluation Metrics

BLEU-4

CIDEr

METEOR

ROUGE-L

Evaluation Results

Performance of each model on this benchmark

Model	BLEU-4	CIDEr	METEOR	ROUGE-L	Paper Title	Repository
MaMMUT	-	195.6	-	-	MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
VLAB	79.3	179.8	51.2	87.9	VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	-
VALOR	80.7	178.5	51.0	87.9	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
COSA	76.5	178.5	-	-	COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
mPLUG-2	70.5	165.8	48.4	85.3	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
HowToCaption	70.4	154.2	46.4	83.2	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
HiTeA	71.0	146.9	45.3	81.4	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	-
Vid2Seq	-	146.2	45.3	-	Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
VIOLETv2	-	139.2	-	-	An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
RTQ	66.9	123.4	-	82.2	RTQ: Rethinking Video-language Understanding Based on Image-text Model
CoCap (ViT/L14)	60.1	121.5	41.4	78.2	Accurate and Fast Compressed Video Captioning
VASTA (Vatex-backbone)	59.2	119.7	40.65	76.7	Diverse Video Captioning by Adaptive Spatio-temporal Attention
IcoCap (ViT-B/16)	59.1	110.3	39.5	76.5	IcoCap: Improving Video Captioning by Compounding Images	-
SEM-POS	60.1	108.3	38.5	76.0	SEM-POS: Grammatically and Semantically Correct Video Captioning	-
VASTA (Kinetics-backbone)	56.1	106.4	39.1	74.5	Diverse Video Captioning by Adaptive Spatio-temporal Attention
IcoCap (ViT-B/32)	56.3	103.8	38.9	75.0	IcoCap: Improving Video Captioning by Compounding Images	-
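
The table marks unreported scores with "-", so any comparison across models has to skip those cells rather than treat them as zero. Below is a minimal sketch of reading a few of the rows above and picking the best model per metric. It assumes the four score columns follow the order of the metric list (BLEU-4, CIDEr, METEOR, ROUGE-L); the helper names `parse_score`, `parse_rows`, and `best_by` are hypothetical and not part of any HyperAI tooling.

```python
from typing import Optional

# A few rows copied from the table above: model, then four score columns.
# Assumption: column order matches the metric list (BLEU-4, CIDEr, METEOR, ROUGE-L).
ROWS = """\
MaMMUT\t-\t195.6\t-\t-
VLAB\t79.3\t179.8\t51.2\t87.9
VALOR\t80.7\t178.5\t51.0\t87.9
mPLUG-2\t70.5\t165.8\t48.4\t85.3
Vid2Seq\t-\t146.2\t45.3\t-
"""

METRICS = ("BLEU-4", "CIDEr", "METEOR", "ROUGE-L")


def parse_score(cell: str) -> Optional[float]:
    """Convert a table cell to a float; "-" or an empty cell means not reported."""
    cell = cell.strip()
    return None if cell in ("", "-") else float(cell)


def parse_rows(text: str) -> list:
    """Parse tab-separated rows into dicts keyed by 'model' and metric name."""
    table = []
    for line in text.splitlines():
        if not line.strip():
            continue
        model, *cells = line.split("\t")
        scores = {metric: parse_score(cell) for metric, cell in zip(METRICS, cells)}
        table.append({"model": model, **scores})
    return table


def best_by(table: list, metric: str) -> dict:
    """Return the row with the highest reported value for the given metric."""
    reported = [row for row in table if row[metric] is not None]
    return max(reported, key=lambda row: row[metric])


table = parse_rows(ROWS)
for metric in METRICS:
    winner = best_by(table, metric)
    print(f"{metric}: {winner['model']} ({winner[metric]})")
```

Skipping unreported cells matters here: MaMMUT leads on CIDEr but reports no other metric, so it should not appear in the BLEU-4, METEOR, or ROUGE-L comparisons at all.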
