HyperAI

Zero-Shot Video Retrieval on MSVD

Metrics

text-to-video R@1
text-to-video R@5
text-to-video R@10
video-to-text R@1
video-to-text R@5
video-to-text R@10

Results

Zero-shot retrieval performance of various models on the MSVD benchmark. Higher is better for all metrics; a dash (-) indicates the value was not reported.

| Model Name | T2V R@1 | T2V R@5 | T2V R@10 | V2T R@1 | V2T R@5 | V2T R@10 | Paper |
|---|---|---|---|---|---|---|---|
| InternVideo2-6B | 59.3 | 84.4 | 89.6 | 83.1 | 94.2 | 97.0 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| InternVideo2-1B | 58.1 | 83.0 | 88.4 | 83.3 | 94.3 | 96.9 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| VAST, HowToCaption-finetuned | 54.8 | 80.9 | 87.2 | - | - | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| LanguageBind (ViT-L/14) | 54.1 | 81.1 | 88.1 | 69.7 | 91.8 | 97.9 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| LanguageBind (ViT-H/14) | 53.9 | 80.4 | 87.8 | 72.0 | 91.4 | 96.3 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| vid-TLDR (UMT-L) | 50.0 | 77.6 | 85.5 | 75.7 | 90.0 | 95.1 | vid-TLDR: Training Free Token Merging for Light-weight Video Transformer |
| UMT-L (ViT-L/16) | 49.0 | 76.9 | 84.7 | 74.5 | 89.7 | 92.8 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| HowToCaption | 44.5 | 73.3 | 82.1 | - | - | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| MILES | 44.4 | 76.2 | 87.0 | - | - | - | - |
| Y. Ge et al. | 43.6 | 74.9 | 84.9 | - | - | - | Bridging Video-text Retrieval with Multiple Choice Questions |
| InternVideo | 43.4 | - | - | 67.6 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| CLIP4Clip | 38.5 | 66.9 | 76.8 | - | - | - | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| LaT | 36.9 | 68.6 | 81.0 | 34.4 | 69.0 | 79.2 | - |
| SSML | 13.66 | 35.7 | 47.74 | - | - | - | Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning |

T2V = text-to-video retrieval; V2T = video-to-text retrieval.