HyperAI
Zero-Shot Video Retrieval on DiDeMo
Metrics
text-to-video R@1
text-to-video R@10
text-to-video R@5
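The page does not define R@K. Assuming it is the usual recall at rank K (the percentage of text queries whose matching video appears among the top K retrieved videos), a minimal sketch of the computation could look like this. The function name `recall_at_k`, the score matrix layout, and the convention that query i matches video i are illustrative assumptions, not taken from the benchmark's evaluation code.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose matching video ranks in the top k.

    Assumption: similarity[i][j] is the score between text query i and
    video j, and the ground-truth video for query i is video i.
    """
    hits = 0
    for i, scores in enumerate(similarity):
        # Zero-based rank: how many videos score strictly higher than the true match.
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 score matrix with made-up numbers (not benchmark data).
toy = [
    [0.9, 0.1, 0.3],  # query 0: true video ranked 1st
    [0.8, 0.5, 0.6],  # query 1: true video ranked 3rd
    [0.2, 0.7, 0.4],  # query 2: true video ranked 2nd
]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(toy, k):.1f}")
```

Under this definition R@K cannot decrease as K grows, which matches the pattern of R@1 ≤ R@5 ≤ R@10 in every row reported on this page.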
Results
Performance results of various models on this benchmark
| Model Name | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | Paper Title | Repository |
| --- | --- | --- | --- | --- | --- |
| Singularity-5M | 36.9 | 69.3 | 61.1 | Revealing Single Frame Bias for Video-and-Language Learning | - |
| InternVideo2-6B | 57.9 | 84.6 | 80.0 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | - |
| BT-Adapter | 35.6 | 72.6 | 61.9 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | - |
| LanguageBind(ViT-H/14) | 39.9 | 74.6 | 66.1 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | - |
| HiTeA-17M | 43.2 | 79.0 | 69.3 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| Clover | 29.5 | 66.3 | 55.2 | Clover: Towards A Unified Video-Language Alignment and Fusion Model | - |
| LanguageBind(ViT-L/14) | 39.7 | 73.8 | 65.5 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | - |
| mPLUG-2 | 45.7 | 79.2 | 71.1 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | - |
| VAST | 55.5 | 79.6 | 74.3 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | - |
| Singularity-17M | 37.1 | 69.9 | 61.7 | Revealing Single Frame Bias for Video-and-Language Learning | - |
| VIOLET | 23.5 | 59.8 | 49.8 | - | - |
| MILES | 27.2 | 63.6 | 50.3 | - | - |
| GRAM | 54.2 | 80.7 | - | Gramian Multimodal Representation Learning and Alignment | - |
| ALPRO | 23.8 | 57.9 | 47.3 | Align and Prompt: Video-and-Language Pre-training with Entity Prompts | - |
| InternVideo | 31.5 | 68.2 | 57.6 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | - |
| VideoCLIP | 16.6 | - | 46.9 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | - |
| FROZEN | 21.1 | 56.2 | 46.0 | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | - |
| Y. Ge et al. | 25.6 | 61.1 | 50.6 | Bridging Video-text Retrieval with Multiple Choice Questions | - |
| HiTeA-5M | 36.1 | 70.3 | 60.1 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| OA-Trans | 23.5 | 59.8 | 50.4 | - | - |