VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
Shen Yan Tao Zhu Zirui Wang Yuan Cao Mi Zhang Soham Ghosh Yonghui Wu Jiahui Yu

Abstract
We explore an efficient approach to establishing a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question answering and video captioning.
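The idea of applying attentional pooling to flattened frame embeddings can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes single-head attention without learned projections, random stand-ins for the learned queries and frame tokens, and illustrative sizes. The helper names `flatten_frames` and `attentional_pool` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def flatten_frames(frame_tokens):
    # (T frames, N tokens per frame, D dims) -> (T * N, D):
    # all frame tokens are concatenated into one long sequence.
    t, n, d = frame_tokens.shape
    return frame_tokens.reshape(t * n, d)


def attentional_pool(tokens, queries):
    # Simplified single-head cross-attention (assumption: no learned
    # key/value/output projections). queries: (Q, D), tokens: (L, D).
    d = tokens.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)   # (Q, L)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ tokens                    # (Q, D)


rng = np.random.default_rng(0)
T, N, D = 8, 16, 32                            # illustrative sizes only
frame_tokens = rng.normal(size=(T, N, D))      # stand-in for per-frame encoder outputs
video_tokens = flatten_frames(frame_tokens)    # (128, 32)

contrastive_query = rng.normal(size=(1, D))    # one query -> one video embedding
generative_queries = rng.normal(size=(4, D))   # several queries -> tokens for a decoder

video_embedding = attentional_pool(video_tokens, contrastive_query)
caption_memory = attentional_pool(video_tokens, generative_queries)
print(video_embedding.shape, caption_memory.shape)
```

Because the poolers attend over a sequence of any length, the same query-based pooling works whether the tokens come from one image or from many flattened frames, which is the property the abstract relies on.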
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| video-captioning-on-activitynet-captions | VideoCoCa | BLEU-4: 14.7 CIDEr: 39.3 ROUGE-L: 35.0 |
| video-captioning-on-msr-vtt-1 | VideoCoCa | BLEU-4: 53.8 CIDEr: 73.2 ROUGE-L: 68.0 |
| video-captioning-on-vatex-1 | VideoCoCa | BLEU-4: 39.7 CIDEr: 77.8 ROUGE-L: 54.5 |
| video-captioning-on-youcook2 | VideoCoCa | BLEU-4: 14.2 CIDEr: 1.28 ROUGE-L: 37.7 |
| video-question-answering-on-activitynet-qa | VideoCoCa | Accuracy: 56.1 |
| video-question-answering-on-ivqa | VideoCoCa | Accuracy: 39.0 |
| video-retrieval-on-msr-vtt | VideoCoCa (zero-shot) | text-to-video R@1: 34.3 text-to-video R@5: 57.8 text-to-video R@10: 67.0 video-to-text R@1: 64.7 video-to-text R@5: 85.2 video-to-text R@10: 91.4 |
| video-retrieval-on-youcook2 | VideoCoCa (zero-shot) | text-to-video R@1: 21.7 text-to-video R@5: 43.9 text-to-video R@10: 55.2 |
| visual-question-answering-on-msrvtt-qa-1 | VideoCoCa | Accuracy: 0.463 |
| visual-question-answering-on-msvd-qa-1 | VideoCoCa | Accuracy: 0.569 |
| zero-shot-action-recognition-on-charades-1 | VideoCoCa | mAP: 25.8 |
| zero-shot-action-recognition-on-hmdb51 | VideoCoCa | Top-1 Accuracy: 58.7 Top-5 Accuracy: 84.5 |
| zero-shot-action-recognition-on-kinetics | VideoCoCa | Top-1 Accuracy: 70.1 Top-5 Accuracy: 88.9 |
| zero-shot-action-recognition-on-ucf101 | VideoCoCa | Top-1 Accuracy: 86.6 Top-5 Accuracy: 98.4 |
| zero-shot-video-retrieval-on-activitynet | VideoCoCa | text-to-video R@1: 34.5 text-to-video R@5: 63.2 text-to-video R@10: 76.6 video-to-text R@1: 33.0 video-to-text R@5: 61.6 video-to-text R@10: 75.3 |
| zero-shot-video-retrieval-on-msr-vtt-full | VideoCoCa | text-to-video R@1: 34.3 text-to-video R@5: 57.8 text-to-video R@10: 67.0 video-to-text R@1: 64.7 video-to-text R@5: 85.2 video-to-text R@10: 91.4 |
| zero-shot-video-retrieval-on-vatex | VideoCoCa | text-to-video R@1: 53.2 text-to-video R@5: 83.3 text-to-video R@10: 90.1 video-to-text R@1: 73.6 video-to-text R@5: 93.2 video-to-text R@10: 97.2 |
| zero-shot-video-retrieval-on-youcook2 | VideoCoCa | text-to-video R@1: 20.3 text-to-video R@5: 43.0 text-to-video R@10: 53.3 |