Multimodal Pretraining for Dense Video Captioning

Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, Radu Soricut

Abstract

Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.

Benchmarks

Benchmark: dense-video-captioning-on-youcook2
  Methodology: E2vidD6-MASSalign-BiD
  ROUGE-L: 39.03

Benchmark: video-captioning-on-youcook2
  Methodology: E2vidD6-MASSvid-BiD
  BLEU-4: 12.04
  CIDEr: 1.22
  METEOR: 18.32
  ROUGE-L: 39.03
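The ROUGE-L scores above are longest-common-subsequence (LCS) F-measures between generated and reference captions. As a minimal sketch (function names are illustrative; published results use the official ROUGE toolkit, and reported scores are corpus-level percentages):

```python
# Hypothetical sketch of sentence-level ROUGE-L: the F1 of LCS-based
# precision and recall over whitespace tokens.

def lcs_length(ref_tokens, cand_tokens):
    """Length of the longest common subsequence, via dynamic programming."""
    m, n = len(ref_tokens), len(cand_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == cand_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l_f1(reference, candidate):
    """ROUGE-L as the harmonic mean of LCS precision and recall."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("a man slices an onion", "a man cuts an onion"))  # 0.8
```

Here the LCS "a man an onion" (4 of 5 tokens in each sentence) yields precision and recall of 0.8 each, hence F1 = 0.8.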
