VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer

Abstract

We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, which limits their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, which limits early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g. by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g. unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
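The masking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the MMPT code linked above for that); it is one plausible reading in which a training example sometimes has an entire modality masked, so predictions for the hidden modality must rely on the visible one, and otherwise has random positions masked in both. The function name `build_mask`, the scheme names, and the 15% masking rate are all assumptions made for illustration.

```python
import random


def build_mask(n_video, n_text, scheme, rate=0.15, rng=None):
    """Return (video_mask, text_mask) as lists of booleans; True means masked.

    Hypothetical schemes (names are illustrative, not from the paper):
      "mask_text"  - every text token is hidden, video stays visible
      "mask_video" - every video frame is hidden, text stays visible
      "random"     - each position in both modalities is hidden with prob. `rate`
    """
    rng = rng or random.Random()
    if scheme == "mask_text":
        return [False] * n_video, [True] * n_text
    if scheme == "mask_video":
        return [True] * n_video, [False] * n_text
    if scheme == "random":
        video_mask = [rng.random() < rate for _ in range(n_video)]
        text_mask = [rng.random() < rate for _ in range(n_text)]
        return video_mask, text_mask
    raise ValueError(f"unknown masking scheme: {scheme}")


# Example: 4 video frames and 3 text tokens with the whole text side hidden.
video_mask, text_mask = build_mask(4, 3, "mask_text")
```

With the text side fully masked, a single encoder sees only video context at the masked text positions, which is one way to encourage the cross-modal mixing the abstract describes.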

Code Repositories

pytorch/fairseq (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| action-segmentation-on-coin | VLM | Frame accuracy: 68.4 |
| temporal-action-localization-on-crosstask | VLM | Recall: 46.5 |
| video-captioning-on-youcook2 | VLM | BLEU-3: 17.78; BLEU-4: 12.27; CIDEr: 1.3869; METEOR: 18.22; ROUGE-L: 41.51 |
| video-retrieval-on-msr-vtt-1ka | VLM | text-to-video Median Rank: 4; text-to-video R@1: 28.10; text-to-video R@5: 55.50; text-to-video R@10: 67.40 |
| video-retrieval-on-youcook2 | VLM | text-to-video Median Rank: 4; text-to-video R@1: 27.05; text-to-video R@5: 56.88; text-to-video R@10: 69.38 |
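
For readers unfamiliar with the retrieval metrics listed for MSR-VTT and YouCook2, the following sketch shows how text-to-video R@k and Median Rank are commonly computed from a query-by-video similarity matrix. It assumes that the correct video for query `i` sits at index `i` and that ties are resolved in favor of the correct video; the function name `retrieval_metrics` is hypothetical, and the exact evaluation protocol behind the numbers above may differ.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@k (in percent) and median rank for text-to-video retrieval.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct video: 1 plus the number of videos scored strictly higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = median(ranks)
    return metrics


# Toy example with three queries and three videos.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.5],
]
toy_metrics = retrieval_metrics(toy_sim)
```

Higher R@k is better, while a lower Median Rank is better, since it means the correct video typically appears nearer the top of the ranked list.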
