3 months ago

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi

Abstract

As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets a new state of the art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600, outperforming prior work by 5%, 7%, and 1.5%, respectively. Ablations show that these tasks benefit from audio pretraining, even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
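The "choose the correct masked-out snippet" objective can be illustrated with a small sketch. The abstract does not spell out the loss, so everything below is an assumption made for illustration: the contrastive softmax formulation, the `temperature` value, and the names `l2_normalize` and `choose_snippet_loss` are hypothetical and are not taken from the authors' code. Each row of `mask_preds` stands for the model's output at one MASK position, and row `i` of `candidates` stands for the encoded snippet that was actually masked there.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def choose_snippet_loss(mask_preds, candidates, temperature=0.05):
    """Hypothetical sketch of a 'pick the masked snippet' loss.

    mask_preds: (n, d) outputs at the n MASK positions.
    candidates: (n, d) encoded snippets; row i is the true match for MASK i,
                and the other rows act as distractors.
    Returns the mean cross-entropy loss and the index chosen for each MASK.
    """
    p = l2_normalize(mask_preds)
    c = l2_normalize(candidates)
    # Cosine similarity between every MASK output and every candidate snippet.
    logits = p @ c.T / temperature
    # Numerically stable log-softmax over the candidates.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(logits.shape[0])
    loss = -log_probs[idx, idx].mean()
    picks = logits.argmax(axis=1)
    return loss, picks


# Toy usage: predictions are noisy copies of the true snippet encodings.
rng = np.random.default_rng(0)
snippets = rng.normal(size=(4, 16))
preds = snippets + 0.01 * rng.normal(size=(4, 16))
loss, picks = choose_snippet_loss(preds, snippets)
print(loss, picks)
```

In this sketch the loss falls as each MASK output moves closer to its own snippet and away from the other snippets in the batch, which is one common way to turn "choosing the correct snippet" into a trainable signal.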

Benchmarks

Benchmark | Methodology | Metrics
action-classification-on-kinetics-600- | | Top-1 Accuracy: 89.7; Top-5 Accuracy: 96.6
action-classification-on-kinetics-600- | | Top-1 Accuracy: 91.1; Top-5 Accuracy: 97.1
action-classification-on-kinetics-600- | | Top-1 Accuracy: 89.4; Top-5 Accuracy: 96.3
action-classification-on-kinetics-600- | | Top-1 Accuracy: 88.1; Top-5 Accuracy: 95.8
