Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar

Abstract
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on the EGTEA classification and 5.9% on the Epic-Kitchens-100 multi-instance retrieval benchmarks. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior with increasing pre-training data and model size.
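To make the contrastive step concrete, the sketch below shows a standard CLIP-style symmetric InfoNCE loss between paired video-clip and narration embeddings. It is a minimal illustration, not LaViLa's actual implementation: the embedding dimension, batch size, and temperature are placeholder values, and the encoders producing the embeddings are assumed to exist elsewhere.

```python
# Minimal sketch of contrastive video-text training with paired
# (clip, narration) embeddings. Values and shapes are illustrative only.
import torch
import torch.nn.functional as F


def contrastive_loss(video_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (video, narration) embeddings."""
    # L2-normalize so the dot product is a cosine similarity.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits: row i should match column i.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the video-to-text and text-to-video cross-entropy terms.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    # Toy batch: 8 clip embeddings paired with 8 narration embeddings (dim 256).
    video_emb = torch.randn(8, 256)
    text_emb = torch.randn(8, 256)
    print(contrastive_loss(video_emb, text_emb).item())
```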
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Action Recognition on Charades-Ego | LaViLa (Zero-shot, TimeSformer-L) | mAP: 28.9 |
| Action Recognition on Charades-Ego | LaViLa (Finetuned, TimeSformer-L) | mAP: 36.1 |
| Action Recognition on EPIC-KITCHENS-100 | LaViLa (TimeSformer-L) | Action@1: 51, Noun@1: 62.9, Verb@1: 72 |
| Egocentric Activity Recognition on EGTEA | LaViLa (Finetuned, TimeSformer-L) | Average accuracy: 81.75, Mean class accuracy: 76 |
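The zero-shot rows above follow the usual dual-encoder evaluation recipe: class names are embedded as text prompts and each clip is assigned to the closest one by cosine similarity. The sketch below illustrates that recipe only; the random "encoders" and the prompt template are hypothetical stand-ins, not the model's or the benchmarks' exact setup.

```python
# Illustrative zero-shot action recognition with a dual-encoder video-text model.
# The encoders below return random features; a real setup would use pretrained
# video and text encoders (e.g., a TimeSformer backbone plus a text transformer).
import torch
import torch.nn.functional as F

EMB_DIM = 256


def fake_video_encoder(clips: torch.Tensor) -> torch.Tensor:
    # Placeholder for a pretrained video encoder.
    return torch.randn(clips.size(0), EMB_DIM)


def fake_text_encoder(prompts: list[str]) -> torch.Tensor:
    # Placeholder for a pretrained text encoder.
    return torch.randn(len(prompts), EMB_DIM)


@torch.no_grad()
def zero_shot_predict(clips: torch.Tensor, class_names: list[str]) -> torch.Tensor:
    """Return a top-1 class index per clip via cosine similarity to class prompts."""
    # Hypothetical prompt template; benchmarks may use different templates.
    prompts = [f"a video of a person {name}" for name in class_names]
    text_emb = F.normalize(fake_text_encoder(prompts), dim=-1)   # (C, D)
    video_emb = F.normalize(fake_video_encoder(clips), dim=-1)   # (N, D)
    scores = video_emb @ text_emb.t()                            # (N, C) similarities
    return scores.argmax(dim=1)


if __name__ == "__main__":
    clips = torch.randn(4, 3, 8, 224, 224)  # 4 clips: channels, frames, H, W
    print(zero_shot_predict(clips, ["opening a door", "washing dishes"]))
```

For ranking-based metrics such as the mAP reported on Charades-Ego, the raw similarity scores would be kept and scored per class instead of taking the argmax.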