Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar

Abstract
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on the EGTEA classification and 5.9% on the Epic-Kitchens-100 multi-instance retrieval benchmarks. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior with increasing pre-training data and model size.
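To make the contrastive step concrete, the sketch below shows a standard CLIP-style symmetric InfoNCE loss between paired video-clip and narration embeddings. It is a minimal illustration, not LaViLa's actual implementation: the embedding dimension, batch size, and temperature are placeholder values, and the encoders producing the embeddings are assumed to exist elsewhere.

```python
# Minimal sketch of contrastive video-text training with paired
# (clip, narration) embeddings. Values and shapes are illustrative only.
import torch
import torch.nn.functional as F


def contrastive_loss(video_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (video, narration) embeddings."""
    # L2-normalize so the dot product is a cosine similarity.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits: row i should match column i.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the video-to-text and text-to-video cross-entropy terms.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    # Toy batch: 8 clip embeddings paired with 8 narration embeddings (dim 256).
    video_emb = torch.randn(8, 256)
    text_emb = torch.randn(8, 256)
    print(contrastive_loss(video_emb, text_emb).item())
```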
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Action Recognition on Charades-Ego | LaViLa (Zero-shot, TimeSformer-L) | mAP: 28.9 |
| Action Recognition on Charades-Ego | LaViLa (Finetuned, TimeSformer-L) | mAP: 36.1 |
| Action Recognition on EPIC-KITCHENS-100 | LaViLa (TimeSformer-L) | Action@1: 51, Noun@1: 62.9, Verb@1: 72 |
| Egocentric Activity Recognition on EGTEA | LaViLa (Finetuned, TimeSformer-L) | Average accuracy: 81.75, Mean class accuracy: 76 |
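The zero-shot rows above follow the usual dual-encoder evaluation recipe: class names are embedded as text prompts and each clip is assigned to the closest one by cosine similarity. The sketch below illustrates that recipe only; the random "encoders" and the prompt template are hypothetical stand-ins, not the model's or the benchmarks' exact setup.

```python
# Illustrative zero-shot action recognition with a dual-encoder video-text model.
# The encoders below return random features; a real setup would use pretrained
# video and text encoders (e.g., a TimeSformer backbone plus a text transformer).
import torch
import torch.nn.functional as F

EMB_DIM = 256


def fake_video_encoder(clips: torch.Tensor) -> torch.Tensor:
    # Placeholder for a pretrained video encoder.
    return torch.randn(clips.size(0), EMB_DIM)


def fake_text_encoder(prompts: list[str]) -> torch.Tensor:
    # Placeholder for a pretrained text encoder.
    return torch.randn(len(prompts), EMB_DIM)


@torch.no_grad()
def zero_shot_predict(clips: torch.Tensor, class_names: list[str]) -> torch.Tensor:
    """Return a top-1 class index per clip via cosine similarity to class prompts."""
    # Hypothetical prompt template; benchmarks may use different templates.
    prompts = [f"a video of a person {name}" for name in class_names]
    text_emb = F.normalize(fake_text_encoder(prompts), dim=-1)   # (C, D)
    video_emb = F.normalize(fake_video_encoder(clips), dim=-1)   # (N, D)
    scores = video_emb @ text_emb.t()                            # (N, C) similarities
    return scores.argmax(dim=1)


if __name__ == "__main__":
    clips = torch.randn(4, 3, 8, 224, 224)  # 4 clips: channels, frames, H, W
    print(zero_shot_predict(clips, ["opening a door", "washing dishes"]))
```

For ranking-based metrics such as the mAP reported on Charades-Ego, the raw similarity scores would be kept and scored per class instead of taking the argmax.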