Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models
Rui Qian Yeqing Li Zheng Xu Ming-Hsuan Yang Serge Belongie Yin Cui

Abstract
Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present **MOV**, a simple yet effective method for **M**ultimodal **O**pen-**V**ocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs, with minimal modifications, to encode video, optical flow, and audio spectrograms. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing the flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves accuracy on base classes while generalizing better to novel classes. MOV achieves state-of-the-art results on the UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.
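To make the pipeline described in the abstract concrete, below is a minimal sketch of how a shared VLM vision encoder could encode video frames, optical flow, and audio spectrograms, fuse them with cross-attention, and score the fused embedding against text embeddings of class names for open-vocabulary classification. All module names, shapes, and the specific fusion design here are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of an MOV-style pipeline: one pre-trained VLM vision
# encoder (e.g. a CLIP ViT) encodes video frames, optical flow, and audio
# spectrograms; a cross-modal fusion step aggregates them; classification is
# done by cosine similarity against text embeddings of class names.

class MOVSketch(nn.Module):
    def __init__(self, vision_encoder: nn.Module, embed_dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Shared, minimally modified VLM vision encoder for all three modalities.
        self.vision_encoder = vision_encoder
        # Assumed fusion design: video tokens attend over flow and audio tokens.
        self.fusion = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) frames / flow fields / spectrogram crops -> (B, T, D) features.
        b, t = x.shape[:2]
        feats = self.vision_encoder(x.flatten(0, 1))  # (B*T, D)
        return feats.view(b, t, -1)

    def forward(self, video, flow, audio_spec, text_embeds):
        v = self.encode(video)       # (B, Tv, D)
        f = self.encode(flow)        # (B, Tf, D)
        a = self.encode(audio_spec)  # (B, Ta, D)
        # Cross-modal fusion: video queries, concatenated flow + audio keys/values.
        ctx = torch.cat([f, a], dim=1)
        fused, _ = self.fusion(v, ctx, ctx)
        clip_embed = F.normalize(fused.mean(dim=1), dim=-1)   # (B, D) video-level embedding
        text_embeds = F.normalize(text_embeds, dim=-1)        # (num_classes, D)
        # Open-vocabulary logits: similarity to class-name text embeddings.
        return clip_embed @ text_embeds.t()                   # (B, num_classes)
```

Because the class set is represented only by text embeddings, novel classes can be added at inference time simply by embedding their names with the VLM's text encoder; this is the open-vocabulary property the abstract refers to.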
Benchmarks
| Benchmark | Method | Top-1 Accuracy (%) |
|---|---|---|
| Zero-Shot Action Recognition on HMDB51 | MOV (ViT-B/16) | 60.8 |
| Zero-Shot Action Recognition on HMDB51 | MOV (ViT-L/14) | 64.7 |
| Zero-Shot Action Recognition on UCF101 | MOV (ViT-B/16) | 82.6 |
| Zero-Shot Action Recognition on UCF101 | MOV (ViT-L/14) | 87.1 |