Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models
Rui Qian Yeqing Li Zheng Xu Ming-Hsuan Yang Serge Belongie Yin Cui

Abstract
Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present **MOV**, a simple yet effective method for **M**ultimodal **O**pen-**V**ocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs, with minimal modifications, to encode video, optical flow, and audio spectrograms. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing the flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves accuracy on base classes while generalizing better to novel classes. MOV achieves state-of-the-art results on the UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.
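To make the pipeline described in the abstract concrete, below is a minimal sketch of how a shared VLM vision encoder could encode video frames, optical flow, and audio spectrograms, fuse them with cross-attention, and score the fused embedding against text embeddings of class names for open-vocabulary classification. All module names, shapes, and the specific fusion design here are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of an MOV-style pipeline: one pre-trained VLM vision
# encoder (e.g. a CLIP ViT) encodes video frames, optical flow, and audio
# spectrograms; a cross-modal fusion step aggregates them; classification is
# done by cosine similarity against text embeddings of class names.

class MOVSketch(nn.Module):
    def __init__(self, vision_encoder: nn.Module, embed_dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Shared, minimally modified VLM vision encoder for all three modalities.
        self.vision_encoder = vision_encoder
        # Assumed fusion design: video tokens attend over flow and audio tokens.
        self.fusion = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) frames / flow fields / spectrogram crops -> (B, T, D) features.
        b, t = x.shape[:2]
        feats = self.vision_encoder(x.flatten(0, 1))  # (B*T, D)
        return feats.view(b, t, -1)

    def forward(self, video, flow, audio_spec, text_embeds):
        v = self.encode(video)       # (B, Tv, D)
        f = self.encode(flow)        # (B, Tf, D)
        a = self.encode(audio_spec)  # (B, Ta, D)
        # Cross-modal fusion: video queries, concatenated flow + audio keys/values.
        ctx = torch.cat([f, a], dim=1)
        fused, _ = self.fusion(v, ctx, ctx)
        clip_embed = F.normalize(fused.mean(dim=1), dim=-1)   # (B, D) video-level embedding
        text_embeds = F.normalize(text_embeds, dim=-1)        # (num_classes, D)
        # Open-vocabulary logits: similarity to class-name text embeddings.
        return clip_embed @ text_embeds.t()                   # (B, num_classes)
```

Because the class set is represented only by text embeddings, novel classes can be added at inference time simply by embedding their names with the VLM's text encoder; this is the open-vocabulary property the abstract refers to.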
Benchmarks
| Benchmark | Method | Top-1 Accuracy (%) |
|---|---|---|
| Zero-Shot Action Recognition on HMDB51 | MOV (ViT-B/16) | 60.8 |
| Zero-Shot Action Recognition on HMDB51 | MOV (ViT-L/14) | 64.7 |
| Zero-Shot Action Recognition on UCF101 | MOV (ViT-B/16) | 82.6 |
| Zero-Shot Action Recognition on UCF101 | MOV (ViT-L/14) | 87.1 |