
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Wenhao Wu; Xiaohan Wang; Haipeng Luo; Jingdong Wang; Yi Yang; Wanli Ouyang


Abstract

Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representation. Extensive studies on six popular video datasets, including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show that our method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot, and few-shot video recognition. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE.
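The Temporal Concept Spotting mechanism described above can be illustrated with a minimal sketch: per-frame visual embeddings are scored against a category text embedding, and the resulting similarities are turned into a softmax weighting over frames, so no learned parameters are involved in the pooling. The function names, shapes, and temperature value below are illustrative assumptions, not taken from the official BIKE code.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize vectors to unit length, as is standard for CLIP-style features.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def temporal_saliency_pool(frame_feats, text_feat, temperature=0.07):
    """Hypothetical parameter-free temporal pooling in the spirit of BIKE's
    Temporal Concept Spotting: weight each frame by its similarity to a
    category text embedding, then pool.

    frame_feats: (T, D) per-frame visual embeddings
    text_feat:   (D,)   category text embedding
    """
    f = l2_normalize(frame_feats)
    t = l2_normalize(text_feat)
    sims = f @ t                       # (T,) cosine similarity per frame
    weights = np.exp(sims / temperature)
    weights /= weights.sum()           # softmax over time; no learned weights
    return weights @ frame_feats       # saliency-weighted video representation

# Toy usage: 8 frames of 512-d CLIP-like features and one text embedding.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 512))
text = rng.normal(size=512)
video_feat = temporal_saliency_pool(frames, text)
print(video_feat.shape)  # (512,)
```

Because the weights form a convex combination, frames whose content matches the category description dominate the pooled representation, which is the intended "temporal saliency" effect.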

Code Repositories

whwu95/Cap4Video (PyTorch)
whwu95/text4vis (PyTorch)
whwu95/BIKE (Official, PyTorch)
whwu95/GPT4Vis
whwu95/ATM (PyTorch)

Benchmarks

Benchmark                                    | Methodology          | Metrics
Action Classification on Charades            | BIKE                 | mAP: 50.7
Action Classification on Kinetics-400        | BIKE (CLIP ViT-L/14) | Acc@1: 88.7, Acc@5: 98.4
Action Recognition in Videos on ActivityNet  | BIKE                 | mAP: 96.1
Action Recognition in Videos on HMDB-51      | BIKE                 | Average accuracy of 3 splits: 83.1
Action Recognition in Videos on UCF101       | BIKE                 | 3-fold Accuracy: 98.8
Zero-Shot Action Recognition on ActivityNet  | BIKE                 | Top-1 Accuracy: 86.2
Zero-Shot Action Recognition on HMDB51       | BIKE                 | Top-1 Accuracy: 61.4
Zero-Shot Action Recognition on Kinetics     | BIKE                 | Top-1 Accuracy: 68.5, Top-5 Accuracy: 91.1
Zero-Shot Action Recognition on UCF101       | BIKE                 | Top-1 Accuracy: 86.6
