
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Wenhao Wu; Xiaohan Wang; Haipeng Luo; Jingdong Wang; Yi Yang; Wanli Ouyang


Abstract

Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representation. Extensive studies on six popular video datasets, including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show that our method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot, and few-shot video recognition. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE.
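The Temporal Concept Spotting mechanism described above can be illustrated with a minimal sketch: per-frame visual embeddings are scored against a category text embedding, and the resulting similarities are turned into a softmax weighting over frames, so no learned parameters are involved in the pooling. The function names, shapes, and temperature value below are illustrative assumptions, not taken from the official BIKE code.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize vectors to unit length, as is standard for CLIP-style features.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def temporal_saliency_pool(frame_feats, text_feat, temperature=0.07):
    """Hypothetical parameter-free temporal pooling in the spirit of BIKE's
    Temporal Concept Spotting: weight each frame by its similarity to a
    category text embedding, then pool.

    frame_feats: (T, D) per-frame visual embeddings
    text_feat:   (D,)   category text embedding
    """
    f = l2_normalize(frame_feats)
    t = l2_normalize(text_feat)
    sims = f @ t                       # (T,) cosine similarity per frame
    weights = np.exp(sims / temperature)
    weights /= weights.sum()           # softmax over time; no learned weights
    return weights @ frame_feats       # saliency-weighted video representation

# Toy usage: 8 frames of 512-d CLIP-like features and one text embedding.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 512))
text = rng.normal(size=512)
video_feat = temporal_saliency_pool(frames, text)
print(video_feat.shape)  # (512,)
```

Because the weights form a convex combination, frames whose content matches the category description dominate the pooled representation, which is the intended "temporal saliency" effect.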

Code Repositories

whwu95/Cap4Video (PyTorch)
whwu95/text4vis (PyTorch)
whwu95/BIKE (Official, PyTorch)
whwu95/GPT4Vis
whwu95/ATM (PyTorch)

Benchmarks

Benchmark                                    | Methodology          | Metrics
Action Classification on Charades            | BIKE                 | mAP: 50.7
Action Classification on Kinetics-400        | BIKE (CLIP ViT-L/14) | Acc@1: 88.7, Acc@5: 98.4
Action Recognition in Videos on ActivityNet  | BIKE                 | mAP: 96.1
Action Recognition in Videos on HMDB-51      | BIKE                 | Average accuracy of 3 splits: 83.1
Action Recognition in Videos on UCF101       | BIKE                 | 3-fold Accuracy: 98.8
Zero-Shot Action Recognition on ActivityNet  | BIKE                 | Top-1 Accuracy: 86.2
Zero-Shot Action Recognition on HMDB51       | BIKE                 | Top-1 Accuracy: 61.4
Zero-Shot Action Recognition on Kinetics     | BIKE                 | Top-1 Accuracy: 68.5, Top-5 Accuracy: 91.1
Zero-Shot Action Recognition on UCF101       | BIKE                 | Top-1 Accuracy: 86.6
