TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning

Wang Bin ; Wang Wenqian

Abstract

Recently, large-scale pre-trained vision-language models (e.g., CLIP) have garnered significant attention thanks to their powerful representational capabilities. This has inspired researchers to transfer knowledge from these large pre-trained models to task-specific models, e.g., Video Action Recognition (VAR) models, in particular by leveraging side networks to improve the efficiency of parameter-efficient fine-tuning (PEFT). However, current transfer approaches in VAR tend to directly transfer the frozen knowledge from large pre-trained models to action recognition networks at minimal cost, instead of exploiting the temporal modeling capabilities of the action recognition models themselves. Therefore, in this paper, we propose a memory-efficient Temporal Difference Side Network (TDS-CLIP) to balance knowledge transfer and temporal modeling while avoiding backpropagation through the frozen-parameter model. Specifically, we introduce a Temporal Difference Adapter (TD-Adapter), which can effectively capture local temporal differences in motion features to strengthen the model's global temporal modeling capabilities. Furthermore, we design a Side Motion Enhancement Adapter (SME-Adapter) to guide the proposed side network in efficiently learning the rich motion information in videos, thereby improving the side network's ability to capture and learn motion information. Extensive experiments are conducted on three benchmark datasets: Something-Something V1 and V2, and Kinetics-400. Experimental results demonstrate that our approach achieves competitive performance.
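
To make the phrase "local temporal differences" concrete, the following is a minimal sketch in plain Python. It is not the authors' TD-Adapter: the abstract does not specify the adapter's layers, so the function names (temporal_differences, add_motion_residual), the zero-padding of the last frame, and the scale parameter are all assumptions chosen for illustration. It only shows the general idea of subtracting adjacent per-frame features and adding the result back as a motion cue.

```python
from typing import List

# Hypothetical layout: one feature vector (list of floats) per video frame.
Frame = List[float]


def temporal_differences(frames: List[Frame]) -> List[Frame]:
    """Return next-frame minus current-frame features.

    The last frame has no successor, so it gets a zero vector
    (an assumption made here to keep the frame count unchanged).
    """
    if not frames:
        return []
    diffs = [
        [nxt_v - cur_v for cur_v, nxt_v in zip(cur, nxt)]
        for cur, nxt in zip(frames, frames[1:])
    ]
    diffs.append([0.0] * len(frames[0]))
    return diffs


def add_motion_residual(frames: List[Frame], scale: float = 0.5) -> List[Frame]:
    """Add scaled temporal differences back onto the frame features.

    `scale` is an illustrative fixed weight; a real adapter would
    typically learn such a mapping rather than use a constant.
    """
    diffs = temporal_differences(frames)
    return [
        [v + scale * d for v, d in zip(frame, diff)]
        for frame, diff in zip(frames, diffs)
    ]


features = [[1.0, 2.0], [1.5, 2.0], [3.0, 1.0]]
print(temporal_differences(features))
# [[0.5, 0.0], [1.5, -1.0], [0.0, 0.0]]
print(add_motion_residual(features))
# [[1.25, 2.0], [2.25, 1.5], [3.0, 1.0]]
```

In a side-network setting such as the one the abstract describes, features like these would come from the frozen backbone, and only the small side modules would receive gradients.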

Code Repositories

BBYL9413/TDS-CLIP (official, PyTorch)

Benchmarks

Benchmark: action-recognition-in-videos-on-something
Methodology: TDS-CLIP-ViT-L/14 (8 frames)
Top-1 Accuracy: 73.4
Top-5 Accuracy: 93.8

Benchmark: action-recognition-in-videos-on-something-1
Methodology: TDS-CLIP-ViT-L/14 (8 frames)
Top-1 Accuracy: 63.0
Top-5 Accuracy: 87.8
