5 months ago

Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Bruno Korbar; Du Tran; Lorenzo Torresani

Abstract

There is a natural correlation between the visual and auditory elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients for obtaining powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further fine-tuning, the resulting audio features achieve performance superior or comparable to the state of the art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization for improving the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
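The abstract names a contrastive loss over audio-video pairs but does not give its form. As a rough illustration only, the sketch below uses the widely known margin-based contrastive loss on Euclidean distance: synchronized pairs are pulled together and out-of-sync pairs are pushed at least a margin apart. The function names, the margin value, and the choice of distance are assumptions made for this example, not details taken from the paper.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(video_emb, audio_emb, in_sync, margin=1.0):
    """Margin-based contrastive loss for one audio-video pair.

    in_sync=True  : positive pair, penalize any distance (d^2).
    in_sync=False : negative pair, penalize only if closer than `margin`.
    The margin default of 1.0 is an illustrative assumption.
    """
    d = euclidean_distance(video_emb, audio_emb)
    if in_sync:
        return d ** 2
    return max(margin - d, 0.0) ** 2


def batch_contrastive_loss(pairs, margin=1.0):
    """Mean loss over a list of (video_emb, audio_emb, in_sync) triples."""
    return sum(contrastive_loss(v, a, y, margin) for v, a, y in pairs) / len(pairs)
```

In this formulation, a negative pair already farther apart than the margin contributes zero loss, which is one reason the choice of negatives matters: negatives that are too easy provide no training signal.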

Benchmarks

Benchmark                                        Methodology   Metrics
audio-classification-on-esc-50                   AVTS          Top-1 Accuracy: 82.3
self-supervised-action-recognition-on-hmdb51-1   AVTS          Top-1 Accuracy: 61.6
self-supervised-action-recognition-on-ucf101-1   AVTS          3-fold Accuracy: 89.0
