Contrastive Audio-Visual Masked Autoencoder

Yuan Gong; Andrew Rouditchenko; Alexander H. Liu; David Harwath; Leonid Karlinsky; Hilde Kuehne; James Glass

Abstract

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on AudioSet in the audio-visual event classification task. Code and pretrained models are at https://github.com/yuangongnd/cav-mae.
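The combination of the two objectives described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (the official code is in the repository linked above); it only shows the general idea of adding a masked-reconstruction term and an audio-visual contrastive term. The function names, the one-directional InfoNCE-style form of the contrastive term, and the default `temperature` and `contrastive_weight` values are all assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(audio_emb, video_emb, temperature=0.05):
    """InfoNCE-style loss: the i-th audio clip should match the i-th video clip.

    Assumption: one direction only (audio -> video); other variants exist.
    """
    a = [l2_normalize(v) for v in audio_emb]
    v = [l2_normalize(x) for x in video_emb]
    n = len(a)
    total = 0.0
    for i in range(n):
        # Cosine similarity of audio i with every video in the batch.
        logits = [sum(p * q for p, q in zip(a[i], v[j])) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / n


def masked_reconstruction_loss(target, prediction, masked):
    """Mean squared error averaged over masked patches only."""
    errors = [
        sum((x - y) ** 2 for x, y in zip(t, p)) / len(t)
        for t, p, is_masked in zip(target, prediction, masked)
        if is_masked
    ]
    return sum(errors) / len(errors) if errors else 0.0


def cav_mae_style_loss(audio_emb, video_emb, target, prediction, masked,
                       contrastive_weight=0.01):
    """Weighted sum of both objectives (the weight is an illustrative value)."""
    return (masked_reconstruction_loss(target, prediction, masked)
            + contrastive_weight * contrastive_loss(audio_emb, video_emb))
```

In this sketch the contrastive term is near zero when each audio embedding points in the same direction as its paired video embedding and grows when the pairs are shuffled, while the reconstruction term ignores patches that were not masked.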

Code Repositories

yuangongnd/cav-mae (official, PyTorch): https://github.com/yuangongnd/cav-mae

Benchmarks

Benchmark | Methodology | Metrics
audio-classification-on-audioset | CAV-MAE (Audio-Visual) | Test mAP: 0.512
audio-classification-on-audioset | CAV-MAE (Audio-Only) | Test mAP: 0.466
audio-classification-on-audioset | CAV-MAE (Visual-Only) | Test mAP: 0.262
audio-classification-on-vggsound | CAV-MAE (Audio-Visual) | Top 1 Accuracy: 65.9
audio-classification-on-vggsound | CAV-MAE (Audio-Only) | Top 1 Accuracy: 59.5
audio-tagging-on-audioset | CAV-MAE (Audio-Visual) | mean average precision: 0.512
audio-tagging-on-audioset | CAV-MAE (Audio-Only) | mean average precision: 0.466
multi-modal-classification-on-audioset | CAV-MAE | Average mAP: 0.512
multi-modal-classification-on-vgg-sound | CAV-MAE (Audio-Visual) | Top-1 Accuracy: 65.9
sound-prompted-semantic-segmentation-on | CAVMAE | mAP: 26.0; mIoU: 17.0
speech-prompted-semantic-segmentation-on | CAVMAE | mAP: 27.2; mIoU: 19.9
