Self-Supervised MultiModal Versatile Networks

Jean-Baptiste Alayrac; Adrià Recasens; Rosalia Schneider; Relja Arandjelović; Jason Ramapuram; Jeffrey De Fauw; Lucas Smaira; Sander Dieleman; Andrew Zisserman

Abstract

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available.
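The abstract does not name the training objective used to place the modalities in a common embedding. As a minimal sketch only, assuming a symmetric InfoNCE-style contrastive loss (a common choice for this kind of alignment, not confirmed by the text above), the idea can be illustrated with NumPy. The function names `l2_normalize` and `info_nce`, the temperature value, and the toy one-hot embeddings are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss for paired embeddings a, b of shape [N, D].

    Row i of `a` and row i of `b` are treated as a positive pair
    (e.g. the video and audio embeddings of the same clip); all other
    rows in the batch act as negatives. This is an assumed objective,
    not one stated in the abstract.
    """
    a = l2_normalize(a)
    b = l2_normalize(b)
    logits = a @ b.T / temperature  # [N, N] scaled cosine similarities

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the a->b and b->a retrieval directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

# Toy embeddings: four "clips" with one-hot video features.
video = np.eye(4)
audio_aligned = np.eye(4)                  # each audio matches its video
audio_shuffled = np.eye(4)[[1, 2, 3, 0]]   # each audio matches a different video

loss_aligned = info_nce(video, audio_aligned)
loss_shuffled = info_nce(video, audio_shuffled)
print(loss_aligned, loss_shuffled)
```

In this toy setup the loss is near zero when paired embeddings coincide and large when the pairing is shuffled, which is the behaviour a shared embedding space is meant to encourage.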

Benchmarks

Benchmark                                        Methodology    Metrics
audio-classification-on-audioset                 MMV            Test mAP: 0.309
self-supervised-action-recognition-on            MMV            Top-1 Accuracy: 55.5
self-supervised-action-recognition-on-hmdb51-1   MMV            Top-1 Accuracy: 70.1
self-supervised-action-recognition-on-ucf101     MMV TSM-50x2   3-fold Accuracy: 95.2; Frozen: false; Pre-Training Dataset: Audioset + Howto100M
self-supervised-action-recognition-on-ucf101-1   MMV            3-fold Accuracy: 91.5
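
The page does not define how the accuracy metrics are computed. As a rough sketch, assuming "Top-1 Accuracy" is the percentage of samples whose highest-scoring class matches the label and "3-fold Accuracy" is the mean of that value over three train/test splits, the calculation could look like the following. The helper names and the toy predictions are hypothetical and unrelated to the reported numbers.

```python
import numpy as np

def top1_accuracy(scores, labels):
    """Percentage of rows whose arg-max class equals the true label."""
    preds = np.argmax(scores, axis=1)
    return 100.0 * float(np.mean(preds == labels))

def one_hot(indices, num_classes):
    """Turn class indices into one-hot score rows (toy stand-in for model outputs)."""
    return np.eye(num_classes)[indices]

# Toy data: the same 4 labels evaluated on three hypothetical splits.
labels = np.array([0, 1, 2, 1])
fold_predictions = [
    np.array([0, 1, 2, 1]),  # 4 of 4 correct
    np.array([0, 1, 2, 0]),  # 3 of 4 correct
    np.array([0, 2, 1, 1]),  # 2 of 4 correct
]

fold_accuracies = [top1_accuracy(one_hot(p, 3), labels) for p in fold_predictions]
three_fold_accuracy = float(np.mean(fold_accuracies))
print(fold_accuracies, three_fold_accuracy)
```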