Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
Pritam Sarkar; Ali Etemad

Abstract
We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning the intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relationships. We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on several downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound. The code and pretrained models are available on the project website.
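As a rough illustration of what relaxing temporal synchronicity could mean at the sampling stage, the sketch below draws a video window and an audio window from the same recording and allows the audio window to drift by a bounded offset. It is not the authors' implementation: the function name `sample_clip_pair` and the parameters `duration`, `clip_len`, and `max_offset` are hypothetical, and uniform sampling is an assumption made only for this example.

```python
import random


def sample_clip_pair(duration, clip_len, max_offset, rng=None):
    """Return (video_start, audio_start) in seconds for one training pair.

    Hypothetical helper for illustration only. With max_offset = 0 the two
    windows coincide (a synchronous pair); a larger max_offset lets the
    audio window drift away from the video window (an asynchronous pair).
    """
    if clip_len > duration:
        raise ValueError("clip_len must not exceed duration")
    if max_offset < 0:
        raise ValueError("max_offset must be non-negative")
    if rng is None:
        rng = random.Random()

    last_start = duration - clip_len  # latest valid start time for a window
    video_start = rng.uniform(0.0, last_start)

    # Keep the audio window inside the recording while allowing it to sit
    # up to max_offset seconds before or after the video window.
    low = max(0.0, video_start - max_offset)
    high = min(last_start, video_start + max_offset)
    audio_start = rng.uniform(low, high)
    return video_start, audio_start
```

Calling `sample_clip_pair(10.0, 2.0, 0.0)` yields identical start times for both modalities, while a positive `max_offset` yields pairs whose windows may only partially overlap or not overlap at all.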
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-classification-on-dcase | CrissCross (Kinetics-400) | Pre-Training Dataset: Kinetics-400; Top-1 Accuracy: 96 |
| audio-classification-on-dcase | CrissCross (AudioSet) | Pre-Training Dataset: AudioSet; Top-1 Accuracy: 97 |
| audio-classification-on-dcase | CrissCross (Kinetics-Sound) | Pre-Training Dataset: Kinetics-Sound; Top-1 Accuracy: 93 |
| self-supervised-action-recognition-on-hmdb51 | CrissCross (AudioSet) | Frozen: false; Pre-Training Dataset: AudioSet; Top-1 Accuracy: 66.8 |
| self-supervised-action-recognition-on-hmdb51 | CrissCross (Kinetics400) | Frozen: false; Pre-Training Dataset: Kinetics400; Top-1 Accuracy: 64.7 |
| self-supervised-action-recognition-on-hmdb51 | CrissCross (Kinetics-Sound) | Frozen: false; Pre-Training Dataset: Kinetics-Sound; Top-1 Accuracy: 60.5 |
| self-supervised-action-recognition-on-ucf101 | CrissCross (Kinetics400) | Frozen: false; Pre-Training Dataset: Kinetics400; 3-fold Accuracy: 91.5 |
| self-supervised-action-recognition-on-ucf101 | CrissCross (Kinetics-Sound) | Frozen: false; Pre-Training Dataset: Kinetics-Sound; 3-fold Accuracy: 88.3 |
| self-supervised-action-recognition-on-ucf101 | CrissCross (AudioSet) | Frozen: false; Pre-Training Dataset: AudioSet; 3-fold Accuracy: 92.4 |