Abstract

We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relations. We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong, generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound. The code and pretrained models are available on the project website.
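
To make the distinction between 'synchronous' and 'asynchronous' audio-visual pairs concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that a synchronous pair takes the audio and visual segments from the same start time in a clip, while an asynchronous pair draws the two start times independently from the same clip. The function name `sample_segment_starts` and its parameters are hypothetical; the abstract does not specify how CrissCross actually samples segments.

```python
import random


def sample_segment_starts(clip_len, seg_len, rng, synchronous=True):
    """Pick (visual_start, audio_start) in seconds for one audio-visual pair.

    Hypothetical helper for illustration only. A synchronous pair shares
    one start time; an asynchronous pair draws the audio start time
    independently of the visual one, within the same clip.
    """
    if seg_len > clip_len:
        raise ValueError("segment is longer than the clip")
    max_start = clip_len - seg_len
    visual_start = rng.uniform(0.0, max_start)
    if synchronous:
        audio_start = visual_start
    else:
        # Assumption: relax synchronicity by sampling a second start time.
        audio_start = rng.uniform(0.0, max_start)
    return visual_start, audio_start


rng = random.Random(0)
sync_pair = sample_segment_starts(10.0, 2.0, rng, synchronous=True)
async_pair = sample_segment_starts(10.0, 2.0, rng, synchronous=False)
print("synchronous:", sync_pair)
print("asynchronous:", async_pair)
```

In this sketch both kinds of pairs come from the same clip, so the two segments still share content even when they are not aligned in time.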
