MT-SLVR: Multi-Task Self-Supervised Learning for Transformation In(Variant) Representations
Calum Heggan, Tim Hospedales, Sam Budgett, Mehrdad Yaghoobi

Abstract
Contrastive self-supervised learning has gained attention for its ability to create high-quality representations from large unlabelled data sets. A key reason that these powerful features enable data-efficient learning of downstream tasks is that they provide augmentation invariance, which is often a useful inductive bias. However, the amount and type of invariances preferred is not known a priori, and varies across different downstream tasks. We therefore propose a multi-task self-supervised framework (MT-SLVR) that learns both variant and invariant features in a parameter-efficient manner. Our multi-task representation provides a strong and flexible feature that benefits diverse downstream tasks. We evaluate our approach on few-shot classification tasks drawn from a variety of audio domains and demonstrate improved classification performance on all of them.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| few-shot-audio-classification-on | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 39.11±0.41 |
| few-shot-audio-classification-on | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 37.64±0.40 |
| few-shot-audio-classification-on | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 21.72±0.34 |
| few-shot-audio-classification-on-birdclef | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 29.49±0.38 |
| few-shot-audio-classification-on-birdclef | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 30.93±0.38 |
| few-shot-audio-classification-on-birdclef | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 21.04±0.35 |
| few-shot-audio-classification-on-common-voice | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 23.00±0.42 |
| few-shot-audio-classification-on-common-voice | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 35.22±0.40 |
| few-shot-audio-classification-on-common-voice | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 33.33±0.38 |
| few-shot-audio-classification-on-crema-d | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 21.68±0.33 |
| few-shot-audio-classification-on-crema-d | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 29.61±0.38 |
| few-shot-audio-classification-on-crema-d | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 29.10±0.36 |
| few-shot-audio-classification-on-esc-50 | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 63.40±0.39 |
| few-shot-audio-classification-on-esc-50 | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 69.53±0.39 |
| few-shot-audio-classification-on-esc-50 | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 37.76±0.34 |
| few-shot-audio-classification-on-nsynth | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 71.81±0.39 |
| few-shot-audio-classification-on-nsynth | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 66.44±0.40 |
| few-shot-audio-classification-on-nsynth | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 62.52±0.36 |
| few-shot-audio-classification-on-speech | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 20.08±0.37 |
| few-shot-audio-classification-on-speech | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 25.68±0.35 |
| few-shot-audio-classification-on-speech | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 23.65±0.34 |
| few-shot-audio-classification-on-speech-1 | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 28.92±0.37 |
| few-shot-audio-classification-on-speech-1 | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 23.08±0.34 |
| few-shot-audio-classification-on-speech-1 | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 26.16±0.34 |
| few-shot-audio-classification-on-voxceleb1 | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 31.18±0.37 |
| few-shot-audio-classification-on-voxceleb1 | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 33.58±0.39 |
| few-shot-audio-classification-on-voxceleb1 | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 21.68±0.40 |
| few-shot-audio-classification-on-watkins | Multi-Label Augmentation Prediction (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 28.88±0.39 |
| few-shot-audio-classification-on-watkins | MT-SLVR (SimCLR + MLAP) w/ Parallel Adapters (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 59.49±0.42 |
| few-shot-audio-classification-on-watkins | SimCLR (FSD50K, RN18) | Top-1 Accuracy(5-Way-1-Shot): 52.91±0.41 |
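
The rows above are not grouped consistently, so the comparison between MT-SLVR and its SimCLR baseline is easier to read when tabulated per benchmark. The following is a minimal sketch that does this with the accuracies copied from the table. The `scores` dictionary and the `accuracy_deltas` helper are illustrative names, not part of any released code, and the benchmark keys are the slugs exactly as they appear in the listing (including the one that appears truncated).

```python
# Top-1 accuracy (5-way 1-shot), copied from the benchmark table above.
# Keys are the benchmark slugs as listed; values are (MT-SLVR, SimCLR).
# The names `scores` and `accuracy_deltas` are hypothetical, for illustration only.
scores = {
    "few-shot-audio-classification-on": (39.11, 37.64),
    "few-shot-audio-classification-on-birdclef": (29.49, 30.93),
    "few-shot-audio-classification-on-common-voice": (35.22, 33.33),
    "few-shot-audio-classification-on-crema-d": (29.61, 29.10),
    "few-shot-audio-classification-on-esc-50": (69.53, 63.40),
    "few-shot-audio-classification-on-nsynth": (71.81, 66.44),
    "few-shot-audio-classification-on-speech": (23.65, 25.68),
    "few-shot-audio-classification-on-speech-1": (28.92, 26.16),
    "few-shot-audio-classification-on-voxceleb1": (33.58, 31.18),
    "few-shot-audio-classification-on-watkins": (59.49, 52.91),
}


def accuracy_deltas(table):
    """Return MT-SLVR minus SimCLR accuracy (percentage points) per benchmark."""
    return {name: round(mt - sim, 2) for name, (mt, sim) in table.items()}


deltas = accuracy_deltas(scores)

# Count the benchmarks where the listed MT-SLVR accuracy exceeds SimCLR.
wins = sum(1 for d in deltas.values() if d > 0)

# Print the largest gains first.
for name, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<50} {d:+.2f}")
print(f"MT-SLVR higher on {wins} of {len(deltas)} benchmarks")
```

This comparison uses only the point estimates and ignores the ± margins reported in the table. Some differences, such as the one on crema-d, are of similar size to those margins, so they are best read with that in mind.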