Heggan Calum ; Budgett Sam ; Hospedales Timothy ; Yaghoobi Mehrdad

Abstract
Currently available benchmarks for few-shot learning (machine learning with few training examples) are limited in the domains they cover, primarily focusing on image classification. This work aims to alleviate this reliance on image-based benchmarks by offering the first comprehensive, public and fully reproducible audio based alternative, covering a variety of sound domains and experimental settings. We compare the few-shot classification performance of a variety of techniques on seven audio datasets (spanning environmental sounds to human-speech). Extending this, we carry out in-depth analyses of joint training (where all datasets are used during training) and cross-dataset adaptation protocols, establishing the possibility of a generalised audio few-shot classification algorithm. Our experimentation shows gradient-based meta-learning methods such as MAML and Meta-Curvature consistently outperform both metric and baseline methods. We also demonstrate that the joint training routine helps overall generalisation for the environmental sound databases included, as well as being a somewhat-effective method of tackling the cross-dataset/domain setting.
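To make the few-shot setting concrete, the following is a minimal sketch of one N-way K-shot episode classified by nearest class prototype, the general idea behind metric methods such as Prototypical Networks. It is not the authors' implementation: the function names, the toy data generator, and the use of raw feature vectors in place of a learned audio embedding are all assumptions made for illustration.

```python
import random


def make_toy_data(n_classes=8, per_class=10, dim=4, seed=0):
    """Hypothetical stand-in for embedded audio clips: one tight cluster per class."""
    rng = random.Random(seed)
    data = {}
    for c in range(n_classes):
        centre = [10.0 * c] * dim
        data[f"class_{c}"] = [
            [m + rng.uniform(-1.0, 1.0) for m in centre] for _ in range(per_class)
        ]
    return data


def sample_episode(data, n_way=5, k_shot=1, n_query=2, rng=random):
    """Draw n_way classes, then k_shot support and n_query query items per class."""
    classes = rng.sample(sorted(data), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        items = rng.sample(data[cls], k_shot + n_query)
        support += [(x, episode_label) for x in items[:k_shot]]
        query += [(x, episode_label) for x in items[k_shot:]]
    return support, query


def prototypes(support):
    """Mean feature vector of the support items of each class."""
    grouped = {}
    for x, y in support:
        grouped.setdefault(y, []).append(x)
    return {y: [sum(col) / len(xs) for col in zip(*xs)] for y, xs in grouped.items()}


def classify(x, protos):
    """Label of the prototype closest to x in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(protos, key=lambda y: sq_dist(x, protos[y]))


def episode_accuracy(support, query):
    """Fraction of query items assigned to their true episode label."""
    protos = prototypes(support)
    correct = sum(classify(x, protos) == y for x, y in query)
    return correct / len(query)


if __name__ == "__main__":
    support, query = sample_episode(make_toy_data(), rng=random.Random(1))
    print(f"5-way 1-shot episode accuracy: {episode_accuracy(support, query):.2f}")
```

With `k_shot=1` each prototype is simply the single support example, which is why 1-shot results depend so heavily on the quality of the embedding.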
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| few-shot-audio-classification-on | Meta-Baseline (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 40.27 ± 0.44 |
| few-shot-audio-classification-on | MAML (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 43.45 ± 0.46 |
| few-shot-audio-classification-on | SimpleShot CL2N (AST ImageNet & AudioSet - No fine-tune) | Top-1 Accuracy (5-Way-1-Shot): 38.78 ± 0.41 |
| few-shot-audio-classification-on | Prototypical Networks (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 39.44 ± 0.44 |
| few-shot-audio-classification-on | Meta-Curvature (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 43.18 ± 0.45 |
| few-shot-audio-classification-on | SimpleShot CL2N (AST ImageNet - No fine-tune) | Top-1 Accuracy (5-Way-1-Shot): 33.52 ± 0.39 |
| few-shot-audio-classification-on | SimpleShot CL2N (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 42.05 ± 0.42 |
| few-shot-audio-classification-on-birdclef | Prototypical Networks (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 56.11 ± 0.46 |
| few-shot-audio-classification-on-birdclef | SimpleShot CL2N (AST ImageNet & AudioSet - No fine-tune) | Top-1 Accuracy (5-Way-1-Shot): 36.41 ± 0.42 |
| few-shot-audio-classification-on-birdclef | SimpleShot CL2N (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 57.66 ± 0.43 |
| few-shot-audio-classification-on-birdclef | SimpleShot CL2N (AST ImageNet - No fine-tune) | Top-1 Accuracy (5-Way-1-Shot): 33.04 ± 0.41 |
| few-shot-audio-classification-on-birdclef | MAML (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 56.26 ± 0.45 |
| few-shot-audio-classification-on-birdclef | Meta-Curvature (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 61.34 ± 0.46 |
| few-shot-audio-classification-on-birdclef | Meta-Baseline (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 57.28 ± 0.41 |
| few-shot-audio-classification-on-esc-50 | SimpleShot CL2N (AST ImageNet - No fine-tune) | Top-1 Accuracy (5-Way-1-Shot): 60.41 ± 0.41 |
| few-shot-audio-classification-on-esc-50 | Prototypical Networks (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 68.83 ± 0.38 |
| few-shot-audio-classification-on-esc-50 | Meta-Curvature (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 76.17 ± 0.41 |
| few-shot-audio-classification-on-esc-50 | SimpleShot CL2N (AST ImageNet & AudioSet - No fine-tune) | Top-1 Accuracy (5-Way-1-Shot): 64.48 ± 0.41 |
| few-shot-audio-classification-on-esc-50 | MAML (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 74.66 ± 0.42 |
| few-shot-audio-classification-on-esc-50 | SimpleShot CL2N (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 68.82 ± 0.39 |
| few-shot-audio-classification-on-esc-50 | Meta-Baseline (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 71.72 ± 0.38 |
| few-shot-audio-classification-on-nsynth | MAML (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 93.85 ± 0.24 |
| few-shot-audio-classification-on-nsynth | SimpleShot CL2N Classifier (AST pre-trained w/ ImageNet - No fine-tune) | Top-1 Accuracy (5-Way-1-Shot): 66.68 ± 0.41 |
| few-shot-audio-classification-on-nsynth | Meta-Baseline (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 90.74 ± 0.25 |
| few-shot-audio-classification-on-nsynth | SimpleShot CL2N Classifier (AST ImageNet & AudioSet - No fine-tune) | Top-1 Accuracy (5-Way-1-Shot): 63.78 ± 0.42 |
| few-shot-audio-classification-on-nsynth | SimpleShot CL2N (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 90.04 ± 0.27 |
| few-shot-audio-classification-on-nsynth | Meta-Curvature (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 96.47 ± 0.19 |
| few-shot-audio-classification-on-nsynth | Prototypical Networks (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 95.23 ± 0.19 |
| few-shot-audio-classification-on-voxceleb1 | Meta-Curvature (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 63.85 ± 0.44 |
| few-shot-audio-classification-on-voxceleb1 | SimpleShot CL2N (AST ImageNet - No fine-tune) | Top-1 Accuracy (5-Way-1-Shot): 28.09 ± 0.37 |
| few-shot-audio-classification-on-voxceleb1 | Prototypical Networks (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 59.64 ± 0.44 |
| few-shot-audio-classification-on-voxceleb1 | Meta-Baseline (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 55.54 ± 0.42 |
| few-shot-audio-classification-on-voxceleb1 | MAML (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 60.89 ± 0.45 |
| few-shot-audio-classification-on-voxceleb1 | SimpleShot CL2N (CRNN) | Top-1 Accuracy (5-Way-1-Shot): 48.50 ± 0.42 |
| few-shot-audio-classification-on-voxceleb1 | SimpleShot CL2N (AST ImageNet & AudioSet - No fine-tune) | Top-1 Accuracy (5-Way-1-Shot): 28.79 ± 0.38 |
| few-shot-audio-classification-on-watkins | SimpleShot CL2N (AST ImageNet & AudioSet - No fine-tune) | Top-1 Accuracy (5-Way-1-Shot): 51.81 ± 0.42 |
| few-shot-audio-classification-on-watkins | SimpleShot CL2N (AST ImageNet - No fine-tune) | Top-1 Accuracy (5-Way-1-Shot): 55.40 ± 0.42 |
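Each table entry is a mean accuracy followed by a spread. This listing does not say what the spread denotes; a common convention in few-shot benchmarks is a 95% confidence interval over many test episodes, and the sketch below assumes that reading. The function names and the normal-approximation constant 1.96 are illustrative choices, not taken from the paper.

```python
import math
import statistics


def mean_and_ci95(accuracies):
    """Mean and half-width of a normal-approximation 95% interval.

    `accuracies` holds one accuracy in [0, 1] per test episode.
    Assumption: the reported spread is 1.96 * standard error of the mean.
    """
    n = len(accuracies)
    mean = statistics.fmean(accuracies)
    std = statistics.stdev(accuracies)  # sample standard deviation (n - 1)
    half_width = 1.96 * std / math.sqrt(n)
    return mean, half_width


def format_result(accuracies):
    """Render episode accuracies as a percentage in the table's 'mean ± spread' style."""
    mean, half_width = mean_and_ci95(accuracies)
    return f"{100 * mean:.2f} ± {100 * half_width:.2f}"


if __name__ == "__main__":
    # Four made-up episode accuracies, purely to show the arithmetic.
    print(format_result([0.4, 0.6, 0.4, 0.6]))
```

Because the half-width shrinks with the square root of the number of episodes, spreads as small as those in the table imply evaluation over a large number of episodes under this assumption.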