Automated Audio Captioning by Fine-Tuning BART with AudioSet Tags
Christophe Cerisara, Romain Serizel, Félix Gontier

Abstract
Automated audio captioning is the multimodal task of describing environmental audio recordings with fluent natural language. Most current methods utilize pre-trained analysis models to extract relevant semantic content from the audio input. However, prior information on language modeling is rarely introduced, and the corresponding architectures are limited in capacity due to data scarcity. In this paper, we present a method leveraging the linguistic information contained in BART, a large-scale conditional language model with general-purpose pre-training. Caption generation is conditioned on sequences of textual AudioSet tags. This input is enriched with temporally aligned audio embeddings, which allow the model to improve sound event recognition. The full BART architecture is fine-tuned with few additional parameters. Experimental results demonstrate that, beyond the scaling properties of the architecture, language-only pre-training improves the text quality in the multimodal setting of audio captioning. The best model achieves state-of-the-art performance on AudioCaps with 46.5 SPIDEr.
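The conditioning scheme described above can be summarized in a few lines: tokenize the AudioSet tag text, add projected audio embeddings to the token embeddings, and fine-tune BART end to end. The following is a minimal sketch of that idea with Hugging Face Transformers, not the authors' implementation; the 2048-dimensional embedding size (as in PANNs CNN14), the dummy `audio_frames` tensor, and the single `audio_proj` layer are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# "Few additional parameters": a single linear layer projecting audio
# embeddings (assumed 2048-d, as in PANNs CNN14) into BART's hidden size.
audio_proj = nn.Linear(2048, bart.config.d_model)

# Textual AudioSet tags predicted for the clip, joined into one input sequence.
tags = "Speech, Dog, Bark"
enc = tokenizer(tags, return_tensors="pt")

# Token embeddings from BART's shared embedding table.
token_emb = bart.model.shared(enc.input_ids)  # (1, T, d_model)

# Temporally aligned audio embeddings, one per input token. Random values
# stand in for real frame embeddings paired with each tag token by time.
audio_frames = torch.randn(1, enc.input_ids.size(1), 2048)
inputs_embeds = token_emb + audio_proj(audio_frames)

# Fine-tune the full architecture end to end against a reference caption.
labels = tokenizer("a dog barks while a man speaks", return_tensors="pt").input_ids
out = bart(inputs_embeds=inputs_embeds,
           attention_mask=enc.attention_mask,
           labels=labels)
out.loss.backward()
```

The key design choice this illustrates is that the audio signal enters through the encoder's input embeddings rather than through new cross-attention modules, so the pre-trained language weights are reused almost unchanged.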
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-captioning-on-audiocaps | BART + YAMNet + PANNs | CIDEr: 0.753, SPICE: 0.176, SPIDEr: 0.465 |
| retrieval-augmented-few-shot-in-context-audio | Automated audio captioning by fine-tuning BART with AudioSet tags | CIDEr: 0.147 |
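As a quick sanity check on the AudioCaps row, SPIDEr is defined as the arithmetic mean of CIDEr and SPICE, so the three reported metrics are mutually consistent:

```python
# SPIDEr is the arithmetic mean of CIDEr and SPICE; the AudioCaps row
# above is consistent with that definition.
cider, spice = 0.753, 0.176
spider = (cider + spice) / 2
print(f"{spider:.4f}")  # 0.4645, reported as 46.5 SPIDEr after rounding
```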