Automated Audio Captioning by Fine-Tuning BART with AudioSet Tags
Christophe Cerisara, Romain Serizel, Félix Gontier

Abstract
Automated audio captioning is the multimodal task of describing environmental audio recordings with fluent natural language. Most current methods utilize pre-trained analysis models to extract relevant semantic content from the audio input. However, prior information on language modeling is rarely introduced, and the corresponding architectures are limited in capacity due to data scarcity. In this paper, we present a method leveraging the linguistic information contained in BART, a large-scale conditional language model with general-purpose pre-training. Caption generation is conditioned on sequences of textual AudioSet tags. This input is enriched with temporally aligned audio embeddings, which allow the model to improve sound event recognition. The full BART architecture is fine-tuned with few additional parameters. Experimental results demonstrate that, beyond the scaling properties of the architecture, language-only pre-training improves the text quality in the multimodal setting of audio captioning. The best model achieves state-of-the-art performance on AudioCaps with 46.5 SPIDEr.
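The conditioning scheme described above can be summarized in a few lines: tokenize the AudioSet tag text, add projected audio embeddings to the token embeddings, and fine-tune BART end to end. The following is a minimal sketch of that idea with Hugging Face Transformers, not the authors' implementation; the 2048-dimensional embedding size (as in PANNs CNN14), the dummy `audio_frames` tensor, and the single `audio_proj` layer are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# "Few additional parameters": a single linear layer projecting audio
# embeddings (assumed 2048-d, as in PANNs CNN14) into BART's hidden size.
audio_proj = nn.Linear(2048, bart.config.d_model)

# Textual AudioSet tags predicted for the clip, joined into one input sequence.
tags = "Speech, Dog, Bark"
enc = tokenizer(tags, return_tensors="pt")

# Token embeddings from BART's shared embedding table.
token_emb = bart.model.shared(enc.input_ids)  # (1, T, d_model)

# Temporally aligned audio embeddings, one per input token. Random values
# stand in for real frame embeddings paired with each tag token by time.
audio_frames = torch.randn(1, enc.input_ids.size(1), 2048)
inputs_embeds = token_emb + audio_proj(audio_frames)

# Fine-tune the full architecture end to end against a reference caption.
labels = tokenizer("a dog barks while a man speaks", return_tensors="pt").input_ids
out = bart(inputs_embeds=inputs_embeds,
           attention_mask=enc.attention_mask,
           labels=labels)
out.loss.backward()
```

The key design choice this illustrates is that the audio signal enters through the encoder's input embeddings rather than through new cross-attention modules, so the pre-trained language weights are reused almost unchanged.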
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-captioning-on-audiocaps | BART + YAMNet + PANNs | CIDEr: 0.753, SPICE: 0.176, SPIDEr: 0.465 |
| retrieval-augmented-few-shot-in-context-audio | Automated audio captioning by fine-tuning BART with AudioSet tags | CIDEr: 0.147 |
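As a quick sanity check on the AudioCaps row, SPIDEr is defined as the arithmetic mean of CIDEr and SPICE, so the three reported metrics are mutually consistent:

```python
# SPIDEr is the arithmetic mean of CIDEr and SPICE; the AudioCaps row
# above is consistent with that definition.
cider, spice = 0.753, 0.176
spider = (cider + spice) / 2
print(f"{spider:.4f}")  # 0.4645, reported as 46.5 SPIDEr after rounding
```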