Audio Captioning Transformer
Xinhao Mei, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

Abstract
Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free. The proposed method has a better ability to model the global information within an audio signal as well as capture temporal relationships between audio events. We evaluate our model on AudioCaps, which is the largest audio captioning dataset publicly available. Our model shows competitive performance compared to other state-of-the-art approaches.
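The abstract does not include an implementation, but the described architecture, a convolution-free encoder-decoder Transformer that attends over audio time frames and generates caption tokens autoregressively, can be sketched as follows. This is a minimal illustration in PyTorch, assuming log-mel spectrogram input, a linear frame embedding, and arbitrary hyperparameters and vocabulary size; positional embeddings and the authors' exact configuration are omitted, so it is not the published ACT implementation.

```python
# Minimal sketch of a convolution-free encoder-decoder Transformer for audio
# captioning, in the spirit of the ACT described above. Hyperparameters, the
# linear frame embedding, and the vocabulary size are illustrative assumptions;
# positional embeddings are omitted for brevity.
import torch
import torch.nn as nn


class AudioCaptioningTransformerSketch(nn.Module):
    def __init__(self, n_mels=64, d_model=256, nhead=4, num_layers=4, vocab_size=5000):
        super().__init__()
        # Each log-mel time frame is projected to the model dimension with a
        # linear layer, so the model contains no convolutions.
        self.frame_embed = nn.Linear(n_mels, d_model)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel:    (batch, time_frames, n_mels) log-mel spectrogram
        # tokens: (batch, caption_len) word ids of the caption so far
        memory = self.encoder(self.frame_embed(mel))  # global self-attention over all frames
        seq_len = tokens.size(1)
        # Causal mask so each caption position only attends to earlier tokens.
        causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        out = self.decoder(self.token_embed(tokens), memory, tgt_mask=causal_mask)
        return self.output_proj(out)  # (batch, caption_len, vocab_size) word logits


if __name__ == "__main__":
    model = AudioCaptioningTransformerSketch()
    mel = torch.randn(2, 100, 64)             # two clips of 100 time frames
    tokens = torch.randint(0, 5000, (2, 12))  # partial captions of 12 tokens
    print(model(mel, tokens).shape)           # torch.Size([2, 12, 5000])
```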
Benchmarks
| Benchmark | Methodology | CIDEr | SPICE | SPIDEr |
|---|---|---|---|---|
| audio-captioning-on-audiocaps | CNN+Transformer | 0.693 | 0.159 | 0.426 |
| retrieval-augmented-few-shot-in-context-audio | Audio Captioning Transformer | 0.149 | – | – |
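As a quick sanity check on the first row, SPIDEr is defined as the arithmetic mean of the CIDEr and SPICE scores, so the reported value follows directly from the other two metrics:

```python
# SPIDEr is the arithmetic mean of CIDEr and SPICE; the AudioCaps row above
# is consistent with this definition.
cider, spice = 0.693, 0.159
spider = (cider + spice) / 2
print(round(spider, 3))  # 0.426
```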