Audio Captioning Transformer
Xinhao Mei, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

Abstract
Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free. The proposed method has a better ability to model the global information within an audio signal as well as capture temporal relationships between audio events. We evaluate our model on AudioCaps, which is the largest audio captioning dataset publicly available. Our model shows competitive performance compared to other state-of-the-art approaches.
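The abstract does not include an implementation, but the described architecture, a convolution-free encoder-decoder Transformer that attends over audio time frames and generates caption tokens autoregressively, can be sketched as follows. This is a minimal illustration in PyTorch, assuming log-mel spectrogram input, a linear frame embedding, and arbitrary hyperparameters and vocabulary size; positional embeddings and the authors' exact configuration are omitted, so it is not the published ACT implementation.

```python
# Minimal sketch of a convolution-free encoder-decoder Transformer for audio
# captioning, in the spirit of the ACT described above. Hyperparameters, the
# linear frame embedding, and the vocabulary size are illustrative assumptions;
# positional embeddings are omitted for brevity.
import torch
import torch.nn as nn


class AudioCaptioningTransformerSketch(nn.Module):
    def __init__(self, n_mels=64, d_model=256, nhead=4, num_layers=4, vocab_size=5000):
        super().__init__()
        # Each log-mel time frame is projected to the model dimension with a
        # linear layer, so the model contains no convolutions.
        self.frame_embed = nn.Linear(n_mels, d_model)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel:    (batch, time_frames, n_mels) log-mel spectrogram
        # tokens: (batch, caption_len) word ids of the caption so far
        memory = self.encoder(self.frame_embed(mel))  # global self-attention over all frames
        seq_len = tokens.size(1)
        # Causal mask so each caption position only attends to earlier tokens.
        causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        out = self.decoder(self.token_embed(tokens), memory, tgt_mask=causal_mask)
        return self.output_proj(out)  # (batch, caption_len, vocab_size) word logits


if __name__ == "__main__":
    model = AudioCaptioningTransformerSketch()
    mel = torch.randn(2, 100, 64)             # two clips of 100 time frames
    tokens = torch.randint(0, 5000, (2, 12))  # partial captions of 12 tokens
    print(model(mel, tokens).shape)           # torch.Size([2, 12, 5000])
```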
Benchmarks
| Benchmark | Methodology | CIDEr | SPICE | SPIDEr |
|---|---|---|---|---|
| audio-captioning-on-audiocaps | CNN+Transformer | 0.693 | 0.159 | 0.426 |
| retrieval-augmented-few-shot-in-context-audio | Audio Captioning Transformer | 0.149 | – | – |
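As a quick sanity check on the first row, SPIDEr is defined as the arithmetic mean of the CIDEr and SPICE scores, so the reported value follows directly from the other two metrics:

```python
# SPIDEr is the arithmetic mean of CIDEr and SPICE; the AudioCaps row above
# is consistent with this definition.
cider, spice = 0.693, 0.159
spider = (cider + spice) / 2
print(round(spider, 3))  # 0.426
```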