Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer

Sung Won Han, Seungjin Lee, Dongwon Kim, Jin Sob Kim, Hyun Joon Park, WooSeok Shin

Abstract

The performance of automated audio captioning (AAC) has improved considerably through transformer-based encoders and transfer learning. However, further improvement is constrained by two problems: (1) a discrepancy in input patch size between the pretraining and fine-tuning steps, and (2) a lack of local-level relations between inputs and captions. In this paper, we propose a simple transfer learning scheme that, unlike previous methods, maintains the input patch size to avoid this discrepancy. Furthermore, we propose a patch-wise keyword estimation branch that uses attention pooling to represent both global- and local-level information effectively. Results on the AudioCaps dataset show that the proposed learning scheme and method contribute considerably to the performance gain. Finally, visualization results demonstrate that the proposed attention-pooling method effectively detects local-level information in the AAC system.
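To make the idea of attention pooling over patch-wise keyword estimates more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a generic multiple-instance-learning formulation in which each patch produces a keyword logit and an attention logit, and the clip-level keyword probability is the attention-weighted sum of the patch-level probabilities. The names `attention_pool`, `cls_logits`, and `att_logits`, and all the numbers, are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(cls_logits, att_logits):
    """Pool patch-level keyword estimates into clip-level probabilities.

    Both inputs have shape [num_patches][num_keywords] (hypothetical layout).
    Returns (clip_probs, weights), where weights[k][t] is the attention
    that keyword k places on patch t.
    """
    num_patches = len(cls_logits)
    num_keywords = len(cls_logits[0])
    clip_probs, weights = [], []
    for k in range(num_keywords):
        # Attention is normalized across patches, separately per keyword.
        w = softmax([att_logits[t][k] for t in range(num_patches)])
        p = [sigmoid(cls_logits[t][k]) for t in range(num_patches)]
        clip_probs.append(sum(wi * pi for wi, pi in zip(w, p)))
        weights.append(w)
    return clip_probs, weights


# Toy input: 3 patches, 2 keywords (made-up values).
cls_logits = [[2.0, -1.0], [0.0, -2.0], [-3.0, 4.0]]
att_logits = [[1.0, 0.0], [0.0, 0.0], [-1.0, 3.0]]
clip_probs, weights = attention_pool(cls_logits, att_logits)
```

In this sketch the per-keyword attention weights play the role of the local-level signal: inspecting `weights[k]` shows which patches drive the estimate for keyword `k`, while `clip_probs` summarizes the clip as a whole.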

Benchmarks

Benchmark	Methodology	BLEU-4	CIDEr	METEOR	ROUGE-L	SPICE	SPIDEr
audio-captioning-on-audiocaps	Rethink-ACT (AST + TF + MIL)	0.285	0.764	0.242	0.504	0.180	0.472
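SPIDEr is conventionally defined as the arithmetic mean of CIDEr and SPICE, so the reported scores can be cross-checked with a few lines of Python. The helper name `spider` is only illustrative; the values are copied from the benchmark row above.

```python
def spider(cider, spice):
    # SPIDEr is the arithmetic mean of the CIDEr and SPICE scores.
    return (cider + spice) / 2.0


# Values reported for Rethink-ACT (AST + TF + MIL) on AudioCaps.
reported = {"CIDEr": 0.764, "SPICE": 0.180, "SPIDEr": 0.472}

computed = spider(reported["CIDEr"], reported["SPICE"])
is_consistent = abs(computed - reported["SPIDEr"]) < 5e-4
```

Here `is_consistent` evaluates to true: (0.764 + 0.180) / 2 = 0.472, which matches the listed SPIDEr value to three decimal places.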
