Rethinking Transfer and Auxiliary Learning for Improving Audio Captioning Transformer
Sung Won Han, Seungjin Lee, Dongwon Kim, Jin Sob Kim, Hyun Joon Park, WooSeok Shin
Abstract
The performance of automated audio captioning (AAC) has improved considerably through transformer-based encoders and transfer learning. However, further improvement is constrained by two problems: (1) a discrepancy in input patch size between the pretraining and fine-tuning steps, and (2) a lack of local-level relations between inputs and captions. In this paper, we propose a simple transfer learning scheme that, unlike previous methods, maintains the input patch size to avoid this discrepancy. Furthermore, we propose a patch-wise keyword estimation branch that uses attention pooling to represent both global- and local-level information effectively. Results on the AudioCaps dataset show that the proposed learning scheme and method contribute considerably to the performance gain. Finally, visualization results demonstrate that the proposed attention pooling method effectively detects local-level information in the AAC system.
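The abstract does not give the exact formulation of the keyword estimation branch, so the following is only a minimal sketch of attention pooling as it is commonly used in multiple-instance-learning setups. It assumes each patch produces one classification logit and one attention logit per keyword; the function name `attention_pool` and both input names are hypothetical and not taken from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_pool(cls_logits, att_logits):
    """Pool patch-wise keyword scores into clip-level scores.

    Both inputs are nested lists of shape [num_patches][num_keywords].
    This layout is an assumption made for illustration only.
    Returns (clip_probs, att_weights), where att_weights[k] holds the
    per-patch weights for keyword k and sums to 1.
    """
    num_patches = len(cls_logits)
    num_keywords = len(cls_logits[0])
    clip_probs, att_weights = [], []
    for k in range(num_keywords):
        # Attention weights: softmax across patches for this keyword.
        weights = softmax([att_logits[t][k] for t in range(num_patches)])
        # Patch-level (local) keyword probabilities.
        probs = [sigmoid(cls_logits[t][k]) for t in range(num_patches)]
        # Clip-level (global) probability: attention-weighted average.
        clip_probs.append(sum(w * p for w, p in zip(weights, probs)))
        att_weights.append(weights)
    return clip_probs, att_weights
```

In a sketch like this, the clip-level probabilities carry the global information, while the per-patch attention weights indicate which patches support each keyword, which is the kind of local-level signal that could be visualized.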
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-captioning-on-audiocaps | Rethink-ACT (AST + TF + MIL) | BLEU-4: 0.285; CIDEr: 0.764; METEOR: 0.242; ROUGE-L: 0.504; SPICE: 0.180; SPIDEr: 0.472 |
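As a quick consistency check on the table, SPIDEr is conventionally defined as the arithmetic mean of CIDEr and SPICE. The helper name `spider` below is hypothetical; the snippet only recomputes the reported value from the two component scores.

```python
def spider(cider, spice):
    """SPIDEr as the mean of CIDEr and SPICE (the conventional definition)."""
    return (cider + spice) / 2.0


# Component scores taken from the benchmark row above.
reported_spider = 0.472
recomputed = spider(cider=0.764, spice=0.180)
assert abs(recomputed - reported_spider) < 1e-9
```

The recomputed value agrees with the reported SPIDEr of 0.472.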