THE SJTU SYSTEM FOR DCASE2021 CHALLENGE TASK 6: AUDIO CAPTIONING BASED ON ENCODER PRE-TRAINING AND REINFORCEMENT LEARNING
Kai Yu, Mengyue Wu, Zeyu Xie, Xuenan Xu

Abstract
This report proposes an audio captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge Task 6. Our audio captioning system consists of a 10-layer convolutional neural network (CNN) encoder and a temporal attentional single-layer gated recurrent unit (GRU) decoder. In this challenge, there is no restriction on the usage of external data and pre-trained models. To better model the concepts in an audio clip, we pre-train the CNN encoder with audio tagging on AudioSet. After standard cross-entropy-based training, we further fine-tune the model with reinforcement learning to directly optimize the evaluation metric. Experiments show that our proposed system achieves a SPIDEr of 28.6 on the public evaluation split without ensembling.
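The abstract does not name the reinforcement learning algorithm used for fine-tuning. A common choice for optimizing a captioning metric directly is self-critical sequence training, so the sketch below assumes that method rather than describing the authors' implementation. The function name `self_critical_loss`, its arguments, and the numeric rewards are hypothetical and chosen only for illustration.

```python
import math


def self_critical_loss(sample_log_probs, sample_reward, greedy_reward):
    """Policy-gradient loss for one caption with a greedy-decoding baseline.

    sample_log_probs: log-probabilities of the tokens in a sampled caption
    sample_reward:    metric score (for example CIDEr) of the sampled caption
    greedy_reward:    metric score of the greedy-decoded caption (baseline)
    """
    # Positive advantage means the sampled caption beat the greedy one.
    advantage = sample_reward - greedy_reward
    # Minimizing this loss raises the likelihood of samples with positive
    # advantage and lowers it for samples with negative advantage.
    return -advantage * sum(sample_log_probs)


# Hypothetical two-token caption with made-up probabilities and rewards.
log_probs = [math.log(0.5), math.log(0.25)]
loss = self_critical_loss(log_probs, sample_reward=0.6, greedy_reward=0.4)
no_gain = self_critical_loss(log_probs, sample_reward=0.4, greedy_reward=0.4)
```

In a real system the log-probabilities would come from the GRU decoder and the rewards from a metric scorer. When the sampled caption scores no better than the greedy one, the loss is zero and that caption contributes no gradient.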
Benchmarks
| Benchmark | Methodology | CIDEr | SPICE | SPIDEr |
|---|---|---|---|---|
| audio-captioning-on-clotho | Ensemble-RL | 0.468 | 0.123 | 0.295 |
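SPIDEr is defined as the arithmetic mean of CIDEr and SPICE, so the three benchmark figures can be cross-checked against each other. The short sketch below uses a helper name, `spider`, that is an assumption of this example. It computes the mean from the listed CIDEr and SPICE values.

```python
def spider(cider, spice):
    """Return SPIDEr, the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2.0


# Values taken from the benchmark row above.
score = spider(0.468, 0.123)
print(f"SPIDEr from listed CIDEr and SPICE: {score:.4f}")
```

The computed mean is about 0.2955, which agrees with the listed SPIDEr of 0.295 to within the rounding of the three-decimal inputs.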