Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning
Liu Shansong ; Hussain Atin Sakkeer ; Sun Chenshuo ; Shan Ying

Abstract
Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files. Our model utilizes audio representations from a pretrained MERT model to extract music features. However, obtaining a suitable dataset for training the MU-LLaMA model remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions. The experiments demonstrate that the proposed MU-LLaMA model, trained on our designed MusicQA dataset, achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement in the T2M-Gen research field.
Benchmarks
| Benchmark | Methodology | BERT Score | BLEU | METEOR | ROUGE |
|---|---|---|---|---|---|
| Music Question Answering on MusicQA | MU-LLaMA | 0.901 | 0.306 | 0.385 | 0.466 |
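
BLEU, METEOR, and ROUGE all score a generated answer by its word overlap with a reference answer. The sketch below illustrates that idea with a unigram-overlap F1 score, similar in spirit to ROUGE-1. It is not the evaluation code behind the table, and the source does not say which ROUGE variant was used. The function name `unigram_f1`, the lowercase whitespace tokenization, and the example sentences are all assumptions made for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated answer and a reference.

    Illustrative only: assumes lowercase whitespace tokenization and is
    not the scoring implementation used for the benchmark table.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0

    # Clipped overlap: each word counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical reference and generated answers.
    reference = "a slow piano melody"
    print(unigram_f1("a slow piano melody", reference))
    print(unigram_f1("a piano", reference))
    print(unigram_f1("fast drums", reference))
```

An identical answer scores 1.0 and an answer with no shared words scores 0.0. Published evaluations typically rely on established metric implementations with stemming, n-gram matching, or embedding similarity (as in BERT Score) rather than a hand-rolled overlap like this one.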