
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou

Abstract

Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.
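The hierarchical-tag conditioning described in the abstract can be sketched as follows. This is an illustrative sketch only: the exact tag vocabulary and ordering used by Qwen-Audio are not given on this page, so the tag names below (`<|startoftranscript|>`, language, task, and timestamp tags) are hypothetical placeholders for the shared and task-specific tags the decoder is conditioned on.

```python
def build_decoder_prefix(audio_language: str, task: str,
                         text_language: str, use_timestamps: bool) -> str:
    """Assemble a hierarchical tag sequence to prepend to the decoder input.

    Tags shared across datasets (e.g. the start tag) encourage knowledge
    sharing, while more specific tags (language, task, timestamp format)
    separate outputs whose label formats would otherwise interfere.
    NOTE: tag names here are illustrative, not Qwen-Audio's actual tokens.
    """
    tags = ["<|startoftranscript|>"]            # shared across all tasks
    tags.append(f"<|{audio_language}|>")        # input audio language, e.g. "en"
    tags.append(f"<|{task}|>")                  # task tag, e.g. "transcribe"
    tags.append(f"<|{text_language}|>")         # output text language
    tags.append("<|timestamps|>" if use_timestamps else "<|notimestamps|>")
    return "".join(tags)

# Example: an English transcription task without word-level timestamps.
prefix = build_decoder_prefix("en", "transcribe", "en", use_timestamps=False)
```

During multi-task training, each dataset's target text would be preceded by such a prefix, so that datasets sharing a task tag can share knowledge while their specific tags keep incompatible label formats apart.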

Code Repositories

qwenlm/qwen-audio (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| acoustic-scene-classification-on-cochlscene | Qwen-Audio | 1:1 Accuracy: 0.795 |
| acoustic-scene-classification-on-tut-acoustic | Qwen-Audio | 1:1 Accuracy: 0.649 |
| audio-captioning-on-clotho | Qwen-Audio | CIDEr: 0.441, SPICE: 0.136, SPIDEr: 0.288 |
| audio-classification-on-vocalsound | Qwen-Audio | Accuracy: 92.89 |
| emotion-recognition-in-conversation-on-meld | Qwen-Audio | Accuracy: 55.70 |
| speech-recognition-on-aishell-1 | Qwen-Audio | Word Error Rate (WER): 1.29 |
| speech-recognition-on-aishell-2-test-android-1 | Qwen-Audio | Word Error Rate (WER): 3.3 |
| speech-recognition-on-aishell-2-test-ios | Qwen-Audio | Word Error Rate (WER): 3.1 |
| speech-recognition-on-aishell-2-test-mic-1 | Qwen-Audio | Word Error Rate (WER): 3.3 |
| speech-recognition-on-librispeech-test-clean | Qwen-Audio | Word Error Rate (WER): 2.0 |
| speech-recognition-on-librispeech-test-other | Qwen-Audio | Word Error Rate (WER): 4.2 |
