Ji Qingfeng; Wang Yuxin; Sun Letong

Abstract
Recently, MLP structures have regained popularity, with MLP-Mixer standing out as a prominent example. In the field of computer vision, MLP-Mixer is noted for its ability to extract data information from both channel and token perspectives, effectively acting as a fusion of channel and token information. Indeed, Mixer represents a paradigm for information extraction that amalgamates channel and token information. The essence of Mixer lies in its ability to blend information from diverse perspectives, epitomizing the true concept of "mixing" in the realm of neural network architectures. Beyond channel and token considerations, it is possible to create more tailored mixers from various perspectives to better suit specific task requirements. This study focuses on the domain of audio recognition, introducing a novel model named Audio Spectrogram Mixer with Roll-Time and Hermit FFT (ASM-RH) that incorporates insights from both time and frequency domains. Experimental results demonstrate that ASM-RH is particularly well-suited for audio data and yields promising outcomes across multiple classification tasks. The models and optimal weights files will be published.
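To make the "mixing" idea concrete, the following is a minimal NumPy sketch of a generic MLP-Mixer-style block, in which one MLP acts across tokens and a second acts across channels. It is not the ASM-RH architecture: the Roll-Time and Hermit FFT components are not specified in this abstract and are not modeled here. The function names, shapes, and random weights are illustrative assumptions, and biases and learnable normalization parameters are omitted for brevity.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # Normalize over the last axis (no learnable scale/shift in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x, w1, w2):
    # Two-layer perceptron applied along the last axis (biases omitted).
    return gelu(x @ w1) @ w2

def mixer_block(x, token_w1, token_w2, chan_w1, chan_w2):
    # x has shape (tokens, channels).
    # Token mixing: transpose so the MLP acts across the token axis.
    y = x + mlp(layer_norm(x).T, token_w1, token_w2).T
    # Channel mixing: the MLP acts across the channel axis of each token.
    return y + mlp(layer_norm(y), chan_w1, chan_w2)

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
tokens, channels, hidden = 8, 16, 32
x = rng.normal(size=(tokens, channels))
token_w1 = rng.normal(scale=0.1, size=(tokens, hidden))
token_w2 = rng.normal(scale=0.1, size=(hidden, tokens))
chan_w1 = rng.normal(scale=0.1, size=(channels, hidden))
chan_w2 = rng.normal(scale=0.1, size=(hidden, channels))

out = mixer_block(x, token_w1, token_w2, chan_w1, chan_w2)
print(out.shape)  # (8, 16)
```

For a spectrogram input, the two axes being mixed could instead be time and frequency, which is the kind of task-specific perspective the abstract alludes to; how ASM-RH actually does this is not described here.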
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-classification-on-ravdess | ASM-RH-A | Top-1 Accuracy: 75.4 |
| audio-classification-on-speech-commands-1 | ASM-RH | Accuracy: 96.51 |