Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor
Younglo Lee, Shukjae Choi, Byeong-Yeol Kim, Zhong-Qiu Wang, Shinji Watanabe

Abstract
We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) a dual-path processing block that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations. Given a fixed, small set of learned speaker queries and the mixture embedding produced by the dual-path blocks, TDA infers the relations of these queries and generates an attractor vector for each speaker. The estimated attractors are then combined with the mixture embedding by feature-wise linear modulation conditioning, creating a speaker dimension. The mixture embedding, conditioned with speaker information produced by TDA, is fed to the final triple-path blocks, which augment the dual-path blocks with an additional pathway dedicated to inter-speaker processing. The proposed approach outperforms the previous best results reported in the literature, achieving 24.0 and 23.7 dB SI-SDR improvement (SI-SDRi) on WSJ0-2mix and WSJ0-3mix, respectively, with a single model trained to separate 2- and 3-speaker mixtures. The proposed model also exhibits strong performance and generalizability at counting sources and separating mixtures with up to 5 speakers.
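The conditioning step described in the abstract, where per-speaker attractors modulate a shared mixture embedding and thereby create a speaker dimension, can be illustrated with a small NumPy sketch of feature-wise linear modulation (FiLM). This is only an illustration under stated assumptions: the function name `film_condition`, the tensor layout `(T, F, D)`, and the random projection matrices `w_gamma` and `w_beta` are hypothetical and are not taken from the paper, which may parameterize the scale and shift differently.

```python
import numpy as np

def film_condition(mixture_emb, attractors, w_gamma, w_beta):
    """Apply FiLM conditioning (hypothetical layout, not the paper's code).

    mixture_emb: (T, F, D) embedding of the mixture
    attractors:  (S, D) one attractor vector per speaker
    w_gamma, w_beta: (D, D) projections producing scale and shift
    Returns an array of shape (S, T, F, D) with a new speaker axis.
    """
    gamma = attractors @ w_gamma  # (S, D) per-speaker scale
    beta = attractors @ w_beta    # (S, D) per-speaker shift
    # Broadcast each speaker's scale and shift over time and frequency.
    return gamma[:, None, None, :] * mixture_emb[None] + beta[:, None, None, :]

rng = np.random.default_rng(0)
T, F, D, S = 4, 5, 8, 3  # assumed toy sizes
mixture_emb = rng.standard_normal((T, F, D))
attractors = rng.standard_normal((S, D))
w_gamma = rng.standard_normal((D, D))
w_beta = rng.standard_normal((D, D))

conditioned = film_condition(mixture_emb, attractors, w_gamma, w_beta)
print(conditioned.shape)
```

The same mixture embedding is reused for every speaker; only the scale and shift differ, so the number of output streams follows the number of attractors rather than being fixed by the network.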
Benchmarks
| Benchmark | Model | SI-SDRi (dB) |
|---|---|---|
| speech-separation-on-wsj0-2mix | SepTDA (L=12) | 24.0 |
| speech-separation-on-wsj0-3mix | SepTDA | 23.7 |
| speech-separation-on-wsj0-4mix | SepTDA | 22.0 |
| speech-separation-on-wsj0-5mix | SepTDA | 21.0 |
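For readers unfamiliar with the metric in the table, the sketch below computes SI-SDR using its standard definition (project the estimate onto the reference, then compare target and residual energy) and SI-SDRi as the gain over the unprocessed mixture. The synthetic signals and the helper names `si_sdr` and `si_sdr_improvement` are assumptions for illustration; this is not the evaluation code behind the reported numbers.

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference that best explains the estimate.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10(np.sum(target**2) / np.sum(residual**2))

def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

# Synthetic stand-ins for speech (assumed data, for illustration only).
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
interferer = rng.standard_normal(16000)
mixture = reference + interferer
estimate = reference + 0.1 * interferer  # imperfect separation

improvement = si_sdr_improvement(estimate, mixture, reference)
print(f"SI-SDRi: {improvement:.1f} dB")
```

Because the reference is rescaled before comparison, multiplying the estimate by any nonzero constant leaves the score unchanged, which is why the metric is called scale-invariant.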