Bandhav Veluri, Justin Chan, Malek Itani, Tuochao Chen, Takuya Yoshioka, Shyamnath Gollakota

Abstract
We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions for processing large receptive fields in a computationally efficient manner while also leveraging the generalization performance of transformer-based architectures. Our evaluations show as much as 2.2-3.3 dB improvement in SI-SNRi compared to the prior models for this task while having a 1.2-4x smaller model size and a 1.5-2x lower runtime. We provide code, dataset, and audio samples: https://waveformer.cs.washington.edu/.
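The abstract notes that dilated causal convolutions give the encoder a large receptive field at low cost: with exponentially growing dilations, the receptive field grows exponentially in the number of layers while compute grows only linearly. A minimal sketch of that arithmetic (the kernel size and layer count below are illustrative assumptions, not the paper's actual hyperparameters):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of dilated causal
    convolutions: each layer adds (kernel_size - 1) * dilation samples."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Exponentially growing dilations 1, 2, 4, ..., 2^(L-1) -- a common
# choice for such stacks (illustrative, not the paper's configuration).
layers = 10
dilations = [2 ** i for i in range(layers)]
print(receptive_field(kernel_size=3, dilations=dilations))  # -> 2047
```

With 10 layers and kernel size 3, the stack covers 2047 samples of past context, whereas the same layers with dilation 1 would cover only 21; this is why such encoders can summarize long audio contexts cheaply in a streaming (causal) setting.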
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| streaming-target-sound-extraction-on | Waveformer | SI-SNRi: 9.43 |
| target-sound-extraction-on-fsdsoundscapes | Waveformer | SI-SNRi: 9.43 |