Bandhav Veluri, Justin Chan, Malek Itani, Tuochao Chen, Takuya Yoshioka, Shyamnath Gollakota

Abstract
We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions for processing large receptive fields in a computationally efficient manner while also leveraging the generalization performance of transformer-based architectures. Our evaluations show as much as 2.2-3.3 dB improvement in SI-SNRi compared to the prior models for this task while having a 1.2-4x smaller model size and a 1.5-2x lower runtime. We provide code, dataset, and audio samples: https://waveformer.cs.washington.edu/.
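The abstract notes that dilated causal convolutions give the encoder a large receptive field at low cost: with exponentially growing dilations, the receptive field grows exponentially in the number of layers while compute grows only linearly. A minimal sketch of that arithmetic (the kernel size and layer count below are illustrative assumptions, not the paper's actual hyperparameters):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of dilated causal
    convolutions: each layer adds (kernel_size - 1) * dilation samples."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Exponentially growing dilations 1, 2, 4, ..., 2^(L-1) -- a common
# choice for such stacks (illustrative, not the paper's configuration).
layers = 10
dilations = [2 ** i for i in range(layers)]
print(receptive_field(kernel_size=3, dilations=dilations))  # -> 2047
```

With 10 layers and kernel size 3, the stack covers 2047 samples of past context, whereas the same layers with dilation 1 would cover only 21; this is why such encoders can summarize long audio contexts cheaply in a streaming (causal) setting.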
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| streaming-target-sound-extraction-on | Waveformer | SI-SNRi: 9.43 |
| target-sound-extraction-on-fsdsoundscapes | Waveformer | SI-SNRi: 9.43 |