Matthew Baas, Benjamin van Niekerk, Herman Kamper

Abstract
Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference. Recent methods produce convincing conversions, but at the cost of increased complexity -- making results difficult to reproduce and build on. Instead, we keep it simple. We propose k-nearest neighbors voice conversion (kNN-VC): a straightforward yet effective method for any-to-any conversion. First, we extract self-supervised representations of the source and reference speech. To convert to the target speaker, we replace each frame of the source representation with its nearest neighbor in the reference. Finally, a pretrained vocoder synthesizes audio from the converted representation. Objective and subjective evaluations show that kNN-VC improves speaker similarity with similar intelligibility scores to existing methods. Code, samples, trained models: https://bshall.github.io/knn-vc
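The conversion step described in the abstract boils down to a per-frame nearest-neighbour lookup in the reference speaker's feature set. The sketch below illustrates that matching step only; it assumes source and reference features have already been extracted with a self-supervised encoder and that a pretrained vocoder handles synthesis. The choice of cosine similarity, the `k` parameter, and averaging over the selected frames are illustrative assumptions here, not details stated in the abstract.

```python
# Minimal sketch of the kNN matching step, assuming features are already extracted.
import torch

def knn_convert(source_feats: torch.Tensor,
                ref_feats: torch.Tensor,
                k: int = 4) -> torch.Tensor:
    """Replace each source frame with the mean of its k nearest reference frames.

    source_feats: (T_src, D) frames of the source utterance.
    ref_feats:    (T_ref, D) frames pooled from the reference utterances.
    Returns:      (T_src, D) converted frames to pass to the vocoder.
    """
    # Normalize so the dot product equals cosine similarity (an assumed metric).
    src = torch.nn.functional.normalize(source_feats, dim=-1)
    ref = torch.nn.functional.normalize(ref_feats, dim=-1)

    sims = src @ ref.T                  # (T_src, T_ref) pairwise similarities
    idx = sims.topk(k, dim=-1).indices  # k closest reference frames per source frame
    return ref_feats[idx].mean(dim=1)   # average the selected reference frames

# Usage (hypothetical names): converted = knn_convert(src_feats, ref_feats)
# followed by audio = vocoder(converted) with a pretrained vocoder.
```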
Benchmarks
| Benchmark | Methodology | CER (%) | EER (%) | WER (%) |
|---|---|---|---|---|
| voice-conversion-on-librispeech-test-clean | kNN-VC (prematched HiFiGAN) | 2.96 | 37.15 | 7.36 |