BTS-E: Audio Deepfake Detection Using Breathing-Talking-Silence Encoder

Kihun Hong, Souhwan Jung, Long Nguyen-Vu, Thien-Phuc Doan

Abstract

Voice phishing (vishing) is increasingly prevalent due to advances in speech synthesis technology. In particular, the use of deep learning to generate an audio clip of arbitrary content simulating the victim's voice makes it difficult not only for humans but also for automatic speaker verification (ASV) systems to distinguish genuine from synthetic speech. Countermeasure (CM) systems have recently been developed to help ASV combat synthetic speech. In this work, we propose BTS-E, a framework that evaluates the correlation between Breathing, Talking (speech), and Silence sounds in an audio clip and uses this information for deepfake detection. We argue that natural human sounds, such as breathing, are hard for text-to-speech (TTS) systems to synthesize. We conducted a large-scale evaluation using the ASVspoof 2019 and 2021 evaluation sets to validate our hypothesis. The experimental results show the applicability of the breathing sound feature in detecting deepfake voices. Overall, the proposed system increases classifier performance by up to 46%.
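The paper does not publish its encoder here, but the first step it implies is partitioning a clip into silence and non-silence (talking/breathing) regions before measuring their correlation. Below is a minimal, hypothetical sketch of such a segmentation using short-time RMS energy with NumPy; the frame size, hop, and threshold are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def segment_frames(signal, frame_len=400, hop=200, silence_db=-40.0):
    """Label each frame as 'silence' or 'sound' by short-time RMS energy
    relative to the clip's peak. Hypothetical pre-processing sketch only;
    BTS-E's actual encoder and thresholds are not reproduced here."""
    peak = np.max(np.abs(signal)) + 1e-12  # avoid log(0) on all-zero clips
    labels = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        level_db = 20 * np.log10(rms / peak + 1e-12)
        labels.append("silence" if level_db < silence_db else "sound")
    return labels

# toy clip at 16 kHz: first half silence, second half a 440 Hz tone
sr = 16000
t = np.arange(sr // 2) / sr
clip = np.concatenate([np.zeros(sr // 2), 0.5 * np.sin(2 * np.pi * 440 * t)])
labels = segment_frames(clip)
```

A real system would further split the "sound" frames into speech versus breath (e.g. by spectral features), which is where the breathing cue the paper relies on would be extracted.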

Benchmarks

Benchmark                                    Methodology   Metrics
audio-deepfake-detection-on-asvspoof-2021    BTS-E         21DF EER: /  ·  21LA EER: 8.75
