Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

Takaki Makino, Hank Liao, Yannis Assael, Brendan Shillingford, Basilio Garcia, Otavio Braga, Olivier Siohan


Abstract

This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, yielding 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state of the art on the LRS3-TED set.
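The RNN-T named in the abstract scores a transcript by summing over every monotonic alignment between acoustic frames and output labels, computed with a forward recursion over a (frame, emitted-label) lattice. As a minimal sketch of that recursion (not the paper's implementation; the function name and the dense `log_probs` layout are illustrative assumptions):

```python
import numpy as np

def rnnt_log_likelihood(log_probs, labels, blank=0):
    """Forward algorithm over the RNN-T alignment lattice.

    log_probs: array of shape (T, U+1, V) holding log P(k | t, u),
    i.e. the joint-network output at each lattice node (acoustic
    frame t, number of labels emitted so far u). labels: the U
    target label ids. Returns log P(labels | inputs), summed over
    all monotonic alignments. Illustrative sketch, not the paper's code.
    """
    T, U1, V = log_probs.shape
    U = U1 - 1
    assert len(labels) == U
    alpha = np.full((T, U1), -np.inf)  # alpha[t, u] = log-prob of reaching node (t, u)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t == 0 and u == 0:
                continue
            cands = []
            if t > 0:  # arrive by emitting blank, advancing one frame
                cands.append(alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:  # arrive by emitting label u-1 at frame t
                cands.append(alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
            alpha[t, u] = np.logaddexp.reduce(cands)
    # terminate with a final blank from the last lattice node
    return alpha[T - 1, U] + log_probs[T - 1, U, blank]
```

With uniform emission probabilities 1/V at every node, each alignment of one label across two frames contributes (1/V)^3 and there are two such alignments, so the sketch can be sanity-checked against the closed-form count of lattice paths.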

Benchmarks

Benchmark | Methodology | Metrics
audio-visual-speech-recognition-on-lrs3-ted | RNN-T | Word Error Rate (WER): 4.5
lipreading-on-lrs3-ted | RNN-T | Word Error Rate (WER): 33.6
