Abstract

In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact on multiple real-world applications. First, we build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language. We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN, for generating realistic talking faces from the translated audio. Quantitative evaluation of LipGAN on the standard LRW test set shows that it significantly outperforms existing approaches across all standard metrics. We also subject our Face-to-Face Translation pipeline to multiple human evaluations and show that it can significantly improve the overall user experience for consuming and interacting with multimodal content across languages. Code, models and demo video are made publicly available.

Demo video: https://www.youtube.com/watch?v=aHG6Oei8jF0
Code and models: https://github.com/Rudrabha/LipGAN
