SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization
Ahn Young Jin ; Park Jungwoo ; Park Sangha ; Choi Jonghyun ; Kim Kee-Eung

Abstract
Visual Speech Recognition (VSR) stands at the intersection of computer vision and speech recognition, aiming to interpret spoken content from visual cues. A prominent challenge in VSR is the presence of homophenes, visually similar lip gestures that represent different phonemes. Prior approaches have sought to distinguish fine-grained visemes by aligning visual and auditory semantics, but often fell short of full synchronization. To address this, we present SyncVSR, an end-to-end learning framework that leverages quantized audio for frame-level crossmodal supervision. By integrating a projection layer that synchronizes visual representation with acoustic data, our encoder learns to generate discrete audio tokens from a video sequence in a non-autoregressive manner. SyncVSR shows versatility across tasks, languages, and modalities at the cost of a forward pass. Our empirical evaluations show that it not only achieves state-of-the-art results but also reduces data usage by up to ninefold.
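The frame-level supervision described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each video frame is paired with a fixed number `K` of quantized audio token ids drawn from a vocabulary of size `V`, and that a single linear projection maps each frame feature to `K * V` logits. The names `project`, `cross_entropy`, and `audio_token_loss`, and all shapes, are hypothetical choices made for illustration. All tokens are predicted in parallel from the frame feature alone, which is what makes the prediction non-autoregressive.

```python
import math
import random

def project(frame_feat, weight, bias):
    """Linear projection: one frame feature (D,) -> flat logits (K * V,)."""
    return [sum(w * x for w, x in zip(row, frame_feat)) + b
            for row, b in zip(weight, bias)]

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def audio_token_loss(video_feats, audio_tokens, weight, bias, vocab_size):
    """Mean cross-entropy over every audio token of every frame.

    video_feats:  T frames, each a list of D floats (assumed encoder outputs).
    audio_tokens: T frames, each a list of K token ids in [0, vocab_size).
    """
    total, count = 0.0, 0
    for feat, tokens in zip(video_feats, audio_tokens):
        logits = project(feat, weight, bias)
        for k, tok in enumerate(tokens):
            # Slice out the V logits that belong to the k-th token of this frame.
            chunk = logits[k * vocab_size:(k + 1) * vocab_size]
            total += cross_entropy(chunk, tok)
            count += 1
    return total / count

# Toy example with assumed sizes: 5 frames, 8-dim features,
# 4 audio tokens per frame, vocabulary of 16 tokens.
random.seed(0)
T, D, K, V = 5, 8, 4, 16
video_feats = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
audio_tokens = [[random.randrange(V) for _ in range(K)] for _ in range(T)]
weight = [[0.0] * D for _ in range(K * V)]  # untrained projection
bias = [0.0] * (K * V)

loss = audio_token_loss(video_feats, audio_tokens, weight, bias, V)
print(loss)
```

With an all-zero projection the predicted distribution is uniform over the vocabulary, so the loss equals `log(V)`. A trained projection would be expected to drive it lower as the visual features become predictive of the audio tokens.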
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| landmark-based-lipreading-on-lrs2 | SyncVSR | Word Error Rate (WER): 74.6 |
| landmark-based-lipreading-on-lrw | SyncVSR (Word Boundary) | Top-1 Accuracy: 80.3 |
| landmark-based-lipreading-on-lrw | SyncVSR | Top-1 Accuracy: 75.1 |
| lipreading-on-lip-reading-in-the-wild | SyncVSR (Word Boundary) | Top-1 Accuracy: 95.0 |
| lipreading-on-lip-reading-in-the-wild | SyncVSR | Top-1 Accuracy: 93.2 |
| lipreading-on-lrs2 | SyncVSR | Word Error Rate (WER): 28.9 |
| lipreading-on-lrs2 | SyncVSR | Word Error Rate (WER): 16.5 |
| lipreading-on-lrs3-ted | SyncVSR | Word Error Rate (WER): 31.2 |
| lipreading-on-lrs3-ted | SyncVSR | Word Error Rate (WER): 21.5 |
| lipreading-on-lrw-1000 | SyncVSR (Word Boundary) | Top-1 Accuracy: 58.2 |