5 months ago

SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization

Ahn Young Jin ; Park Jungwoo ; Park Sangha ; Choi Jonghyun ; Kim Kee-Eung

Abstract

Visual Speech Recognition (VSR) stands at the intersection of computer vision and speech recognition, aiming to interpret spoken content from visual cues. A prominent challenge in VSR is the presence of homophenes: visually similar lip gestures that represent different phonemes. Prior approaches have sought to distinguish fine-grained visemes by aligning visual and auditory semantics, but often fell short of full synchronization. To address this, we present SyncVSR, an end-to-end learning framework that leverages quantized audio for frame-level crossmodal supervision. By integrating a projection layer that synchronizes visual representation with acoustic data, our encoder learns to generate discrete audio tokens from a video sequence in a non-autoregressive manner. SyncVSR shows versatility across tasks, languages, and modalities at the cost of a forward pass. Our empirical evaluations show that it not only achieves state-of-the-art results but also reduces data usage by up to ninefold.
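To make the frame-level supervision described in the abstract concrete, here is a minimal, dependency-free sketch of one way such an objective could look: a projection layer maps each video frame's feature vector to logits over discrete audio tokens, and a cross-entropy loss compares them with the quantized audio aligned to that frame. This is an illustration under stated assumptions, not the authors' implementation. The function names, the number of audio tokens per frame (`K`), the vocabulary size (`V`), and the feature dimension (`D`) are all hypothetical, and how such a term is combined with the recognition objective is not specified here.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector stored as a plain list."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def audio_sync_loss(frame_features, audio_tokens, weight, bias, vocab_size):
    """Mean cross-entropy between predicted and target audio tokens.

    frame_features: T feature vectors, one per video frame (visual encoder output).
    audio_tokens:   T lists of K integer ids (quantized audio aligned to each frame).
    weight, bias:   hypothetical projection layer producing K * vocab_size logits.
    """
    total, count = 0.0, 0
    for features, targets in zip(frame_features, audio_tokens):
        # Non-autoregressive: all K token slots of a frame are predicted in one
        # shot from the frame feature, never from previously predicted tokens.
        logits = linear(features, weight, bias)
        for k, target in enumerate(targets):
            slot = logits[k * vocab_size:(k + 1) * vocab_size]
            total -= log_softmax(slot)[target]
            count += 1
    return total / count


random.seed(0)
T, D, K, V = 5, 8, 2, 16  # frames, feature dim, tokens per frame, vocab (assumed values)
frame_features = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
audio_tokens = [[random.randrange(V) for _ in range(K)] for _ in range(T)]
weight = [[0.0] * D for _ in range(K * V)]  # zero init gives uniform predictions
bias = [0.0] * (K * V)

loss = audio_sync_loss(frame_features, audio_tokens, weight, bias, V)
print(loss, math.log(V))  # with uniform predictions the loss equals log(V)
```

Because every token slot is scored independently from the frame feature, the extra supervision costs only one projection on top of the encoder's forward pass, which matches the abstract's "at the cost of a forward pass" remark in spirit.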

Code Repositories

KAIST-AILab/SyncVSR (Official, jax, mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
landmark-based-lipreading-on-lrs2 | SyncVSR | Word Error Rate (WER): 74.6
landmark-based-lipreading-on-lrw | SyncVSR (Word Boundary) | Top-1 Accuracy: 80.3
landmark-based-lipreading-on-lrw | SyncVSR | Top-1 Accuracy: 75.1
lipreading-on-lip-reading-in-the-wild | SyncVSR (Word Boundary) | Top-1 Accuracy: 95.0
lipreading-on-lip-reading-in-the-wild | SyncVSR | Top-1 Accuracy: 93.2
lipreading-on-lrs2 | SyncVSR | Word Error Rate (WER): 28.9
lipreading-on-lrs2 | SyncVSR | Word Error Rate (WER): 16.5
lipreading-on-lrs3-ted | SyncVSR | Word Error Rate (WER): 31.2
lipreading-on-lrs3-ted | SyncVSR | Word Error Rate (WER): 21.5
lipreading-on-lrw-1000 | SyncVSR (Word Boundary) | Top-1 Accuracy: 58.2
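
Several of the benchmarks above report Word Error Rate (WER). As a reminder of what that metric measures, the sketch below computes the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the example sentences are illustrative only, and the values listed above are presumably percentages, i.e. this ratio multiplied by 100.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.1f}")
```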
