Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

Hao Jiang, Calvin Murdock, Vamsi Krishna Ithapu

Abstract

Augmented reality devices have the potential to enhance human perception and enable other assistive functionalities in complex conversational environments. Effectively capturing the audio-visual context necessary for understanding these social interactions first requires detecting and localizing the voice activities of the device wearer and the surrounding people. These tasks are challenging due to their egocentric nature: the wearer's head motion may cause motion blur, surrounding people may appear in difficult viewing angles, and there may be occlusions, visual clutter, audio noise, and bad lighting. Under these conditions, previous state-of-the-art active speaker detection methods do not give satisfactory results. Instead, we tackle the problem from a new setting using both video and multi-channel microphone array audio. We propose a novel end-to-end deep learning approach that is able to give robust voice activity detection and localization results. In contrast to previous methods, our method localizes active speakers from all possible directions on the sphere, even outside the camera's field of view, while simultaneously detecting the device wearer's own voice activity. Our experiments show that the proposed method gives superior results, can run in real time, and is robust against noise and clutter.

Benchmarks

Benchmark | Methodology | Metrics
active-speaker-localization-on-easycom | AV (cor+eng+box) | ASL mAP: 0.8632
