Juan León-Alcázar; Fabian Caba Heilbron; Ali Thabet; Bernard Ghanem

Abstract
Active speaker detection requires a solid integration of multi-modal cues. While individual modalities can approximate a solution, accurate predictions can only be achieved by explicitly fusing the audio and visual features and modeling their temporal progression. Despite its inherently multi-modal nature, current methods still focus on modeling and fusing short-term audio-visual features for individual speakers, often at the frame level. In this paper we present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem and provides a straightforward strategy in which independent visual features from potential speakers in the scene are assigned to a previously detected speech event. Our experiments show that a small graph data structure built from a single frame allows us to approximate an instantaneous audio-visual assignment problem. Moreover, the temporal extension of this initial graph achieves a new state of the art on the AVA-ActiveSpeaker dataset with an mAP of 88.8%.
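To make the single-frame assignment idea concrete, the sketch below builds a small star-shaped graph that links one audio-feature node to the visual-feature node of each candidate speaker in a frame, then scores the audio-visual edges to pick the most likely speaker. This is a minimal illustration, not the authors' implementation: the feature dimension, the cosine-similarity affinity, and the softmax assignment are all illustrative assumptions.

```python
import numpy as np

# Hedged sketch (not the paper's code): a single-frame graph with one audio node
# connected to the visual node of every candidate speaker. Scoring the edges
# approximates the instantaneous audio-visual assignment described in the abstract.

FEAT_DIM = 128  # assumed embedding size
rng = np.random.default_rng(0)

# One audio embedding for the detected speech event, one visual embedding per face.
audio_node = rng.standard_normal(FEAT_DIM)
visual_nodes = {f"speaker_{i}": rng.standard_normal(FEAT_DIM) for i in range(3)}

# Edges of the star-shaped graph: audio node <-> each candidate's visual node.
edges = [("audio", name) for name in visual_nodes]

def edge_affinity(a: np.ndarray, v: np.ndarray) -> float:
    """Toy audio-visual affinity: cosine similarity between the two embeddings."""
    return float(a @ v / (np.linalg.norm(a) * np.linalg.norm(v) + 1e-8))

# Softmax over edge affinities yields an assignment distribution; the visual node
# with the highest probability is predicted as the active speaker for this frame.
scores = np.array([edge_affinity(audio_node, visual_nodes[name]) for _, name in edges])
probs = np.exp(scores - scores.max())
probs /= probs.sum()

for (_, name), p in zip(edges, probs):
    print(f"{name}: assignment probability {p:.2f}")
```

Extending this graph across neighboring frames (as the temporal models in the paper do) would simply add edges between nodes of consecutive frames, letting assignments be smoothed over time.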
Benchmarks
| Benchmark | Model | Validation mAP |
|---|---|---|
| audio-visual-active-speaker-detection-on-ava | MAAS-TAN | 88.8% |
| audio-visual-active-speaker-detection-on-ava | MAAS-LAN | 85.1% |