Active Speakers in Context
Juan Leon Alcazar, Fabian Caba Heilbron, Long Mai, Federico Perazzi, Joon-Young Lee, Pablo Arbelaez, Bernard Ghanem

Abstract
Current methods for active speaker detection focus on modeling short-term audio-visual information from a single speaker. Although this strategy can be enough for addressing single-speaker scenarios, it prevents accurate detection when the task is to identify which of many candidate speakers are talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our Active Speaker Context is designed to learn pairwise and temporal relations from a structured ensemble of audio-visual observations. Our experiments show that a structured feature ensemble already benefits active speaker detection performance. Moreover, we find that the proposed Active Speaker Context improves the state of the art on the AVA-ActiveSpeaker dataset, achieving a mAP of 87.1%. We present ablation studies that verify that this result is a direct consequence of our long-term multi-speaker analysis.
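The abstract's core idea, learning pairwise relations across candidate speakers and temporal relations over a long horizon from a structured ensemble of audio-visual features, can be illustrated with a minimal sketch. This is not the authors' implementation: the module names, feature dimensions, and the specific choice of multi-head attention over the speaker axis plus an LSTM over the time axis are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the "structured ensemble" idea:
# stack per-candidate audio-visual embeddings into a (speakers x time x feat)
# tensor, relate speakers pairwise with self-attention, then model long-term
# temporal structure with an LSTM. All names and dimensions are hypothetical.
import torch
import torch.nn as nn

class ActiveSpeakerContextSketch(nn.Module):
    def __init__(self, feat_dim=128, num_heads=4):
        super().__init__()
        # Pairwise relations: attention across the candidate-speaker axis.
        self.pairwise = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Temporal relations: recurrent model over the long time horizon.
        self.temporal = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.classifier = nn.Linear(feat_dim, 1)  # speaking / not speaking

    def forward(self, ensemble):
        # ensemble: (batch, speakers, time, feat_dim) of fused A/V embeddings.
        b, s, t, d = ensemble.shape
        # Attend across speakers independently at each time step.
        x = ensemble.permute(0, 2, 1, 3).reshape(b * t, s, d)
        x, _ = self.pairwise(x, x, x)
        # Regroup and model each candidate's sequence over the full horizon.
        x = x.reshape(b, t, s, d).permute(0, 2, 1, 3).reshape(b * s, t, d)
        x, _ = self.temporal(x)
        logits = self.classifier(x).reshape(b, s, t)  # per-speaker, per-frame
        return logits

# Usage: 2 clips, 3 candidate speakers, 64 time steps, 128-d features.
logits = ActiveSpeakerContextSketch()(torch.randn(2, 3, 64, 128))
print(logits.shape)  # torch.Size([2, 3, 64])
```

The key design point this sketch captures is the factorization: relations between speakers are computed at each time step, while the long-term dynamics of each candidate are modeled separately over the whole clip, rather than scoring a single speaker from a short snippet in isolation.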
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| Audio-Visual Active Speaker Detection on AVA-ActiveSpeaker | Active Speakers in Context | Validation mean average precision (mAP): 87.1% |