Juan León-Alcázar; Fabian Caba Heilbron; Ali Thabet; Bernard Ghanem

Abstract
Active speaker detection requires a solid integration of multi-modal cues. While individual modalities can approximate a solution, accurate predictions can only be achieved by explicitly fusing the audio and visual features and modeling their temporal progression. Despite its inherently multi-modal nature, current methods still focus on modeling and fusing short-term audio-visual features for individual speakers, often at the frame level. In this paper we present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem and provides a straightforward strategy in which independent visual features from potential speakers in the scene are assigned to a previously detected speech event. Our experiments show that a small graph data structure built from a single frame allows us to approximate an instantaneous audio-visual assignment problem. Moreover, the temporal extension of this initial graph achieves a new state of the art on the AVA-ActiveSpeaker dataset with an mAP of 88.8%.
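To make the single-frame assignment idea concrete, the sketch below builds a small star-shaped graph that links one audio-feature node to the visual-feature node of each candidate speaker in a frame, then scores the audio-visual edges to pick the most likely speaker. This is a minimal illustration, not the authors' implementation: the feature dimension, the cosine-similarity affinity, and the softmax assignment are all illustrative assumptions.

```python
import numpy as np

# Hedged sketch (not the paper's code): a single-frame graph with one audio node
# connected to the visual node of every candidate speaker. Scoring the edges
# approximates the instantaneous audio-visual assignment described in the abstract.

FEAT_DIM = 128  # assumed embedding size
rng = np.random.default_rng(0)

# One audio embedding for the detected speech event, one visual embedding per face.
audio_node = rng.standard_normal(FEAT_DIM)
visual_nodes = {f"speaker_{i}": rng.standard_normal(FEAT_DIM) for i in range(3)}

# Edges of the star-shaped graph: audio node <-> each candidate's visual node.
edges = [("audio", name) for name in visual_nodes]

def edge_affinity(a: np.ndarray, v: np.ndarray) -> float:
    """Toy audio-visual affinity: cosine similarity between the two embeddings."""
    return float(a @ v / (np.linalg.norm(a) * np.linalg.norm(v) + 1e-8))

# Softmax over edge affinities yields an assignment distribution; the visual node
# with the highest probability is predicted as the active speaker for this frame.
scores = np.array([edge_affinity(audio_node, visual_nodes[name]) for _, name in edges])
probs = np.exp(scores - scores.max())
probs /= probs.sum()

for (_, name), p in zip(edges, probs):
    print(f"{name}: assignment probability {p:.2f}")
```

Extending this graph across neighboring frames (as the temporal models in the paper do) would simply add edges between nodes of consecutive frames, letting assignments be smoothed over time.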
Benchmarks
| Benchmark | Model | Validation mAP |
|---|---|---|
| audio-visual-active-speaker-detection-on-ava | MAAS-TAN | 88.8% |
| audio-visual-active-speaker-detection-on-ava | MAAS-LAN | 85.1% |