Multi-Task Learning for Audio Visual Active Speaker Detection

Shiguang Shan, Shuang Yang, Jingyun Xiao, Yuanhang Zhang


Abstract

This report describes the approach underlying our submission to the active speaker detection task (task B-2) of the ActivityNet Challenge 2019. We introduce a new audio-visual model that builds upon a 3D-ResNet18 visual model pretrained for lipreading and a VGG-M acoustic model pretrained for audio-to-video synchronization. The model is trained with two losses in a multi-task learning fashion: a contrastive loss that enforces matching between audio and video features for active speakers, and a standard cross-entropy loss that produces speaker/non-speaker labels. This model obtains 84.0% mAP on the validation set of AVA-ActiveSpeaker. Experimental results showcase the pretrained embeddings' ability to transfer across tasks and data formats, as well as the advantage of the proposed multi-task learning strategy.
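The two-term objective described above can be illustrated with a short PyTorch sketch. This is a minimal sketch rather than the authors' implementation: the class name MultiTaskASDLoss, the margin-based contrastive formulation, and the relative weighting of the two terms are assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskASDLoss(nn.Module):
    """Hypothetical multi-task loss: contrastive matching of audio/video
    embeddings plus cross-entropy on speaker / non-speaker logits."""

    def __init__(self, margin: float = 1.0, contrastive_weight: float = 1.0):
        super().__init__()
        self.margin = margin                      # assumed margin value
        self.contrastive_weight = contrastive_weight  # assumed loss weight

    def forward(self, audio_emb, video_emb, logits, labels):
        # Euclidean distance between paired audio and video embeddings.
        dist = F.pairwise_distance(audio_emb, video_emb)

        # Contrastive term: pull matched (active-speaker) pairs together,
        # push non-matching pairs apart by at least `margin`.
        is_speaking = labels.float()
        contrastive = (is_speaking * dist.pow(2)
                       + (1.0 - is_speaking) * F.relu(self.margin - dist).pow(2)).mean()

        # Cross-entropy term on the speaker / non-speaker prediction.
        ce = F.cross_entropy(logits, labels)

        return ce + self.contrastive_weight * contrastive

In such a setup, the audio and video embeddings would come from the pretrained VGG-M and 3D-ResNet18 branches respectively, and the weight of the contrastive term would be tuned on the validation set; the report does not specify these hyperparameters.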

Benchmarks

Benchmark:   audio-visual-active-speaker-detection-on-ava
Methodology: 3D-ResNet-GRU
Metrics:     validation mean average precision: 84.0%
