Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zeroshot Classification and Retrieval of Videos

Parida Kranti Kumar ; Matiyali Neeraj ; Guha Tanaya ; Sharma Gaurav

Abstract

We present an audio-visual multimodal approach for the task of zeroshot learning (ZSL) for classification and retrieval of videos. ZSL has been studied extensively in the recent past but has primarily been limited to the visual modality and to images. We demonstrate that both audio and visual modalities are important for ZSL for videos. Since a dataset to study the task is currently not available, we also construct an appropriate multimodal dataset with 33 classes containing 156,416 videos, from an existing large scale audio event dataset. We empirically show that the performance improves by adding the audio modality for both tasks of zeroshot classification and retrieval, when using multimodal extensions of embedding learning methods. We also propose a novel method to predict the 'dominant' modality using a jointly learned modality attention network. We learn the attention in a semi-supervised setting and thus do not require any additional explicit labelling for the modalities. We provide qualitative validation of the modality specific attention, which also successfully generalizes to unseen test classes.
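
To make the idea of a 'dominant' modality concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that audio, video, and class-label embeddings already live in one shared space, and that some attention network has produced two scores (one per modality). The function names, the softmax weighting, and the use of Euclidean distance are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def classify_with_modality_attention(audio_emb, video_emb, class_embs, attn_logits):
    """Pick the nearest class using attention-weighted distances.

    audio_emb, video_emb: 1-D arrays in the shared embedding space (assumed).
    class_embs: 2-D array, one row per class embedding (assumed).
    attn_logits: two scores [audio, video] from a hypothetical attention network.
    Returns (predicted class index, modality weights).
    """
    weights = softmax(np.asarray(attn_logits, dtype=float))  # sums to 1
    d_audio = np.linalg.norm(class_embs - audio_emb, axis=1)
    d_video = np.linalg.norm(class_embs - video_emb, axis=1)
    fused = weights[0] * d_audio + weights[1] * d_video
    return int(np.argmin(fused)), weights

# Hypothetical toy values: two classes in a 2-D embedding space.
class_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
audio_emb = np.array([0.9, 0.1])   # audio is close to class 0
video_emb = np.array([0.5, 0.5])   # video is ambiguous
pred, weights = classify_with_modality_attention(
    audio_emb, video_emb, class_embs, attn_logits=[2.0, 0.0]
)
```

In this toy case the attention scores favour audio, so the audio distance dominates the fused score. The same fused distance could in principle rank videos for retrieval, though that is again an assumption rather than something stated in the abstract.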

Benchmarks

Benchmark                                          Methodology   Metrics
gzsl-video-classification-on-activitynet-gzsl-1    CJME          HM: 5.12, ZSL: 5.84
gzsl-video-classification-on-ucf-gzsl-main         CJME          HM: 12.48, ZSL: 8.29
gzsl-video-classification-on-vggsound-gzsl-1       CJME          HM: 6.17, ZSL: 5.16
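
In generalized zero-shot learning, HM conventionally denotes the harmonic mean of the accuracy on seen classes and the accuracy on unseen classes. Assuming the HM values above follow that convention (the listing itself does not define them), the metric can be sketched as below. The function name and the input values are hypothetical and are not results from the paper.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen-class and unseen-class accuracy.

    Returns 0.0 when both accuracies are zero to avoid division by zero.
    """
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / total

# Hypothetical accuracies (in percent), chosen only to illustrate the formula.
hm = harmonic_mean(10.0, 5.0)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model that does well on seen classes but poorly on unseen ones still receives a low HM.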
