8 months ago

Multimodal Representation

Video Understanding

Computer Vision

Kranti Kumar Parida, Neeraj Matiyali, Tanaya Guha, Gaurav Sharma

Abstract

We present an audio-visual multimodal approach for the task of zero-shot learning (ZSL) for classification and retrieval of videos. ZSL has been studied extensively in the recent past but has primarily been limited to the visual modality and to images. We demonstrate that both audio and visual modalities are important for ZSL for videos. Since a dataset to study the task is currently not available, we also construct an appropriate multimodal dataset with 33 classes containing 156,416 videos, from an existing large-scale audio event dataset. We empirically show that performance improves when the audio modality is added, for both zero-shot classification and retrieval, when using multimodal extensions of embedding learning methods. We also propose a novel method to predict the 'dominant' modality using a jointly learned modality attention network. We learn the attention in a semi-supervised setting and thus do not require any additional explicit labelling for the modalities. We provide qualitative validation of the modality-specific attention, which also successfully generalizes to unseen test classes.
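The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea, under stated assumptions: audio and video embeddings live in the same space as class embeddings, a pair of attention scores decides which modality dominates, and an unseen class is picked by cosine similarity. All names (`fuse`, `classify`, `class_embeddings`) and all numbers are hypothetical; in the paper the attention comes from a jointly learned network, whereas here the scores are supplied by hand.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fuse(audio_emb, video_emb, attention_scores):
    """Attention-weighted sum of the two modality embeddings.

    attention_scores = [audio_score, video_score]; hand-set here,
    standing in for the output of a learned attention network.
    """
    w_audio, w_video = softmax(attention_scores)
    return [w_audio * a + w_video * v for a, v in zip(audio_emb, video_emb)]


def classify(fused, class_embeddings):
    """Return the class whose embedding is most similar to the fused vector."""
    return max(class_embeddings, key=lambda name: cosine(fused, class_embeddings[name]))


# Toy class embeddings for two classes (made-up values).
class_embeddings = {"dog": [1.0, 0.0, 0.0], "train": [0.0, 1.0, 0.0]}

# A toy video whose audio points towards "dog" and whose frames point towards "train".
audio_emb = [0.9, 0.1, 0.0]
video_emb = [0.2, 0.8, 0.0]

audio_dominant = fuse(audio_emb, video_emb, [2.0, 0.0])
video_dominant = fuse(audio_emb, video_emb, [0.0, 2.0])

print(classify(audio_dominant, class_embeddings))
print(classify(video_dominant, class_embeddings))
```

In this toy setup the predicted class follows whichever modality receives the larger attention weight, which is the behaviour a 'dominant' modality prediction is meant to capture. Retrieval could reuse the same similarity by ranking videos against a class embedding rather than classes against a video.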

