Abstract

In this paper, we propose a novel approach for generalized zero-shot learning in a multi-modal setting, where we have novel classes of audio/video during testing that are not seen during training. We use the semantic relatedness of text embeddings as a means for zero-shot learning by aligning audio and video embeddings with the corresponding class label text feature space. Our approach uses a cross-modal decoder and a composite triplet loss. The cross-modal decoder enforces a constraint that the class label text features can be reconstructed from the audio and video embeddings of data points. This helps the audio and video embeddings to move closer to the class label text embedding. The composite triplet loss makes use of the audio, video, and text embeddings. It helps bring the embeddings from the same class closer and push away the embeddings from different classes in a multi-modal setting. This helps the network to perform better on the multi-modal zero-shot learning task. Importantly, our multi-modal zero-shot learning approach works even if a modality is missing at test time. We test our approach on the generalized zero-shot classification and retrieval tasks and show that our approach outperforms other models in the presence of a single modality as well as in the presence of multiple modalities. We validate our approach by comparing it with previous approaches and using various ablations.
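
The two training signals described above can be illustrated with a minimal sketch. It is not the paper's formulation: the abstract does not specify which triplets are combined, the distance measure, or the margin, so the Euclidean distance, the margin value, the choice of two audio/video-anchored terms, and all function names here are assumptions made for illustration only.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge triplet loss: pull the positive closer than the negative."""
    return max(0.0, l2_distance(anchor, positive)
               - l2_distance(anchor, negative) + margin)


def composite_triplet_loss(audio, video, text_pos, text_neg, margin=1.0):
    """Hypothetical composite: sum of one triplet term per modality.

    Each of the audio and video embeddings acts as an anchor, the text
    embedding of the true class is the positive, and the text embedding
    of a different class is the negative.
    """
    return (triplet_loss(audio, text_pos, text_neg, margin)
            + triplet_loss(video, text_pos, text_neg, margin))


def reconstruction_loss(decoded_text, text_pos):
    """Mean squared error between a decoder's output and the class text feature.

    Stands in for the cross-modal decoder constraint; the decoder itself
    is not modeled here.
    """
    return sum((a - b) ** 2 for a, b in zip(decoded_text, text_pos)) / len(text_pos)


# Toy 2-D embeddings (made-up values, not data from the paper).
audio = [1.0, 0.0]
video = [0.9, 0.1]
text_true_class = [1.0, 0.0]
text_other_class = [-1.0, 0.0]

aligned = composite_triplet_loss(audio, video, text_true_class, text_other_class)
misaligned = composite_triplet_loss(audio, video, text_other_class, text_true_class)
```

In this toy case the loss is zero when the audio and video embeddings already sit near the text embedding of their own class, and positive when the class labels are swapped. If a modality is missing, the corresponding term could simply be dropped from the sum, which is one plausible reading of how a single-modality case might be handled.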