6 months ago

Multimodal Representation

Action Recognition

Computer Vision

Petros Daras, Dimitrios Konstantinidis, Kosmas Dimitropoulos, Ilias Papastratis

Abstract

Continuous Sign Language Recognition (CSLR) refers to the challenging problem of recognizing sign language glosses and their temporal boundaries from weakly annotated video sequences. Previous methods focus mostly on visual feature extraction, neglecting text information and failing to model intra-gloss dependencies effectively. This work proposes a cross-modal learning approach that leverages text information to improve vision-based CSLR. Two encoding networks first produce video and text embeddings, which are then mapped and aligned into a joint latent representation. The cross-modal alignment is intended to model intra-gloss dependencies and to yield more descriptive video-based latent representations for CSLR. The method is trained jointly on video and text latent representations, and the aligned video latent representations are then classified by a jointly trained decoder. Extensive experiments on three well-known sign language recognition datasets, together with comparisons against state-of-the-art approaches, demonstrate the potential of the proposed approach.
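The idea of mapping video and text embeddings into a joint latent space and aligning them can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the encoders, the projection layers, or the alignment objective, so the linear projections, the dimensions, the random stand-in embeddings, and the mean squared distance loss below are hypothetical choices, not the authors' method. It also assumes each video segment is already paired with one gloss, which the weakly annotated setting does not provide directly.

```python
import math
import random


def linear(x, weights):
    """Project vector x with a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def l2_normalize(x):
    """Scale x to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def alignment_loss(video_latents, text_latents):
    """Mean squared Euclidean distance between paired latents.

    Assumed objective for illustration only; the abstract does not
    state which alignment loss the paper uses.
    """
    total = 0.0
    for v, t in zip(video_latents, text_latents):
        total += sum((a - b) ** 2 for a, b in zip(v, t))
    return total / len(video_latents)


def random_matrix(rows, cols, rng):
    """Random weights standing in for learned projection parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
VIDEO_DIM, TEXT_DIM, LATENT_DIM = 8, 5, 4  # hypothetical sizes

# Hypothetical projections from each modality into the joint latent space.
video_proj = random_matrix(LATENT_DIM, VIDEO_DIM, rng)
text_proj = random_matrix(LATENT_DIM, TEXT_DIM, rng)

# Random stand-ins for the outputs of the video and text encoders:
# three (video segment, gloss) pairs.
video_embeddings = [[rng.uniform(-1, 1) for _ in range(VIDEO_DIM)] for _ in range(3)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(TEXT_DIM)] for _ in range(3)]

video_latents = [l2_normalize(linear(e, video_proj)) for e in video_embeddings]
text_latents = [l2_normalize(linear(e, text_proj)) for e in text_embeddings]

loss = alignment_loss(video_latents, text_latents)
print(f"alignment loss: {loss:.4f}")
```

In a trained system the projection weights would be learned so that this loss decreases, pulling each video latent toward the latent of its gloss; the aligned video latents would then be passed to a decoder for classification.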
