Combining deep and unsupervised features for multilingual speech emotion recognition

Roberto Tedesco, Licia Sbattella, Federico Galati, Vincenzo Scotti

Abstract

In this paper we present a Convolutional Neural Network for multilingual emotion recognition from spoken sentences. The purpose of this work was to build a model capable of recognising emotions by combining textual and acoustic information, and compatible with multiple languages. The model we derive has an end-to-end deep architecture: it takes raw text and audio data and uses convolutional layers to extract a hierarchy of classification features. Moreover, we show how the trained model achieves good performance across languages thanks to the use of multilingual unsupervised textual features. It is also worth mentioning that our solution does not require text and audio to be word- or phoneme-aligned. The proposed model, PATHOSnet, was trained and evaluated on multiple corpora in different spoken languages (IEMOCAP, EmoFilm, SES and AESI). Before training, we tuned the hyper-parameters solely on the IEMOCAP corpus, which offers realistic audio recordings and transcriptions of sentences with emotional content in English. The final model provides state-of-the-art performance on some of the selected data sets for the four considered emotions.
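The architecture described above — two convolutional branches over unaligned text and audio, fused for classification — can be sketched minimally in NumPy. All dimensions, kernel sizes, and the single-layer depth below are illustrative placeholders, not the paper's actual hyper-parameters; PATHOSnet itself is a trained deep network, whereas this is only an untrained structural sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, w):
    """Valid 1D convolution over time with ReLU.
    x: (T, d_in) sequence, w: (k, d_in, d_out) kernel."""
    k, d_in, d_out = w.shape
    T = x.shape[0] - k + 1
    out = np.empty((T, d_out))
    for t in range(T):
        out[t] = np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)

def branch(x, w):
    """Conv + global max-pooling -> fixed-size feature vector,
    so branches of different lengths need no alignment."""
    return conv1d_relu(x, w).max(axis=0)

# Hypothetical inputs: 20 tokens of 300-dim multilingual embeddings,
# 500 frames of 40-dim acoustic features (dimensions are assumptions).
text = rng.standard_normal((20, 300))
audio = rng.standard_normal((500, 40))

w_text = rng.standard_normal((3, 300, 64)) * 0.01   # text kernel, width 3
w_audio = rng.standard_normal((5, 40, 64)) * 0.01   # audio kernel, width 5
w_out = rng.standard_normal((128, 4)) * 0.01        # 4 emotion classes

# Late fusion: concatenate pooled branch features, then classify.
fused = np.concatenate([branch(text, w_text), branch(audio, w_audio)])
logits = fused @ w_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # (4,)
```

Because each branch is pooled over its own time axis before fusion, the text and audio streams can have arbitrary, unrelated lengths — mirroring the paper's point that no word- or phoneme-level alignment is required.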

Benchmarks

Benchmark: multimodal-emotion-recognition-on-iemocap-4
Methodology: PATHOSnet v2
Metrics: Accuracy: 80.4, F1: 78
