ASR-based Features for Emotion Recognition: A Transfer Learning Approach

Noé Tits, Kevin El Haddad, Thierry Dutoit


Abstract
During the last decade, the applications of signal processing have drastically improved with deep learning. However, areas of affective computing such as emotional speech synthesis or emotion recognition from spoken language remain challenging. In this paper, we investigate the use of a neural Automatic Speech Recognition (ASR) system as a feature extractor for emotion recognition. We show that these features outperform the eGeMAPS feature set in predicting the valence and arousal emotional dimensions, which means that the audio-to-text mapping learned by the ASR system contains information related to the emotional dimensions in spontaneous speech. We also examine the relationship between the first layers (closer to speech) and the last layers (closer to text) of the ASR system and valence/arousal.
Anthology ID:
W18-3307
Volume:
Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML)
Month:
July
Year:
2018
Address:
Melbourne, Australia
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
48–52
URL:
https://aclanthology.org/W18-3307
DOI:
10.18653/v1/W18-3307
Cite (ACL):
Noé Tits, Kevin El Haddad, and Thierry Dutoit. 2018. ASR-based Features for Emotion Recognition: A Transfer Learning Approach. In Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), pages 48–52, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
ASR-based Features for Emotion Recognition: A Transfer Learning Approach (Tits et al., ACL 2018)
PDF:
https://preview.aclanthology.org/ingestion-script-update/W18-3307.pdf
Data
IEMOCAP