The MERSA Dataset and a Transformer-Based Approach for Speech Emotion Recognition

Enshi Zhang, Rafael Trujillo, Christian Poellabauer


Abstract
Research in speech emotion recognition (SER) relies on comprehensive datasets for designing accurate emotion detection models. This study introduces the Multimodal Emotion Recognition and Sentiment Analysis (MERSA) dataset, which includes natural and scripted speech recordings, transcribed text, physiological data, and self-reported emotional surveys from 150 participants, collected over a two-week period. This work also presents a novel transformer-based emotion recognition approach that integrates pre-trained wav2vec 2.0 and BERT for feature extraction, with additional LSTM layers that learn hidden representations from the fused speech and text representations. Our model predicts emotions along the dimensions of arousal, valence, and dominance. We trained and evaluated the model on the MSP-PODCAST dataset, and the best-performing model achieved competitive results in terms of the concordance correlation coefficient (CCC). Further, this paper demonstrates the effectiveness of the model through cross-domain evaluations on both the IEMOCAP and MERSA datasets.
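The abstract reports results as a concordance correlation coefficient (CCC). As a minimal sketch of how that metric is commonly computed (Lin's standard definition, not necessarily the authors' exact implementation), the function below scores one emotion dimension at a time. The function name `concordance_cc` and the sample values are assumptions made for illustration only.

```python
from statistics import fmean


def concordance_cc(predictions, labels):
    """Concordance correlation coefficient between two equal-length sequences.

    Standard definition with population (biased) variance and covariance:
        CCC = 2 * cov(p, l) / (var(p) + var(l) + (mean(p) - mean(l)) ** 2)
    """
    if len(predictions) != len(labels) or len(predictions) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(predictions)
    mean_p = fmean(predictions)
    mean_l = fmean(labels)

    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_l = sum((l - mean_l) ** 2 for l in labels) / n
    cov = sum((p - mean_p) * (l - mean_l)
              for p, l in zip(predictions, labels)) / n

    denom = var_p + var_l + (mean_p - mean_l) ** 2
    if denom == 0:
        # Both sequences are constant and identical; CCC is undefined here.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denom


# Hypothetical arousal scores, for illustration only.
perfect = concordance_cc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = concordance_cc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Unlike Pearson correlation, CCC penalizes a systematic offset: the shifted example is perfectly correlated with its labels but scores 4/7 rather than 1. For a model that predicts arousal, valence, and dominance, the metric would be computed separately for each dimension.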
Anthology ID:
2024.acl-long.752
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
13960–13970
URL:
https://aclanthology.org/2024.acl-long.752
DOI:
10.18653/v1/2024.acl-long.752
Cite (ACL):
Enshi Zhang, Rafael Trujillo, and Christian Poellabauer. 2024. The MERSA Dataset and a Transformer-Based Approach for Speech Emotion Recognition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13960–13970, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
The MERSA Dataset and a Transformer-Based Approach for Speech Emotion Recognition (Zhang et al., ACL 2024)
PDF:
https://preview.aclanthology.org/add_acl24_videos/2024.acl-long.752.pdf
Video:
https://preview.aclanthology.org/add_acl24_videos/2024.acl-long.752.mp4