CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French
AmirAli Bagher Zadeh, Yansheng Cao, Simon Hessner, Paul Pu Liang, Soujanya Poria, Louis-Philippe Morency
Abstract
Modeling multimodal language is a core research area in natural language processing. While languages such as English have relatively large multimodal language resources, other widely spoken languages across the globe have few or no large-scale datasets in this area. This disproportionately affects native speakers of languages other than English. As a step towards building more equitable and inclusive multimodal systems, we introduce the first large-scale multimodal language dataset for Spanish, Portuguese, German and French. The proposed dataset, called CMU-MOSEAS (CMU Multimodal Opinion Sentiment, Emotions and Attributes), is the largest of its kind with 40,000 total labelled sentences. It covers a diverse set of topics and speakers, and carries supervision of 20 labels including sentiment (and subjectivity), emotions, and attributes. Our evaluations on a state-of-the-art multimodal model demonstrate that CMU-MOSEAS enables further research for multilingual studies in multimodal language.
- Anthology ID:
- 2020.emnlp-main.141
- Volume:
- Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Editors:
- Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1801–1812
- URL:
- https://aclanthology.org/2020.emnlp-main.141
- DOI:
- 10.18653/v1/2020.emnlp-main.141
- Cite (ACL):
- AmirAli Bagher Zadeh, Yansheng Cao, Simon Hessner, Paul Pu Liang, Soujanya Poria, and Louis-Philippe Morency. 2020. CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1801–1812, Online. Association for Computational Linguistics.
- Cite (Informal):
- CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French (Bagher Zadeh et al., EMNLP 2020)
- PDF:
- https://preview.aclanthology.org/landing_page/2020.emnlp-main.141.pdf