A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

Bharathi Raja Chakravarthi; Navya Jose; Shardul Suryawanshi; Elizabeth Sherly; John Philip McCrae

Abstract

There is an increasing demand for sentiment analysis of social media text, which is mostly code-mixed. Systems trained on monolingual data fail on code-mixed data because of the complexity of mixing at different levels of the text. However, very few resources are available for building models specific to code-mixed data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets are available, for popular language pairs such as English-Spanish, English-Hindi, and English-Chinese. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed Malayalam-English text, annotated by voluntary annotators. The corpus obtained a Krippendorff’s alpha above 0.8. We use this new corpus to provide a benchmark for sentiment analysis of Malayalam-English code-mixed text.
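As a rough illustration of the agreement measure mentioned in the abstract, the sketch below computes Krippendorff's alpha for nominal labels from a coincidence matrix. It is not taken from the paper: the function name `krippendorff_alpha_nominal`, the toy labels, and the choice of the nominal distance metric are assumptions made for this example, and the abstract does not state which variant of alpha the authors used.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (hypothetical helper).

    units: one inner list per annotated item, holding the labels given by
    the annotators; None marks a missing annotation.
    """
    # Coincidence matrix: each ordered pair of labels from two different
    # annotators of the same item contributes 1 / (m - 1), where m is the
    # number of labels that item actually received.
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two labels carry no pairing information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Nominal metric: any pair of different labels counts as a disagreement.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is ever used")
    return 1.0 - observed / expected


# Toy data, invented for this example: four items, two annotators each.
toy_units = [
    ["positive", "positive"],
    ["positive", "negative"],
    ["negative", "negative"],
    ["negative", "negative"],
]
alpha = krippendorff_alpha_nominal(toy_units)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so a reported alpha above 0.8 points to highly consistent annotation.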

Anthology ID: 2020.sltu-1.25
Volume: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Month: May
Year: 2020
Address: Marseille, France
Editors: Dorothee Beermann, Laurent Besacier, Sakriani Sakti, Claudia Soria
Venue: SLTU
Publisher: European Language Resources Association
Pages: 177–184
Language: English
URL: https://aclanthology.org/2020.sltu-1.25
Cite (ACL): Bharathi Raja Chakravarthi, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly, and John Philip McCrae. 2020. A Sentiment Analysis Dataset for Code-Mixed Malayalam-English. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 177–184, Marseille, France. European Language Resources Association.
Cite (Informal): A Sentiment Analysis Dataset for Code-Mixed Malayalam-English (Chakravarthi et al., SLTU 2020)
PDF: https://preview.aclanthology.org/teach-a-man-to-fish/2020.sltu-1.25.pdf
Code: bharathichezhiyan/MalayalamMixSentiment
Data: MalayalamMixSentiment
