IndiSentiment140: Sentiment Analysis Dataset for Indian Languages with Emphasis on Low-Resource Languages using Machine Translation

Saurabh Kumar; Ranbir Sanasam; Sukumar Nandi

doi:10.18653/v1/2024.naacl-long.425

Abstract

Sentiment analysis, a fundamental aspect of Natural Language Processing (NLP), involves the classification of emotions, opinions, and attitudes in text data. In the context of India, with its vast linguistic diversity and low-resource languages, the challenge is to support sentiment analysis in numerous Indian languages. This study explores the use of machine translation to bridge this gap. The investigation examines the feasibility of machine translation for creating sentiment analysis datasets in 22 Indian languages. Google Translate, with its extensive language support, is employed for this purpose in translating the Sentiment140 dataset. The study aims to provide insights into the practicality of using machine translation in the context of India’s linguistic diversity for sentiment analysis datasets. Our findings indicate that a dataset generated using Google Translate has the potential to serve as a foundational framework for tackling the low-resource challenges commonly encountered in sentiment analysis for Indian languages.

Anthology ID: 2024.naacl-long.425
Volume: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 7689–7698
URL: https://aclanthology.org/2024.naacl-long.425
DOI: 10.18653/v1/2024.naacl-long.425
Cite (ACL): Saurabh Kumar, Ranbir Sanasam, and Sukumar Nandi. 2024. IndiSentiment140: Sentiment Analysis Dataset for Indian Languages with Emphasis on Low-Resource Languages using Machine Translation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7689–7698, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): IndiSentiment140: Sentiment Analysis Dataset for Indian Languages with Emphasis on Low-Resource Languages using Machine Translation (Kumar et al., NAACL 2024)
PDF: https://preview.aclanthology.org/nschneid-patch-4/2024.naacl-long.425.pdf
