A Sentiment Corpus for South African Under-Resourced Languages in a Multilingual Context

Ronny Mabokela, Tim Schlippe


Abstract
Multilingual sentiment analysis is the process of detecting and classifying sentiment in text written in multiple languages. Research on high-resourced languages such as English has advanced tremendously. However, under-resourced languages remain underrepresented, with limited opportunities for further development of natural language processing (NLP) technologies. Sentiment analysis (SA) for under-resourced languages is still a skewed research area. Although there are considerable efforts in emerging African countries to develop such resources, indigenous South African languages still suffer from a lack of datasets. To the best of our knowledge, there is currently no dataset dedicated to SA research for South African languages in a multilingual context, i.e. one in which comments are written in different languages and may contain code-switching. In this paper, we present the first subset of the multilingual sentiment corpus SAfriSenti for the three most widely spoken languages in South Africa: English, Sepedi (i.e. Northern Sotho), and Setswana. This subset consists of over 40,000 annotated tweets across the three languages, 36.6% of which are code-switched. We present the data collection, cleaning, and annotation strategies that we followed to curate the dataset for these languages. Furthermore, we describe how we developed language-specific sentiment lexicons and morpheme-based sentiment taggers, report linguistic analyses, and present possible solutions to the challenges of this sentiment dataset. We will release the dataset and sentiment lexicons to the research community to advance NLP research on under-resourced languages.
Anthology ID:
2022.sigul-1.9
Volume:
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Maite Melero, Sakriani Sakti, Claudia Soria
Venue:
SIGUL
SIG:
SIGUL
Publisher:
European Language Resources Association
Pages:
70–77
URL:
https://aclanthology.org/2022.sigul-1.9
Cite (ACL):
Ronny Mabokela and Tim Schlippe. 2022. A Sentiment Corpus for South African Under-Resourced Languages in a Multilingual Context. In Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, pages 70–77, Marseille, France. European Language Resources Association.
Cite (Informal):
A Sentiment Corpus for South African Under-Resourced Languages in a Multilingual Context (Mabokela & Schlippe, SIGUL 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2022.sigul-1.9.pdf