Abstract
Social media has become a crucial open-access platform enabling individuals to freely express opinions and share experiences. These platforms contain user-generated content facilitating instantaneous communication and feedback. However, leveraging low-resource language data from Twitter can be challenging due to the scarcity and poor quality of content with significant variations in language use, such as slang and code-switching. Automatically identifying tweets in low-resource languages can also be challenging because Twitter primarily supports high-resource languages; low-resource languages often lack robust linguistic and contextual support. This paper analyzes Kenyan code-switched data from Twitter using four transformer-based pretrained models for sentiment and emotion classification tasks using supervised and semi-supervised methods. We detail the methodology behind data collection, the annotation procedure, and the challenges encountered during the data curation phase. Our results show that XLM-R outperforms other models; for sentiment analysis, XLM-R supervised model achieves the highest accuracy (69.2%) and F1 score (66.1%), XLM-R semi-supervised (67.2% accuracy, 64.1% F1 score). In emotion analysis, DistilBERT supervised leads in accuracy (59.8%) and F1 score (31%), mBERT semi-supervised (accuracy (59% and F1 score 26.5%). AfriBERTa models show the lowest accuracy and F1 scores. This indicates that the semi-supervised method’s performance is constrained by the small labeled dataset.- Anthology ID:
- 2024.wassa-1.19
- Volume:
- Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Orphée De Clercq, Valentin Barriere, Jeremy Barnes, Roman Klinger, João Sedoc, Shabnam Tafreshi
- Venues:
- WASSA | WS
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 234–249
- Language:
- URL:
- https://aclanthology.org/2024.wassa-1.19
- DOI:
- Cite (ACL):
- Naome Etori and Maria Gini. 2024. RideKE: Leveraging Low-resource Twitter User-generated Content for Sentiment and Emotion Detection on Code-switched RHS Dataset.. In Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, pages 234–249, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- RideKE: Leveraging Low-resource Twitter User-generated Content for Sentiment and Emotion Detection on Code-switched RHS Dataset. (Etori & Gini, WASSA-WS 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.wassa-1.19.pdf