Subasa - Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala
Shanilka Haturusinghe, Tharindu Cyril Weerasooriya, Christopher M Homan, Marcos Zampieri, Sidath Ravindra Liyanage
Abstract
Accurate detection of offensive language is essential for a number of applications related to social media safety. There is a sharp contrast in performance in this task between low- and high-resource languages. In this paper, we adapt fine-tuning strategies that have not been previously explored for Sinhala in the downstream task of offensive language detection. Using this approach, we introduce four models: “Subasa-XLM-R”, which incorporates an intermediate Pre-Finetuning step using Masked Rationale Prediction, and two variants of “Subasa-Llama” and “Subasa-Mistral”, which are fine-tuned versions of Llama (3.2) and Mistral (v0.3), respectively, with a task-specific strategy. We evaluate our models on the SOLD benchmark dataset for Sinhala offensive language detection. All our models outperform existing baselines. Subasa-XLM-R achieves the highest Macro F1 score (0.84), surpassing state-of-the-art large language models like GPT-4o when evaluated on the same SOLD benchmark dataset under zero-shot settings. The models and code are publicly available.
- Anthology ID:
- 2025.naacl-srw.26
- Volume:
- Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
- Month:
- April
- Year:
- 2025
- Address:
- Albuquerque, USA
- Editors:
- Abteen Ebrahimi, Samar Haider, Emmy Liu, Sammar Haider, Maria Leonor Pacheco, Shira Wein
- Venues:
- NAACL | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 260–270
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2025.naacl-srw.26/
- Cite (ACL):
- Shanilka Haturusinghe, Tharindu Cyril Weerasooriya, Christopher M Homan, Marcos Zampieri, and Sidath Ravindra Liyanage. 2025. Subasa - Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 260–270, Albuquerque, USA. Association for Computational Linguistics.
- Cite (Informal):
- Subasa - Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala (Haturusinghe et al., NAACL 2025)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2025.naacl-srw.26.pdf