Subasa - Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala

Shanilka Haturusinghe; Tharindu Cyril Weerasooriya; Christopher M. Homan; Marcos Zampieri; Sidath Ravindra Liyanage

Abstract

Accurate detection of offensive language is essential for a number of applications related to social media safety. There is a sharp contrast in performance on this task between low- and high-resource languages. In this paper, we adapt fine-tuning strategies that have not been previously explored for Sinhala in the downstream task of offensive language detection. Using this approach, we introduce four models: “Subasa-XLM-R”, which incorporates an intermediate pre-finetuning step using Masked Rationale Prediction, and two variants of “Subasa-Llama” and “Subasa-Mistral”, which are fine-tuned versions of Llama (3.2) and Mistral (v0.3), respectively, with a task-specific strategy. We evaluate our models on the SOLD benchmark dataset for Sinhala offensive language detection. All our models outperform existing baselines. Subasa-XLM-R achieves the highest Macro F1 score (0.84), surpassing state-of-the-art large language models like GPT-4o when evaluated on the same SOLD benchmark dataset under zero-shot settings. The models and code are publicly available.

Anthology ID: 2025.naacl-srw.26
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Month: April
Year: 2025
Address: Albuquerque, USA
Editors: Abteen Ebrahimi, Samar Haider, Emmy Liu, Sammar Haider, Maria Leonor Pacheco, Shira Wein
Venues: NAACL | WS
Publisher: Association for Computational Linguistics
Pages: 260–270
URL: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-srw.26/
Cite (ACL): Shanilka Haturusinghe, Tharindu Cyril Weerasooriya, Christopher M Homan, Marcos Zampieri, and Sidath Ravindra Liyanage. 2025. Subasa - Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 260–270, Albuquerque, USA. Association for Computational Linguistics.
Cite (Informal): Subasa - Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala (Haturusinghe et al., NAACL 2025)
PDF: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-srw.26.pdf
