Nadeesha Chathurangi Naradde Vidana Pathirana
2025
Sinhala Encoder-only Language Models and Evaluation
Tharindu Ranasinghe | Hansi Hettiarachchi | Nadeesha Chathurangi Naradde Vidana Pathirana | Damith Premasiri | Lasitha Uyangodage | Isuri Nanomi Arachchige | Alistair Plum | Paul Rayson | Ruslan Mitkov
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recently, language models (LMs) have produced excellent results in many natural language processing (NLP) tasks. However, their effectiveness depends heavily on the available pre-training resources, a dependence that is particularly challenging for low-resource languages such as Sinhala. The scarcity of benchmarks for evaluating LMs is a further concern for low-resource languages. In this paper, we address these two challenges for Sinhala by (i) collecting the largest monolingual corpus for Sinhala, (ii) training multiple LMs on this corpus, and (iii) compiling the first Sinhala NLP benchmark (Sinhala-GLUE) and evaluating LMs on it. We show that the Sinhala LMs trained in this paper outperform both popular multilingual LMs, such as XLM-R, and existing Sinhala LMs on downstream NLP tasks. All the trained LMs are publicly available. We also release Sinhala-GLUE as a public leaderboard, and we hope that it will enable further advancements in developing and evaluating LMs for Sinhala.