SabiYarn: Advancing Low Resource Languages with Multitask NLP Pretraining

Oduguwa Damilola John, Jeffrey Otoibhi, David Okpare


Abstract
The rapid advancement of large language models (LLMs) has revolutionized natural language processing, yet a significant challenge persists: the underrepresentation of low-resource languages. This paper introduces SabiYarn, a novel 125M-parameter decoder-only language model specifically designed to address this gap for Nigerian languages. Our research demonstrates that a relatively small language model can achieve remarkable performance across multiple languages, even in a low-resource setting, when trained on carefully curated task-specific datasets. We introduce a multitask learning framework designed for computational efficiency, leveraging techniques such as sequence packing to maximize token throughput per batch. This allows SabiYarn to make the most of a limited compute budget while achieving strong performance across multiple NLP tasks. This paper not only highlights the effectiveness of our approach but also challenges the notion that only massive models can achieve high performance in diverse linguistic contexts: SabiYarn outperforms models over 100 times its parameter size on specific tasks such as translation (in both directions), Named Entity Recognition, Text Diacritization, and Sentiment Analysis in the low-resource languages it was trained on. SabiYarn-125M represents a significant step towards democratizing NLP technologies for low-resource languages, offering a blueprint for developing efficient, high-performing models tailored to specific linguistic regions. Our work paves the way for more inclusive and culturally sensitive AI systems, potentially transforming how language technologies are developed and deployed in linguistically diverse areas like Nigeria and beyond.
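The abstract mentions sequence packing as the technique used to maximize token throughput per batch. The paper's exact implementation is not shown on this page; the following is a minimal, hypothetical Python sketch of the general idea: tokenized examples are concatenated with an end-of-sequence separator and sliced into fixed-length blocks so that every position in a training batch carries a real token. The token id, block size, and function name below are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of sequence packing (hypothetical, not the paper's implementation).
from typing import Iterable, List

EOS_ID = 0          # assumed end-of-sequence token id
BLOCK_SIZE = 1024   # assumed context length

def pack_sequences(tokenized_examples: Iterable[List[int]],
                   block_size: int = BLOCK_SIZE,
                   eos_id: int = EOS_ID) -> List[List[int]]:
    """Concatenate token-id lists with EOS separators and cut the stream
    into fixed-size blocks; a trailing partial block is dropped."""
    buffer: List[int] = []
    blocks: List[List[int]] = []
    for ids in tokenized_examples:
        buffer.extend(ids + [eos_id])
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks

# Example: pack three short "documents" into 8-token blocks.
if __name__ == "__main__":
    docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13, 14]]
    print(pack_sequences(docs, block_size=8))
```

Packing this way avoids padding waste, which is what makes the limited compute budget stretch further, as the abstract notes.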
Anthology ID:
2025.africanlp-1.14
Volume:
Proceedings of the Sixth Workshop on African Natural Language Processing (AfricaNLP 2025)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Constantine Lignos, Idris Abdulmumin, David Adelani
Venues:
AfricaNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
95–107
URL:
https://preview.aclanthology.org/display_plenaries/2025.africanlp-1.14/
Cite (ACL):
Oduguwa Damilola John, Jeffrey Otoibhi, and David Okpare. 2025. SabiYarn: Advancing Low Resource Languages with Multitask NLP Pretraining. In Proceedings of the Sixth Workshop on African Natural Language Processing (AfricaNLP 2025), pages 95–107, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
SabiYarn: Advancing Low Resource Languages with Multitask NLP Pretraining (John et al., AfricaNLP 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.africanlp-1.14.pdf