Oduguwa Damilola John


2025

SabiYarn: Advancing Low Resource Languages with Multitask NLP Pretraining
Oduguwa Damilola John | Jeffrey Otoibhi | David Okpare
Proceedings of the Sixth Workshop on African Natural Language Processing (AfricaNLP 2025)

The rapid advancement of large language models (LLMs) has revolutionized natural language processing, yet a significant challenge persists: the under-representation of low-resource languages. This paper introduces SabiYarn, a novel 125M-parameter decoder-only language model specifically designed to address this gap for Nigerian languages. Our research demonstrates that a relatively small language model can achieve remarkable performance across multiple languages, even in a low-resource setting, when trained on carefully curated task-specific datasets. We introduce a multitask learning framework designed for computational efficiency, leveraging techniques such as sequence packing to maximize token throughput per batch. This allows SabiYarn to make the most of a limited compute budget while achieving strong performance across multiple NLP tasks. This paper not only highlights the effectiveness of our approach but also challenges the notion that only massive models can achieve high performance in diverse linguistic contexts: SabiYarn outperforms models over 100 times its parameter size on specific tasks such as translation (in both directions), Named Entity Recognition, Text Diacritization, and Sentiment Analysis in the low-resource languages it was trained on. SabiYarn-125M represents a significant step towards democratizing NLP technologies for low-resource languages, offering a blueprint for developing efficient, high-performing models tailored to specific linguistic regions. Our work paves the way for more inclusive and culturally sensitive AI systems, potentially transforming how language technologies are developed and deployed in linguistically diverse areas like Nigeria and beyond.
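The abstract credits sequence packing with maximizing token throughput per batch under a limited compute budget. As a rough illustration of the general idea only, not the authors' implementation, the sketch below greedily concatenates EOS-terminated tokenized examples into fixed-length blocks so that no batch positions are wasted on padding; the function name, block size, and EOS token id are illustrative assumptions.

```python
from typing import Iterable, Iterator, List

def pack_sequences(
    tokenized_examples: Iterable[List[int]],
    block_size: int = 1024,
    eos_token_id: int = 0,
) -> Iterator[List[int]]:
    """Greedily concatenate tokenized examples (each terminated by EOS)
    into fixed-length blocks so every position holds a real token
    rather than padding."""
    buffer: List[int] = []
    for tokens in tokenized_examples:
        buffer.extend(tokens + [eos_token_id])
        # Emit full blocks as soon as the buffer is long enough.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Leftover tokens shorter than block_size are dropped in this sketch;
    # a real pipeline might pad them or carry them into the next shard.

# Example: pack three short "documents" into 8-token blocks.
if __name__ == "__main__":
    docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13, 14]]
    for block in pack_sequences(docs, block_size=8, eos_token_id=2):
        print(block)
```

In practice, packing is typically paired with a causal attention mask or document-boundary handling so tokens from one example do not attend across the EOS into the next; the paper should be consulted for how SabiYarn handles this.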