BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context
Alexis Matzopoulos, Charl Hendriks, Hishaam Mahomed, Francois Meyer
Abstract
The BabyLM challenge called on participants to develop sample-efficient language models. Submissions were pretrained on a fixed English corpus, limited to the number of words children are exposed to during development (<100m). The challenge produced new architectures for data-efficient language modelling, which outperformed models trained on trillions of words. This is promising for low-resource languages, where available corpora are limited to much less than 100m words. In this paper, we explore the potential of BabyLMs for low-resource languages, using the isiXhosa language as a case study. We pretrain two BabyLM architectures, ELC-BERT and MLSM, on an isiXhosa corpus. They outperform a vanilla pretrained model on POS tagging and NER, achieving notable gains (+3.2 F1) for the latter. In some instances, the BabyLMs even outperform XLM-R. Our findings show that data-efficient models are viable for low-resource languages, but highlight the continued importance of, and the lack of, high-quality pretraining data. Finally, we visually analyse how BabyLM architectures encode isiXhosa.

- Anthology ID: 2025.loreslm-1.19
- Volume: Proceedings of the First Workshop on Language Models for Low-Resource Languages
- Month: January
- Year: 2025
- Address: Abu Dhabi, United Arab Emirates
- Editors: Hansi Hettiarachchi, Tharindu Ranasinghe, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, Lasitha Uyangodage
- Venues: LoResLM | WS
- Publisher: Association for Computational Linguistics
- Pages: 240–248
- URL: https://preview.aclanthology.org/ingest_wac_2008/2025.loreslm-1.19/
- Cite (ACL): Alexis Matzopoulos, Charl Hendriks, Hishaam Mahomed, and Francois Meyer. 2025. BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context. In Proceedings of the First Workshop on Language Models for Low-Resource Languages, pages 240–248, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Cite (Informal): BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context (Matzopoulos et al., LoResLM 2025)
- PDF: https://preview.aclanthology.org/ingest_wac_2008/2025.loreslm-1.19.pdf