Sub-1B Language Models for Low-Resource Languages: Training Strategies and Insights for Basque

Gorka Urbizu, Ander Corral, Xabier Saralegi, Iñaki San Vicente


Abstract
This work investigates the effectiveness of small autoregressive language models (SLMs) with up to one billion parameters (sub-1B) for natural language processing (NLP) tasks in low-resource languages, focusing on Basque. We analyze optimal training strategies by comparing training from scratch and continual pre-training using state-of-the-art SLM architectures. Our analysis considers factors such as model size and the extent of Basque presence in the pre-training corpus. To assess linguistic capabilities, models are evaluated on 12 NLP tasks using the Harness framework. We also conduct a manual evaluation of fine-tuned models on three downstream natural language generation (NLG) tasks: question answering (QA), summarization, and machine translation (MT). Our findings indicate that continual pre-training of a multilingual SLM substantially enhances linguistic performance compared to training from scratch, particularly in low-resource language settings where available corpora typically contain fewer than one billion words. Additionally, the presence of Basque during pre-training and larger model sizes contribute positively to performance in NLG tasks.
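As a rough illustration of the continual pre-training setup the abstract contrasts with training from scratch, the sketch below adapts a publicly available sub-1B checkpoint on raw Basque text with Hugging Face Transformers. The checkpoint name, corpus path, and hyperparameters are illustrative assumptions, not the authors' configuration.

```python
# Hypothetical sketch: continual pre-training of a small LM on a Basque corpus.
# Checkpoint, data file, and hyperparameters are placeholders, not the paper's setup.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "HuggingFaceTB/SmolLM-360M"   # any sub-1B pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load a raw-text Basque corpus (placeholder path) and tokenize it for causal LM training.
raw = load_dataset("text", data_files={"train": "basque_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_ds = raw.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="slm-eu-continual",
    per_device_train_batch_size=8,
    learning_rate=3e-5,          # lower LR than typical from-scratch pre-training
    num_train_epochs=1,
    bf16=True,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Training from scratch would differ mainly in initializing the model from a config (random weights) and a tokenizer trained on the target-language corpus, rather than loading pre-trained weights.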
Anthology ID:
2025.mrl-main.35
Volume:
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
David Ifeoluwa Adelani, Catherine Arnett, Duygu Ataman, Tyler A. Chang, Hila Gonen, Rahul Raja, Fabian Schmidt, David Stap, Jiayi Wang
Venues:
MRL | WS
Publisher:
Association for Computational Linguistics
Pages:
519–530
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.mrl-main.35/
Cite (ACL):
Gorka Urbizu, Ander Corral, Xabier Saralegi, and Iñaki San Vicente. 2025. Sub-1B Language Models for Low-Resource Languages: Training Strategies and Insights for Basque. In Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025), pages 519–530, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Sub-1B Language Models for Low-Resource Languages: Training Strategies and Insights for Basque (Urbizu et al., MRL 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.mrl-main.35.pdf