Efficient Continual Pre-training of LLMs for Low-resource Languages

Arijit Nag, Soumen Chakrabarti, Animesh Mukherjee, Niloy Ganguly


Abstract
Open-source large language models (Os-LLMs) propel the democratization of natural language research by giving the flexibility to augment or update model parameters for performance improvement. Nevertheless, like proprietary LLMs, Os-LLMs offer poorer performance on low-resource languages (LRLs) than on high-resource languages (HRLs), owing to smaller amounts of training data and underrepresented vocabulary. On the other hand, continual pre-training (CPT) with large amounts of language-specific data is a costly proposition in terms of data acquisition and computational resources. Our goal is to drastically reduce CPT cost. To that end, we first develop a new algorithm to select a subset of texts from a larger corpus. We show the effectiveness of our technique using very little CPT data. In search of further improvement, we design a new algorithm to select tokens to include in the LLM vocabulary. We experiment with the recent Llama-3 model and nine Indian languages with diverse scripts and extents of resource availability. For evaluation, we use IndicGenBench, a generation-task benchmark dataset for Indic languages. We experiment with various CPT corpora and augmented vocabulary sizes and offer insights across language families.
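The abstract does not spell out the paper's text- and token-selection algorithms. Purely as an illustrative sketch of the generic mechanics that vocabulary augmentation before CPT involves (extending the tokenizer and resizing the model's embedding matrix), assuming the Hugging Face `transformers` library; `selected_tokens` is a hypothetical placeholder for whatever a selection criterion would return, not the paper's method.

```python
# Minimal sketch: augment an LLM's vocabulary prior to continual pre-training.
# Assumes Hugging Face `transformers`; the token list below is a hypothetical placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical tokens chosen for a low-resource language by some selection criterion.
selected_tokens = ["নতুন", "টোকেন"]

num_added = tokenizer.add_tokens(selected_tokens)   # extend the tokenizer vocabulary
model.resize_token_embeddings(len(tokenizer))       # grow the embedding matrix to match
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")

# Continual pre-training would then proceed with a standard causal-LM objective on the
# selected corpus subset, e.g. via transformers.Trainer with
# DataCollatorForLanguageModeling(tokenizer, mlm=False).
```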
Anthology ID:
2025.naacl-industry.25
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Weizhu Chen, Yi Yang, Mohammad Kachuee, Xue-Yong Fu
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
304–317
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-industry.25/
Cite (ACL):
Arijit Nag, Soumen Chakrabarti, Animesh Mukherjee, and Niloy Ganguly. 2025. Efficient Continual Pre-training of LLMs for Low-resource Languages. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 304–317, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Efficient Continual Pre-training of LLMs for Low-resource Languages (Nag et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-industry.25.pdf