Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus: A Case Study for Hindi LLMs
Raviraj Joshi, Kanishk Singla, Anusha Kamath, Raunak Kalani, Rakesh Paul, Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar, Eileen Long
Abstract
Multilingual LLMs support a variety of languages; however, their performance is suboptimal for low-resource languages. In this work, we emphasize the importance of continued pre-training of multilingual LLMs and the use of translation-based synthetic pre-training corpora for improving LLMs in low-resource languages. We conduct our study in the context of the low-resource Indic language Hindi. We introduce Nemotron-Mini-Hindi 4B, a bilingual SLM supporting both Hindi and English, based on Nemotron-Mini 4B. The model is trained using a mix of real and synthetic Hindi + English tokens, with continued pre-training performed on 400B tokens. We demonstrate that both the base and instruct models achieve state-of-the-art results on Hindi benchmarks while remaining competitive on English tasks. Additionally, we observe that the continued pre-training approach enhances the model's overall factual accuracy.
- Anthology ID: 2025.indonlp-1.6
- Volume: Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages
- Month: January
- Year: 2025
- Address: Abu Dhabi
- Editors: Ruvan Weerasinghe, Isuri Anuradha, Deshan Sumanathilaka
- Venues: IndoNLP | WS
- Publisher: Association for Computational Linguistics
- Pages: 50–57
- URL: https://preview.aclanthology.org/fix-sig-urls/2025.indonlp-1.6/
- Cite (ACL): Raviraj Joshi, Kanishk Singla, Anusha Kamath, Raunak Kalani, Rakesh Paul, Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar, and Eileen Long. 2025. Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus: A Case Study for Hindi LLMs. In Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages, pages 50–57, Abu Dhabi. Association for Computational Linguistics.
- Cite (Informal): Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus: A Case Study for Hindi LLMs (Joshi et al., IndoNLP 2025)
- PDF: https://preview.aclanthology.org/fix-sig-urls/2025.indonlp-1.6.pdf