Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus: A Case Study for Hindi LLMs
Raviraj Joshi, Kanishk Singla, Anusha Kamath, Raunak Kalani, Rakesh Paul, Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar, Eileen Long
Abstract
Multilingual LLMs support a variety of languages; however, their performance is suboptimal for low-resource languages. In this work, we emphasize the importance of continued pre-training of multilingual LLMs and the use of translation-based synthetic pre-training corpora for improving LLMs in low-resource languages. We conduct our study in the context of the low-resource Indic language Hindi. We introduce Nemotron-Mini-Hindi 4B, a bilingual SLM supporting both Hindi and English, based on Nemotron-Mini 4B. The model is trained using a mix of real and synthetic Hindi + English tokens, with continued pre-training performed on 400B tokens. We demonstrate that both the base and instruct models achieve state-of-the-art results on Hindi benchmarks while remaining competitive on English tasks. Additionally, we observe that the continued pre-training approach enhances the model's overall factual accuracy.
- Anthology ID: 2025.indonlp-1.6
- Volume: Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages
- Month: January
- Year: 2025
- Address: Abu Dhabi
- Editors: Ruvan Weerasinghe, Isuri Anuradha, Deshan Sumanathilaka
- Venues: IndoNLP | WS
- Publisher: Association for Computational Linguistics
- Pages: 50–57
- URL: https://preview.aclanthology.org/fix-sig-urls/2025.indonlp-1.6/
- Cite (ACL): Raviraj Joshi, Kanishk Singla, Anusha Kamath, Raunak Kalani, Rakesh Paul, Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar, and Eileen Long. 2025. Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus: A Case Study for Hindi LLMs. In Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages, pages 50–57, Abu Dhabi. Association for Computational Linguistics.
- Cite (Informal): Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus: A Case Study for Hindi LLMs (Joshi et al., IndoNLP 2025)
- PDF: https://preview.aclanthology.org/fix-sig-urls/2025.indonlp-1.6.pdf