Nanda Family: Open-Weights Generative Large Language Models for Hindi

Aaryamonvikram Singh, Debopriyo Banerjee, Dhruv Sahnan, Monojit Choudhury, Shivam Chauhan, Rocktim Jyoti Das, Xudong Han, Haonan Li, Alok Anil Jadhav, Utkarsh Agarwal, Mukund Choudhary, Fajri Koto, Junaid Hamid Bhat, Awantika Shukla, Samujjwal Ghosh, Samta Kamboj, Onkar Pandit, Lalit Pradhan, Rahul Pal, Sunil Kumar Sahu, Parvez Mullah, Ali El Filali, Zainul Abedien Ahmed Quraishi, Neha Sengupta, Gokulakrishnan Ramakrishnan, Rituraj Joshi, Gurpreet Gosal, Avraham Sheinin, Natalia Vassilieva, Preslav Nakov


Abstract
Large language models remain predominantly English-centric, which limits their utility for underrepresented languages. We help bridge this gap for Hindi with Llama-3-Nanda-10B-Chat (aka Nanda-10B) and Llama-3.1-Nanda-87B-Chat (aka Nanda-87B), forming the Nanda family of open-weight bilingual models (https://github.com/MBZUAI-IFM/Nanda-Family). Our approach integrates: (i) a tokenizer extending Llama’s vocabulary with 20% Hindi-specific tokens, thus halving Hindi tokenization fertility while preserving English efficiency, (ii) Hindi-first parameter-efficient continual pretraining using Llama Pro on a 65B-token corpus spanning Devanagari script, code-mixed, and Romanized Hindi, and (iii) bilingual instruction and safety alignment on a large culturally grounded dataset. The resulting Nanda models outperform open-weight LLMs of comparable size: Nanda-87B yields high generative quality, and Nanda-10B shows competitive general-purpose performance. Nanda-87B demonstrates state-of-the-art performance on summarization, translation, transliteration, and instruction following. Moreover, both models achieve state-of-the-art performance in safety and in cultural knowledge. Our results demonstrate that careful tokenizer design, data curation, and continual pretraining can yield capable and safe LLMs for resource-poor languages without compromising English performance.
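The abstract's claim about "halving Hindi tokenization fertility" refers to a standard metric: the average number of subword tokens produced per whitespace-delimited word. A minimal sketch of how that metric is computed is below; the two toy tokenizers are purely illustrative stand-ins (they are not the actual Llama or Nanda vocabularies, whose behavior is only described in the paper).

```python
# Tokenization "fertility" = average subword tokens per whitespace word.
# Lower fertility means a word needs fewer tokens, so sequences are shorter.

def fertility(tokenize, texts):
    """Average number of tokens per whitespace-delimited word over a corpus."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Toy "base" tokenizer: falls back to one token per character,
# mimicking a vocabulary with poor Hindi coverage.
def base_tokenize(text):
    return [ch for word in text.split() for ch in word]

# Toy "extended" tokenizer: whole words are in-vocabulary,
# mimicking a vocabulary augmented with Hindi-specific tokens.
def extended_tokenize(text):
    return text.split()

corpus = ["नमस्ते दुनिया", "भाषा मॉडल"]
print(fertility(base_tokenize, corpus))      # high fertility (many tokens/word)
print(fertility(extended_tokenize, corpus))  # fertility of exactly 1.0
```

On this toy corpus, the character-level fallback yields a fertility of 5.0 versus 1.0 for the word-level tokenizer, illustrating the kind of reduction the paper reports from adding 20% Hindi-specific tokens.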
Anthology ID:
2026.eacl-long.288
Volume:
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Màrquez
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
6086–6108
URL:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.288/
Cite (ACL):
Aaryamonvikram Singh, Debopriyo Banerjee, Dhruv Sahnan, Monojit Choudhury, Shivam Chauhan, Rocktim Jyoti Das, Xudong Han, Haonan Li, Alok Anil Jadhav, Utkarsh Agarwal, Mukund Choudhary, Fajri Koto, Junaid Hamid Bhat, Awantika Shukla, Samujjwal Ghosh, Samta Kamboj, Onkar Pandit, Lalit Pradhan, Rahul Pal, Sunil Kumar Sahu, Parvez Mullah, Ali El Filali, Zainul Abedien Ahmed Quraishi, Neha Sengupta, Gokulakrishnan Ramakrishnan, Rituraj Joshi, Gurpreet Gosal, Avraham Sheinin, Natalia Vassilieva, and Preslav Nakov. 2026. Nanda Family: Open-Weights Generative Large Language Models for Hindi. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6086–6108, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Nanda Family: Open-Weights Generative Large Language Models for Hindi (Singh et al., EACL 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.288.pdf