Samujjwal Ghosh
2026
Nanda Family: Open-Weights Generative Large Language Models for Hindi
Aaryamonvikram Singh | Debopriyo Banerjee | Dhruv Sahnan | Monojit Choudhury | Shivam Chauhan | Rocktim Jyoti Das | Xudong Han | Haonan Li | Alok Anil Jadhav | Utkarsh Agarwal | Mukund Choudhary | Fajri Koto | Junaid Hamid Bhat | Awantika Shukla | Samujjwal Ghosh | Samta Kamboj | Onkar Pandit | Lalit Pradhan | Rahul Pal | Sunil Kumar Sahu | Parvez Mullah | Ali El Filali | Zainul Abedien Ahmed Quraishi | Neha Sengupta | Gokulakrishnan Ramakrishnan | Rituraj Joshi | Gurpreet Gosal | Avraham Sheinin | Natalia Vassilieva | Preslav Nakov
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Aaryamonvikram Singh | Debopriyo Banerjee | Dhruv Sahnan | Monojit Choudhury | Shivam Chauhan | Rocktim Jyoti Das | Xudong Han | Haonan Li | Alok Anil Jadhav | Utkarsh Agarwal | Mukund Choudhary | Fajri Koto | Junaid Hamid Bhat | Awantika Shukla | Samujjwal Ghosh | Samta Kamboj | Onkar Pandit | Lalit Pradhan | Rahul Pal | Sunil Kumar Sahu | Parvez Mullah | Ali El Filali | Zainul Abedien Ahmed Quraishi | Neha Sengupta | Gokulakrishnan Ramakrishnan | Rituraj Joshi | Gurpreet Gosal | Avraham Sheinin | Natalia Vassilieva | Preslav Nakov
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models remain predominantly English-centric, which limits their utility for underrepresented languages. We help bridge this gap for Hindi with Llama-3-Nanda-10B-Chat (aka Nanda-10B) and Llama-3.1-Nanda-87B-Chat (aka Nanda-87B), forming the Nanda family of open-weight bilingual models (https://github.com/MBZUAI-IFM/Nanda-Family). Our approach integrates: (i) a tokenizer extending Llama’s vocabulary with 20% Hindi-specific tokens, thus halving Hindi tokenization fertility while preserving English efficiency, (ii) Hindi-first parameter-efficient continual pretraining using Llama Pro on a 65B-token corpus spanning Devanagari script, code-mixed, and Romanized Hindi, and (iii) bilingual instruction and safety alignment on a large culturally grounded dataset. The resulting Nanda models outperform open-weight LLMs of comparable size: Nanda-87B yields high generative quality, and Nanda-10B shows competitive general-purpose performance. Nanda-87B demonstrates state-of-the-art performance on summarization, translation, transliteration, and instruction following. Moreover, both models achieve state-of-the-art performance in safety and in cultural knowledge. Our results demonstrate that careful tokenizer design, data curation, and continual pretraining can yield capable and safe LLMs for resource-poor languages without compromising English performance.
Search
Fix author
Co-authors
- Utkarsh Agarwal 1
- Debopriyo Banerjee 1
- Junaid Hamid Bhat 1
- Shivam Chauhan 1
- Mukund Choudhary 1
- Monojit Choudhury 1
- Rocktim Jyoti Das 1
- Ali El Filali 1
- Gurpreet Gosal 1
- Xudong Han 1
- Alok Anil Jadhav 1
- Rituraj Joshi 1
- Samta Kamboj 1
- Fajri Koto 1
- Haonan Li 1
- Parvez Mullah 1
- Preslav Nakov 1
- Rahul Pal 1
- Onkar Arun Pandit 1
- Lalit Pradhan 1
- Zainul Abedien Ahmed Quraishi 1
- Gokulakrishnan Ramakrishnan 1
- Dhruv Sahnan 1
- Sunil Kumar Sahu 1
- Neha Sengupta 1
- Avraham Sheinin 1
- Awantika Shukla 1
- Aaryamonvikram Singh 1
- Natalia Vassilieva 1
Venues
- EACL1