Mukund Choudhary
2026
Nanda Family: Open-Weights Generative Large Language Models for Hindi
Aaryamonvikram Singh | Debopriyo Banerjee | Dhruv Sahnan | Monojit Choudhury | Shivam Chauhan | Rocktim Jyoti Das | Xudong Han | Haonan Li | Alok Anil Jadhav | Utkarsh Agarwal | Mukund Choudhary | Fajri Koto | Junaid Hamid Bhat | Awantika Shukla | Samujjwal Ghosh | Samta Kamboj | Onkar Pandit | Lalit Pradhan | Rahul Pal | Sunil Kumar Sahu | Parvez Mullah | Ali El Filali | Zainul Abedien Ahmed Quraishi | Neha Sengupta | Gokulakrishnan Ramakrishnan | Rituraj Joshi | Gurpreet Gosal | Avraham Sheinin | Natalia Vassilieva | Preslav Nakov
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models remain predominantly English-centric, which limits their utility for underrepresented languages. We help bridge this gap for Hindi with Llama-3-Nanda-10B-Chat (aka Nanda-10B) and Llama-3.1-Nanda-87B-Chat (aka Nanda-87B), forming the Nanda family of open-weight bilingual models (https://github.com/MBZUAI-IFM/Nanda-Family). Our approach integrates: (i) a tokenizer extending Llama’s vocabulary with 20% Hindi-specific tokens, thus halving Hindi tokenization fertility while preserving English efficiency, (ii) Hindi-first parameter-efficient continual pretraining using Llama Pro on a 65B-token corpus spanning Devanagari script, code-mixed, and Romanized Hindi, and (iii) bilingual instruction and safety alignment on a large culturally grounded dataset. The resulting Nanda models outperform open-weight LLMs of comparable size: Nanda-87B yields high generative quality, and Nanda-10B shows competitive general-purpose performance. Nanda-87B demonstrates state-of-the-art performance on summarization, translation, transliteration, and instruction following. Moreover, both models achieve state-of-the-art performance in safety and in cultural knowledge. Our results demonstrate that careful tokenizer design, data curation, and continual pretraining can yield capable and safe LLMs for resource-poor languages without compromising English performance.
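The fertility claim above can be made concrete with a small measurement script. The sketch below is a hypothetical illustration, assuming the Hugging Face transformers library; it uses the public gpt2 and google/mt5-small tokenizers purely as stand-ins (not the Nanda or Llama tokenizers) and computes fertility as subword tokens per whitespace-separated word.

```python
# Illustrative sketch of the tokenization-fertility metric mentioned above:
# fertility = subword tokens per whitespace word. "gpt2" and "google/mt5-small"
# are public stand-in tokenizers, not the Nanda or Llama ones.
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of subword tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

hindi = ["भाषा मॉडल हिंदी पाठ को कैसे विभाजित करते हैं?"]
english = ["How do language models segment English text?"]

for name in ["gpt2", "google/mt5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: Hindi fertility = {fertility(tok, hindi):.2f}, "
          f"English fertility = {fertility(tok, english):.2f}")
```

A lower Hindi fertility at comparable English fertility is the effect the extended vocabulary is aiming for: fewer subword pieces per Hindi word means shorter sequences and cheaper inference.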
Do LLMs model human linguistic variation? A case study in Hindi-English Verb code-mixing
Mukund Choudhary | Madhur Jindal | Gaurja Aeron | Monojit Choudhury
Findings of the Association for Computational Linguistics: EACL 2026
Do large language models (LLMs) model linguistic variation? We investigate this question through Hindi-English (Hinglish) verb code-mixing, where speakers can use either a Hindi verb or an English verb with the light verb karna ('do'). Both forms are grammatical, but speakers show unexplained variation in which language they choose for the verb. We compare human preferences on controlled code-mixed minimal pairs with perplexities from LLMs spanning different families, sizes, and training language compositions. We find that current LLMs do not reliably predict verb-language preferences that match native speaker judgments. We also find that, with specific supervision, some models do predict human preferences to an extent. We release native speaker acceptability judgments on 30 verb pairs, perplexity ratios for 4,279 verb pairs across 7 models, and our experimental materials.
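As a rough, hypothetical illustration of the perplexity-ratio comparison described above (not the paper's released code), the sketch below scores both members of a Hinglish minimal pair with a causal LM via Hugging Face transformers; gpt2 stands in for the surveyed models and the example pair is invented.

```python
# Illustrative sketch: compare perplexities of a Hindi-verb vs. English-verb
# variant of the same Hinglish sentence. "gpt2" is only a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

hindi_verb = "maine use bulaya"        # Hindi verb
english_verb = "maine use call kiya"   # English verb + light verb karna
ratio = perplexity(lm, tok, english_verb) / perplexity(lm, tok, hindi_verb)
print("PPL ratio (English-verb / Hindi-verb):", round(ratio, 3))
```

A ratio near 1 would mean the model treats both variants as equally likely; the question the paper asks is whether such ratios track which variant native speakers actually prefer.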
2023
CoPara: The First Dravidian Paragraph-level n-way Aligned Corpus
Nikhil E | Mukund Choudhary | Radhika Mamidi
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages
We present CoPara, the first publicly available paragraph-level (n-way aligned) multilingual parallel corpus for Dravidian languages. The collection contains 2,856 paragraph/passage pairs between English and four Dravidian languages. We source the parallel paragraphs from the New India Samachar magazine and align them using English as the pivot language. We conduct human and automatic evaluations to validate the high quality of the alignments and the richness of the parallel paragraphs, which span a range of lengths. To demonstrate one of the many ways this dataset can be used, we fine-tune IndicBART, a seq2seq NMT model, on all XX-En language pairs in CoPara; the resulting models outperform existing sentence-level models on standard metrics such as BLEU, both for sentence-level translations and for longer text. We show how this dataset can enrich a model for such a task with more contextual cues and beyond-sentence understanding, even in low-resource settings such as those of Dravidian languages. Finally, the dataset and models are made publicly available at CoPara to help advance research in Dravidian NLP, parallel multilingual corpora, and beyond-sentence-level tasks such as NMT.
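The BLEU comparison mentioned above could be reproduced in spirit with sacrebleu; the sketch below is a minimal, hypothetical example whose hypothesis and reference strings are placeholders rather than CoPara data or IndicBART outputs.

```python
# Minimal sketch of a corpus-level BLEU comparison with sacrebleu; the strings
# below are invented placeholders, not CoPara references or model outputs.
import sacrebleu

references = [["The government launched a new health scheme for rural families."]]
sentence_level_output = ["Government launched new health scheme for rural family."]
paragraph_trained_output = ["The government launched a new health scheme for rural families."]

for name, hyps in [("sentence-level model", sentence_level_output),
                   ("paragraph-trained model", paragraph_trained_output)]:
    bleu = sacrebleu.corpus_bleu(hyps, references)
    print(f"{name}: BLEU = {bleu.score:.1f}")
```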
NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation
Kaustubh Dhole | Varun Gangal | Sebastian Gehrmann | Aadesh Gupta | Zhenhao Li | Saad Mahamood | Abinaya Mahadiran | Simon Mille | Ashish Shrivastava | Samson Tan | Tongshang Wu | Jascha Sohl-Dickstein | Jinho Choi | Eduard Hovy | Ondřej Dušek | Sebastian Ruder | Sajant Anand | Nagender Aneja | Rabin Banjade | Lisa Barthe | Hanna Behnke | Ian Berlot-Attwell | Connor Boyle | Caroline Brun | Marco Antonio Sobrevilla Cabezudo | Samuel Cahyawijaya | Emile Chapuis | Wanxiang Che | Mukund Choudhary | Christian Clauss | Pierre Colombo | Filip Cornell | Gautier Dagan | Mayukh Das | Tanay Dixit | Thomas Dopierre | Paul-Alexis Dray | Suchitra Dubey | Tatiana Ekeinhor | Marco Di Giovanni | Tanya Goyal | Rishabh Gupta | Louanes Hamla | Sang Han | Fabrice Harel-Canada | Antoine Honoré | Ishan Jindal | Przemysław Joniak | Denis Kleyko | Venelin Kovatchev | Kalpesh Krishna | Ashutosh Kumar | Stefan Langer | Seungjae Ryan Lee | Corey James Levinson | Hualou Liang | Kaizhao Liang | Zhexiong Liu | Andrey Lukyanenko | Vukosi Marivate | Gerard de Melo | Simon Meoni | Maxine Meyer | Afnan Mir | Nafise Sadat Moosavi | Niklas Meunnighoff | Timothy Sum Hon Mun | Kenton Murray | Marcin Namysl | Maria Obedkova | Priti Oli | Nivranshu Pasricha | Jan Pfister | Richard Plant | Vinay Prabhu | Vasile Pais | Libo Qin | Shahab Raji | Pawan Kumar Rajpoot | Vikas Raunak | Roy Rinberg | Nicholas Roberts | Juan Diego Rodriguez | Claude Roux | Vasconcellos Samus | Ananya Sai | Robin Schmidt | Thomas Scialom | Tshephisho Sefara | Saqib Shamsi | Xudong Shen | Yiwen Shi | Haoyue Shi | Anna Shvets | Nick Siegel | Damien Sileo | Jamie Simon | Chandan Singh | Roman Sitelew | Priyank Soni | Taylor Sorensen | William Soto | Aman Srivastava | Aditya Srivatsa | Tony Sun | Mukund Varma | A Tabassum | Fiona Tan | Ryan Teehan | Mo Tiwari | Marie Tolkiehn | Athena Wang | Zijian Wang | Zijie Wang | Gloria Wang | Fuxuan Wei | Bryan Wilie | Genta Indra Winata | Xinyu Wu | Witold Wydmanski | Tianbao Xie | Usama Yaseen | Michael Yee | Jing Zhang | Yue Zhang
Northern European Journal of Language Technology, Volume 9
Data augmentation is an important method for evaluating the robustness of natural language processing (NLP) models and for enhancing the diversity of their training data. In this paper, we present NL-Augmenter, a new participatory Python-based natural language (NL) augmentation framework that supports the creation of transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of NL tasks, annotated with noisy descriptive tags. The transformations incorporate noise, intentional and accidental human mistakes, socio-linguistic variation, semantically valid style and syntax changes, as well as artificial constructs that are unambiguous to humans. We demonstrate the efficacy of NL-Augmenter by using its transformations to analyze the robustness of popular language models. We find that different models are challenged differently on different tasks, with quasi-systematic score decreases. The infrastructure, data cards, and robustness evaluation results are publicly available on GitHub for the benefit of researchers working on paraphrase generation, robustness analysis, and low-resource NLP.
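To make the transformation-and-filter division of labor concrete, here is a self-contained sketch of that pattern; the class and method names are illustrative only and do not reproduce NL-Augmenter's actual interfaces or import paths.

```python
# Self-contained illustration of the two core concepts described above; these
# class and method names are illustrative, not the framework's real API.
import random

class WhitespacePerturbation:
    """Transformation: injects accidental-typing noise by doubling some spaces."""
    def __init__(self, prob=0.2, seed=0):
        self.prob, self.rng = prob, random.Random(seed)

    def generate(self, sentence: str) -> list[str]:
        words = sentence.split(" ")
        out = " ".join(w + (" " if self.rng.random() < self.prob else "") for w in words)
        return [out.strip()]

class LengthFilter:
    """Filter: keeps only examples within a word-length range (a data split)."""
    def __init__(self, min_words=3, max_words=40):
        self.min_words, self.max_words = min_words, max_words

    def filter(self, sentence: str) -> bool:
        return self.min_words <= len(sentence.split()) <= self.max_words

data = ["Data augmentation probes model robustness.", "Too short."]
transform, length_filter = WhitespacePerturbation(), LengthFilter()
kept = [s for s in data if length_filter.filter(s)]
augmented = [variant for s in kept for variant in transform.generate(s)]
print(augmented)
```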
Co-authors
- Monojit Choudhury 2
- Gaurja Aeron 1
- Utkarsh Agarwal 1
- Sajant Anand 1
- Nagender Aneja 1
- Debopriyo Banerjee 1
- Rabin Banjade 1
- Lisa Barthe 1
- Hanna Behnke 1
- Ian Berlot-Attwell 1
- Junaid Hamid Bhat 1
- Connor Boyle 1
- Caroline Brun 1
- Samuel Cahyawijaya 1
- Emile Chapuis 1
- Shivam Chauhan 1
- Wanxiang Che 1
- Jinho D. Choi 1
- Christian Clauss 1
- Pierre Colombo 1
- Filip Cornell 1
- Gautier Dagan 1
- Mayukh Das 1
- Rocktim Jyoti Das 1
- Gerard De Melo 1
- Kaustubh Dhole 1
- Marco Di Giovanni 1
- Tanay Dixit 1
- Thomas Dopierre 1
- Paul-Alexis Dray 1
- Suchitra Dubey 1
- Ondřej Dušek 1
- Nikhil E 1
- Tatiana Ekeinhor 1
- Ali El Filali 1
- Varun Gangal 1
- Sebastian Gehrmann 1
- Samujjwal Ghosh 1
- Gurpreet Gosal 1
- Tanya Goyal 1
- Aadesh Gupta 1
- Rishabh Gupta 1
- Louanes Hamla 1
- Sang Han 1
- Xudong Han 1
- Fabrice Harel-Canada 1
- Antoine Honoré 1
- Eduard Hovy 1
- Alok Anil Jadhav 1
- Ishan Jindal 1
- Madhur Jindal 1
- Przemysław Joniak 1
- Rituraj Joshi 1
- Samta Kamboj 1
- Denis Kleyko 1
- Fajri Koto 1
- Venelin Kovatchev 1
- Kalpesh Krishna 1
- Ashutosh Kumar 1
- Stefan Langer 1
- Seungjae Ryan Lee 1
- Corey James Levinson 1
- Zhenhao Li 1
- Haonan Li 1
- Hualou Liang 1
- Kaizhao Liang 1
- Zhexiong Liu 1
- Andrey Lukyanenko 1
- Abinaya Mahadiran 1
- Saad Mahamood 1
- Radhika Mamidi 1
- Vukosi Marivate 1
- Simon Meoni 1
- Niklas Meunnighoff 1
- Maxine Meyer 1
- Simon Mille 1
- Afnan Mir 1
- Nafise Sadat Moosavi 1
- Parvez Mullah 1
- Timothy Sum Hon Mun 1
- Kenton Murray 1
- Preslav Nakov 1
- Marcin Namysl 1
- Maria Obedkova 1
- Priti Oli 1
- Vasile Pais 1
- Rahul Pal 1
- Onkar Arun Pandit 1
- Nivranshu Pasricha 1
- Jan Pfister 1
- Richard Plant 1
- Vinay Prabhu 1
- Lalit Pradhan 1
- Libo Qin 1
- Zainul Abedien Ahmed Quraishi 1
- Shahab Raji 1
- Pawan Kumar Rajpoot 1
- Gokulakrishnan Ramakrishnan 1
- Vikas Raunak 1
- Roy Rinberg 1
- Nicholas Roberts 1
- Juan Diego Rodriguez 1
- Claude Roux 1
- Sebastian Ruder 1
- Dhruv Sahnan 1
- Sunil Kumar Sahu 1
- Ananya Sai 1
- Vasconcellos Samus 1
- Robin Schmidt 1
- Thomas Scialom 1
- Tshephisho Sefara 1
- Neha Sengupta 1
- Saqib Shamsi 1
- Avraham Sheinin 1
- Xudong Shen 1
- Yiwen Shi 1
- Freda Shi 1
- Ashish Shrivastava 1
- Awantika Shukla 1
- Anna Shvets 1
- Nick Siegel 1
- Damien Sileo 1
- Jamie Simon 1
- Chandan Singh 1
- Aaryamonvikram Singh 1
- Roman Sitelew 1
- Marco Antonio Sobrevilla Cabezudo 1
- Jascha Sohl-Dickstein 1
- Priyank Soni 1
- Taylor Sorensen 1
- William Soto Martinez 1
- Aman Srivastava 1
- Aditya Srivatsa 1
- Tony Sun 1
- A Tabassum 1
- Samson Tan 1
- Fiona Tan 1
- Ryan Teehan 1
- Mo Tiwari 1
- Marie Tolkiehn 1
- Mukund Varma 1
- Natalia Vassilieva 1
- Athena Wang 1
- Zijian Wang 1
- Zijie Wang 1
- Gloria Wang 1
- Fuxuan Wei 1
- Bryan Wilie 1
- Genta Indra Winata 1
- Tongshang Wu 1
- Xinyu Wu 1
- Witold Wydmanski 1
- Tianbao Xie 1
- Usama Yaseen 1
- Michael Yee 1
- Jing Zhang 1
- Yue Zhang 1