Debopriyo Banerjee
2026
Nanda Family: Open-Weights Generative Large Language Models for Hindi
Aaryamonvikram Singh | Debopriyo Banerjee | Dhruv Sahnan | Monojit Choudhury | Shivam Chauhan | Rocktim Jyoti Das | Xudong Han | Haonan Li | Alok Anil Jadhav | Utkarsh Agarwal | Mukund Choudhary | Fajri Koto | Junaid Hamid Bhat | Awantika Shukla | Samujjwal Ghosh | Samta Kamboj | Onkar Pandit | Lalit Pradhan | Rahul Pal | Sunil Kumar Sahu | Parvez Mullah | Ali El Filali | Zainul Abedien Ahmed Quraishi | Neha Sengupta | Gokulakrishnan Ramakrishnan | Rituraj Joshi | Gurpreet Gosal | Avraham Sheinin | Natalia Vassilieva | Preslav Nakov
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Aaryamonvikram Singh | Debopriyo Banerjee | Dhruv Sahnan | Monojit Choudhury | Shivam Chauhan | Rocktim Jyoti Das | Xudong Han | Haonan Li | Alok Anil Jadhav | Utkarsh Agarwal | Mukund Choudhary | Fajri Koto | Junaid Hamid Bhat | Awantika Shukla | Samujjwal Ghosh | Samta Kamboj | Onkar Pandit | Lalit Pradhan | Rahul Pal | Sunil Kumar Sahu | Parvez Mullah | Ali El Filali | Zainul Abedien Ahmed Quraishi | Neha Sengupta | Gokulakrishnan Ramakrishnan | Rituraj Joshi | Gurpreet Gosal | Avraham Sheinin | Natalia Vassilieva | Preslav Nakov
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models remain predominantly English-centric, which limits their utility for underrepresented languages. We help bridge this gap for Hindi with Llama-3-Nanda-10B-Chat (aka Nanda-10B) and Llama-3.1-Nanda-87B-Chat (aka Nanda-87B), forming the Nanda family of open-weight bilingual models (https://github.com/MBZUAI-IFM/Nanda-Family). Our approach integrates: (i) a tokenizer extending Llama’s vocabulary with 20% Hindi-specific tokens, thus halving Hindi tokenization fertility while preserving English efficiency, (ii) Hindi-first parameter-efficient continual pretraining using Llama Pro on a 65B-token corpus spanning Devanagari script, code-mixed, and Romanized Hindi, and (iii) bilingual instruction and safety alignment on a large culturally grounded dataset. The resulting Nanda models outperform open-weight LLMs of comparable size: Nanda-87B yields high generative quality, and Nanda-10B shows competitive general-purpose performance. Nanda-87B demonstrates state-of-the-art performance on summarization, translation, transliteration, and instruction following. Moreover, both models achieve state-of-the-art performance in safety and in cultural knowledge. Our results demonstrate that careful tokenizer design, data curation, and continual pretraining can yield capable and safe LLMs for resource-poor languages without compromising English performance.
2024
MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning
Debrup Das | Debopriyo Banerjee | Somak Aditya | Ashish Kulkarni
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Debrup Das | Debopriyo Banerjee | Somak Aditya | Ashish Kulkarni
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Tool-augmented Large Language Models (TALMs) are known to enhance the skillset of large language models (LLMs), thereby, leading to their improved reasoning abilities across many tasks. While, TALMs have been successfully employed in different question-answering benchmarks, their efficacy on complex mathematical reasoning benchmarks, and the potential complementary benefits offered by tools for knowledge retrieval and mathematical equation solving are open research questions. In this work, we present MathSensei, a tool-augmented large language model for mathematical reasoning. We study the complementary benefits of the tools - knowledge retriever (Bing Web Search), program generator + executor (Python), and symbolic equation solver (Wolfram-Alpha API) through evaluations on mathematical reasoning datasets. We perform exhaustive ablations on MATH, a popular dataset for evaluating mathematical reasoning on diverse mathematical disciplines. We also conduct experiments involving well-known tool planners to study the impact of tool sequencing on the model performance. MathSensei achieves 13.5% better accuracy over gpt-3.5-turbo with Chain-of-Thought on the MATH dataset. We further observe that TALMs are not as effective for simpler math word problems (in GSM-8K), and the benefit increases as the complexity and required knowledge increases (progressively over AQuA, MMLU-Math, and higher level complex questions in MATH). The code and data are available at https://github.com/Debrup-61/MathSensei.
Search
Fix author
Co-authors
- Somak Aditya 1
- Utkarsh Agarwal 1
- Junaid Hamid Bhat 1
- Shivam Chauhan 1
- Mukund Choudhary 1
- Monojit Choudhury 1
- Debrup Das 1
- Rocktim Jyoti Das 1
- Ali El Filali 1
- Samujjwal Ghosh 1
- Gurpreet Gosal 1
- Xudong Han 1
- Alok Anil Jadhav 1
- Rituraj Joshi 1
- Samta Kamboj 1
- Fajri Koto 1
- Ashish Kulkarni 1
- Haonan Li 1
- Parvez Mullah 1
- Preslav Nakov 1
- Rahul Pal 1
- Onkar Arun Pandit 1
- Lalit Pradhan 1
- Zainul Abedien Ahmed Quraishi 1
- Gokulakrishnan Ramakrishnan 1
- Dhruv Sahnan 1
- Sunil Kumar Sahu 1
- Neha Sengupta 1
- Avraham Sheinin 1
- Awantika Shukla 1
- Aaryamonvikram Singh 1
- Natalia Vassilieva 1