Pramit Bhattacharyya

2025

Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across diverse tasks and languages. In this study, we focus on natural language understanding in three classical languages (Sanskrit, Ancient Greek and Latin) to investigate the factors affecting cross-lingual zero-shot generalization. First, we explore named entity recognition and machine translation into English. While LLMs perform as well as or better than fine-tuned baselines on out-of-domain data, smaller models often struggle, especially with niche or abstract entity types. In addition, we concentrate on Sanskrit by presenting a factoid question-answering (QA) dataset and show that incorporating context via a retrieval-augmented generation approach significantly boosts performance. In contrast, we observe pronounced performance drops for smaller LLMs across these QA tasks. These results suggest that model scale is an important factor influencing cross-lingual generalization. Assuming that the models used, such as GPT-4o and Llama-3.1, are not instruction fine-tuned on classical languages, our findings provide insights into how LLMs may generalize on these languages and their consequent utility in classical studies.

Leveraging LLMs for Bangla Grammar Error Correction: Error Categorization, Synthetic Data, and Model Evaluation
Pramit Bhattacharyya | Arnab Bhattacharya
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) perform exceedingly well in Natural Language Understanding (NLU) tasks for many languages including English. However, although Bangla is the fifth most-spoken language globally, Grammatical Error Correction (GEC) in Bangla remains underdeveloped. In this work, we investigate how LLMs can be leveraged for improving Bangla GEC. To that end, we first carry out an extensive categorization of 12 error classes in Bangla and conduct a survey of native Bangla speakers to collect real-world errors. We next devise a rule-based noise injection method to create grammatically incorrect sentences corresponding to correct ones. The Vaiyākaraṇa dataset, thus created, consists of 5,67,422 sentences, of which 2,27,119 are erroneous. This dataset is then used to instruction-tune LLMs for the task of GEC in Bangla. Evaluations show that instruction-tuning with Vaiyākaraṇa improves the GEC performance of LLMs by 3-7 percentage points compared to the zero-shot setting, and lets them achieve human-like performance in grammatical error identification. Humans, though, remain superior in error correction. The data and code are available from https://github.com/Bangla-iitk/Vaiyakarana.

BanglaByT5: Byte-Level Modelling for Bangla
Pramit Bhattacharyya | Arnab Bhattacharya
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models (LLMs) have achieved remarkable success across various natural language processing tasks. However, most LLM models use traditional tokenizers like BPE and SentencePiece, which fail to capture the finer nuances of a morphologically rich language like Bangla (Bengali). In this work, we introduce BanglaByT5, the first byte-level encoder-decoder model explicitly tailored for Bangla. Built upon a small variant of Google’s ByT5 architecture, BanglaByT5 is pre-trained on a 14GB curated corpus combining high-quality literary and newspaper articles. Through zero-shot and supervised evaluations across generative and classification tasks, BanglaByT5 demonstrates competitive performance, surpassing several multilingual and larger models. Our findings highlight BanglaByT5’s potential as a lightweight yet powerful tool for Bangla NLP, particularly in resource-constrained or scalable environments. BanglaByT5 is publicly available for download from https://huggingface.co/Vacaspati/BanglaByT5.

2023

VacLM at BLP-2023 Task 1: Leveraging BERT models for Violence detection in Bangla
Shilpa Chatterjee | P J Leo Evenss | Pramit Bhattacharyya
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

This study introduces the system submitted to the BLP Shared Task 1: Violence Inciting Text Detection (VITD) by the VacLM team. In this work, we analyzed the impact of various transformer-based models for detecting violence in texts. BanglaBERT outperforms all the other competing models. We also observed that the transformer-based models are not adept at classifying the Passive Violence and Direct Violence classes but can better detect violence in texts, which was the task’s primary objective. On the shared task, we secured a rank of 12 with a macro F1-score of 72.656%.

VACASPATI: A Diverse Corpus of Bangla Literature
Pramit Bhattacharyya | Joydeep Mondal | Subhadip Maji | Arnab Bhattacharya
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

LSJSP at SemEval-2023 Task 2: FTBC: A FastText based framework with pre-trained BERT for NER
Shilpa Chatterjee | Leo Evenss | Pramit Bhattacharyya | Joydeep Mondal
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This study introduces the system submitted to the SemEval 2023 Task 2: MultiCoNER II (Multilingual Complex Named Entity Recognition) by the LSJSP team. We propose FTBC, a FastText-based framework with pre-trained BERT for NER tasks with complex entities and over a noisy dataset. Our system achieves an average of 58.27% F1 score (fine-grained) and 75.79% F1 score (coarse-grained) across all languages. FTBC outperforms the baseline BERT-CRF model on all 12 monolingual tracks.