2025
BHRAM-IL: A Benchmark for Hallucination Recognition and Assessment in Multiple Indian Languages
Hrishikesh Terdalkar | Kirtan Bhojani | Aryan Dongare | Omm Aditya Behera
Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)
Large language models (LLMs) are increasingly deployed in multilingual applications but often generate plausible yet incorrect or misleading outputs, known as hallucinations. While hallucination detection has been studied extensively in English, under-resourced Indian languages remain largely unexplored. We present BHRAM-IL, a benchmark for hallucination recognition and assessment in multiple Indian languages, covering Hindi, Gujarati, Marathi, and Odia, along with English. The benchmark comprises 36,047 curated questions across nine categories spanning factual, numerical, reasoning, and linguistic tasks. We evaluate 14 state-of-the-art multilingual LLMs on a benchmark subset of 10,265 questions, analyzing cross-lingual and factual hallucinations across languages, models, scales, categories, and domains using category-specific metrics normalized to the (0,1) range. Aggregation over all categories and models yields a primary score of 0.23 and a language-corrected fuzzy score of 0.385, demonstrating the usefulness of BHRAM-IL for hallucination-focused evaluation. The dataset and the code for generation and evaluation are available on GitHub (https://github.com/sambhashana/BHRAM-IL/) and HuggingFace (https://huggingface.co/datasets/sambhashana/BHRAM-IL/) to support future research in multilingual hallucination detection and mitigation.
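The paper's actual category-specific metrics are not reproduced here; purely as an illustration of what a fuzzy answer-matching score normalized to the (0,1) range can look like, here is a minimal sketch using Python's standard-library difflib (a hypothetical metric, not BHRAM-IL's implementation):

```python
from difflib import SequenceMatcher

def fuzzy_score(reference: str, prediction: str) -> float:
    """Similarity in the (0, 1) range: 1.0 for strings identical after
    case and whitespace normalization, lower for partial matches."""
    ref = " ".join(reference.lower().split())
    pred = " ".join(prediction.lower().split())
    return SequenceMatcher(None, ref, pred).ratio()

# A correct answer differing only in casing scores 1.0;
# a partially matching answer scores strictly between 0 and 1.
print(fuzzy_score("New Delhi", "new delhi"))
print(fuzzy_score("New Delhi", "Delhi"))
```

Such soft scores reward near-miss answers (transliteration or casing variants) that an exact-match metric would count as full hallucinations.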
Findings of the IndicGEC and IndicWG Shared Task at BHASHA 2025
Pramit Bhattacharyya | Karthika N J | Hrishikesh Terdalkar | Manoj Balaji Jagadeeshan | Shubham Kumar Nigam | Arvapalli Sai Susmitha | Arnab Bhattacharya
Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)
This overview paper presents the findings of the two shared tasks organized as part of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA), co-located with IJCNLP-AACL 2025. The shared tasks are: (1) Indic Grammar Error Correction (IndicGEC) and (2) Indic Word Grouping (IndicWG). For IndicGEC, participants were tasked with producing grammatically correct sentences from given input sentences in five Indian languages. For IndicWG, participants were required to generate a word-grouped variant of a provided sentence in Hindi. The evaluation metric for IndicGEC was GLEU, while exact matching was used for IndicWG. A total of 14 teams participated in the final phase of Shared Task 1, and 2 teams participated in the final phase of Shared Task 2. The maximum GLEU scores obtained in the IndicGEC shared task are 85.69 for Hindi, 95.79 for Bangla, 88.17 for Telugu, 91.57 for Tamil, and 96.02 for Malayalam. The highest exact-matching score obtained in the IndicWG shared task is 45.13%.
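As background on the two metrics, a minimal sketch of sentence-level GLEU and exact matching in pure Python. This implements the Google-BLEU (Wu et al., 2016) variant of GLEU with whitespace tokenization, both of which are simplifying assumptions; the shared task's official scorer may use a different GLEU variant and tokenization:

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_gleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Google-BLEU style GLEU: the minimum of n-gram precision and
    recall, with matches and totals pooled over n = 1..max_n."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = total_ref = total_hyp = 0
    for n in range(1, max_n + 1):
        ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
        matches += sum((ref_counts & hyp_counts).values())  # clipped overlap
        total_ref += max(len(ref) - n + 1, 0)
        total_hyp += max(len(hyp) - n + 1, 0)
    if total_ref == 0 or total_hyp == 0:
        return 0.0
    return min(matches / total_ref, matches / total_hyp)

def exact_match(reference: str, hypothesis: str) -> bool:
    """All-or-nothing comparison, as used for the word-grouping task."""
    return reference.strip() == hypothesis.strip()

print(sentence_gleu("the cat sat", "the cat sat"))  # identical -> 1.0
```

GLEU gives partial credit for overlapping n-grams, which suits correction tasks where outputs are close to the input, while exact match is all-or-nothing.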
A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs
V.S.D.S.Mahesh Akavarapu | Hrishikesh Terdalkar | Pramit Bhattacharyya | Shubhangi Agarwal | Dr. Vishakha Deulgaonkar | Chaitali Dangarikar | Pralay Manna | Arnab Bhattacharya
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across diverse tasks and languages. In this study, we focus on natural language understanding in three classical languages (Sanskrit, Ancient Greek, and Latin) to investigate the factors affecting cross-lingual zero-shot generalization. First, we explore named entity recognition and machine translation into English. While LLMs perform on par with or better than fine-tuned baselines on out-of-domain data, smaller models often struggle, especially with niche or abstract entity types. In addition, we concentrate on Sanskrit by presenting a factoid question-answering (QA) dataset and show that incorporating context via a retrieval-augmented generation approach significantly boosts performance. In contrast, we observe pronounced performance drops for smaller LLMs across these QA tasks. These results suggest that model scale is an important factor influencing cross-lingual generalization. Assuming that the models used, such as GPT-4o and Llama-3.1, are not instruction fine-tuned on classical languages, our findings provide insights into how LLMs may generalize to these languages and their consequent utility in classical studies.
2024
Aganittyam: Learning Tamil Grammar through Knowledge Graph based Templatized Question Answering
Mithilesh K | Amarjit Madhumalararungeethayan | Dharanish Rahul S | Abhijith Balan | C Oswald | Hrishikesh Terdalkar
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation
2023
Antarlekhaka: A Comprehensive Tool for Multi-task Natural Language Annotation
Hrishikesh Terdalkar | Arnab Bhattacharya
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)
One of the primary obstacles in the advancement of Natural Language Processing (NLP) technologies for low-resource languages is the lack of annotated datasets for training and testing machine learning models. In this paper, we present Antarlekhaka, a tool for manual annotation of a comprehensive set of tasks relevant to NLP. The tool is Unicode-compatible, language-agnostic, Web-deployable, and supports distributed annotation by multiple simultaneous annotators. The system sports user-friendly interfaces for 8 categories of annotation tasks. These, in turn, enable the annotation of a considerably larger set of NLP tasks. The task categories include two linguistic tasks not handled by any other tool, namely, sentence boundary detection and deciding canonical word order, which are important tasks for text that is in the form of poetry. We propose the idea of sequential annotation based on small text units, where an annotator performs several tasks related to a single text unit before proceeding to the next unit. The research applications of the proposed mode of multi-task annotation are also discussed. Antarlekhaka outperforms other annotation tools in objective evaluation. It has also been used for two real-life annotation tasks in two different languages, namely, Sanskrit and Bengali. The tool is available at https://github.com/Antarlekhaka/code.
Chandojnanam: A Sanskrit Meter Identification and Utilization System
Hrishikesh Terdalkar | Arnab Bhattacharya
Proceedings of the Computational Sanskrit & Digital Humanities: Selected papers presented at the 18th World Sanskrit Conference
Semantic Annotation and Querying Framework based on Semi-structured Ayurvedic Text
Hrishikesh Terdalkar | Arnab Bhattacharya | Madhulika Dubey | S Ramamurthy | Bhavna Naneria Singh
Proceedings of the Computational Sanskrit & Digital Humanities: Selected papers presented at the 18th World Sanskrit Conference
2022
A Novel Multi-Task Learning Approach for Context-Sensitive Compound Type Identification in Sanskrit
Jivnesh Sandhan | Ashish Gupta | Hrishikesh Terdalkar | Tushar Sandhan | Suvendu Samanta | Laxmidhar Behera | Pawan Goyal
Proceedings of the 29th International Conference on Computational Linguistics
The phenomenon of compounding is ubiquitous in Sanskrit. It serves to achieve brevity in expressing thoughts, while simultaneously enriching the lexical and structural formation of the language. In this work, we focus on the Sanskrit Compound Type Identification (SaCTI) task, where we consider the problem of identifying semantic relations between the components of a compound word. Earlier approaches rely solely on the lexical information obtained from the components and ignore the contextual and syntactic information that is most crucial for SaCTI. The SaCTI task is challenging primarily due to the implicitly encoded, context-sensitive semantic relation between the compound components. Thus, we propose a novel multi-task learning architecture which incorporates contextual information and enriches the complementary syntactic information using morphological tagging and dependency parsing as two auxiliary tasks. Experiments on the benchmark datasets for SaCTI show absolute gains of 6.1 points (accuracy) and 7.7 points (F1-score) over the state-of-the-art system. Further, our multilingual experiments demonstrate the efficacy of the proposed architecture on English and Marathi.
2019
Framework for Question-Answering in Sanskrit through Automated Construction of Knowledge Graphs
Hrishikesh Terdalkar | Arnab Bhattacharya
Proceedings of the 6th International Sanskrit Computational Linguistics Symposium