Rohit Agarwal

2026

Evaluating LLM-as-a-Judge for Medical Term Simplification
Ioana Buhnila | Aman Sinha | Rohit Agarwal | Dilip K. Prasad | Mathieu Constant
BioNLP 2026

Highly technical medical terms are difficult for patients to understand during fast-paced hospital consultations, leading them to rely on Large Language Models (LLMs) for simplified explanations. However, LLMs can produce inaccurate or false information. Since expert evaluation is costly and time-consuming, LLM-as-a-Judge (LaaJ) approach is increasingly adopted to assess the quality of LLM-generated text. In this paper, we investigate the reliability and robustness of LaaJ for specialized medical knowledge by evaluating six LLMs for their judgment capabilities on three dimensions: correctness, readability, and completeness. We utilized three judgment setups: Vanilla, Epistemic, and Bias to probe robustness, and assess them against human expert annotations to measure alignment. To address the lack of specialized medical benchmarks, we introduce BrainCancerDB, an English dataset of 219 brain cancer terms with 23,652 annotations. Our findings indicate that while LLM-Judges and humans display similar trends in ranking simplified explanations, LLM-Judges tend to be more lenient on correctness, which may have serious implications in medical setting. Additionally, we observe that hallucinations in LaaJ setups can be mitigated by epistemic markers.

2025

pdf bib abs

SHROOM-CAP: Shared Task on Hallucinations and Related Observable Overgeneration Mistakes in Crosslingual Analyses of Publications
Aman Sinha | Federica Gamba | Raúl Vázquez | Timothee Mickus | Ahana Chattopadhyay | Laura Zanella | Binesh Arakkal Remesh | Yash Kankanampati | Aryan Chandramania | Rohit Agarwal
Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025)

This paper presents an overview of the SHROOM-CAP Shared Task, which focuses on detecting hallucinations and over-generation errors in cross-lingual analyses of scientific publications. SHROOM-CAP covers nine languages: five high-resource (English, French, Hindi, Italian, and Spanish) and four low-resource (Bengali, Gujarati, Malayalam, and Telugu). The task frames hallucination detection as a binary classification problem, where participants must predict whether a given text contains factual inaccuracies and fluency mistakes. We received 1,571 submissions from 5 participating teams during the test phase over the nine languages. In the paper, we present an analysis of the evaluated systems to assess their performance on the hallucination detection task across languages. Our findings reveal a disparity in system performance between high-resource and low-resource languages. Furthermore, we observe that factuality and fluency tend to be closely aligned in high-resource languages, whereas this correlation is less evident in low-resource languages. Overall, SHROOM-CAP underlines that hallucination detection remains a challenging open problem, particularly in low-resource and domain-specific settings.

pdf bib

Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025)
Aman Sinha | Raúl Vázquez | Timothee Mickus | Rohit Agarwal | Ioana Buhnila | Patrícia Schmidtová | Federica Gamba | Dilip K. Prasad | Jörg Tiedemann
Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025)

2020

pdf bib abs

DSC-IIT ISM at WNUT-2020 Task 2: Detection of COVID-19 informative tweets using RoBERTa
Sirigireddy Dhana Laxmi | Rohit Agarwal | Aman Sinha
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

Social media such as Twitter is a hotspot of user-generated information. In this ongoing Covid-19 pandemic, there has been an abundance of data on social media which can be classified as informative and uninformative content. In this paper, we present our work to detect informative Covid-19 English tweets using RoBERTa model as a part of the W-NUT workshop 2020. We show the efficacy of our model on a public dataset with an F1-score of 0.89 on the validation dataset and 0.87 on the leaderboard.

pdf bib abs

C-Net: Contextual Network for Sarcasm Detection
Amit Kumar Jena | Aman Sinha | Rohit Agarwal
Proceedings of the Second Workshop on Figurative Language Processing

Automatic Sarcasm Detection in conversations is a difficult and tricky task. Classifying an utterance as sarcastic or not in isolation can be futile since most of the time the sarcastic nature of a sentence heavily relies on its context. This paper presents our proposed model, C-Net, which takes contextual information of a sentence in a sequential manner to classify it as sarcastic or non-sarcastic. Our model showcases competitive performance in the Sarcasm Detection shared task organised on CodaLab and achieved 75.0% F1-score on the Twitter dataset and 66.3% F1-score on Reddit dataset.