Rajarshi Ghosh
2025
MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning
Rajarshi Ghosh | Abhay Gupta | Hudson McBride | Anurag Jayant Vaidya | Faisal Mahmood
Proceedings of The First Workshop on Human–LLM Collaboration for Ethical and Responsible Science Production (SciProdLLM)
Large language models (LLMs) are increasingly deployed in clinical decision support, yet subtle demographic cues can influence their reasoning. Prior work has documented disparities in outputs across patient groups, but little is known about how internal reasoning shifts under controlled demographic changes. We introduce MEDEQUALQA, a counterfactual benchmark that perturbs only patient pronouns (he/him, she/her, they/them) while holding critical symptoms and conditions (CSCs) constant. Each vignette is expanded into single-CSC ablations, producing three parallel datasets of approximately 23k items each (69k total). We evaluate a frontier LLM and compute Semantic Textual Similarity (STS) between reasoning traces to measure stability across pronoun variants. Our results show overall high similarity (mean STS > 0.80) but reveal consistent localized divergences in cited risk factors, guideline anchors, and differential ordering—even when final diagnoses remain unchanged. Error analysis identifies specific cases where reasoning shifts occur, highlighting clinically relevant bias loci that may cascade into inequitable care. MEDEQUALQA provides a controlled diagnostic setting for auditing reasoning stability in medical AI.
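A minimal sketch of the STS comparison described in the abstract, assuming a sentence-transformers embedding model; the specific model name (all-MiniLM-L6-v2) and the helper reasoning_similarity are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch: cosine similarity between reasoning traces produced for
# two pronoun variants of the same vignette (embedding model is an assumption).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def reasoning_similarity(trace_a: str, trace_b: str) -> float:
    """Return the cosine similarity between two reasoning traces."""
    emb_a, emb_b = model.encode([trace_a, trace_b], convert_to_tensor=True)
    return util.cos_sim(emb_a, emb_b).item()

trace_he = "A 58-year-old patient with chest pain; he reports exertional dyspnea ..."
trace_they = "A 58-year-old patient with chest pain; they report exertional dyspnea ..."
print(reasoning_similarity(trace_he, trace_they))
```

Averaging such pairwise scores over vignettes and pronoun pairs would yield an aggregate stability figure like the mean STS > 0.80 reported above.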
2024
DiversityMedQA: A Benchmark for Assessing Demographic Biases in Medical Diagnosis using Large Language Models
Rajat Rawat | Hudson McBride | Rajarshi Ghosh | Dhiyaan Nirmal | Jong Moon | Dhruv Alamuri | Sean O'Brien | Kevin Zhu
Proceedings of the Third Workshop on NLP for Positive Impact
As large language models (LLMs) gain traction in healthcare, concerns about their susceptibility to demographic biases are growing. We introduce DiversityMedQA, a novel benchmark designed to assess LLM responses to medical queries across diverse patient demographics, such as gender and ethnicity. By perturbing questions from the MedQA dataset, which comprises medical board exam questions, we created a benchmark that captures the nuanced differences in medical diagnosis across varying patient profiles. To ensure that our perturbations did not alter the clinical outcomes, we implemented a filtering strategy to validate each perturbation, so that any performance discrepancies would be indicative of bias. Our findings reveal notable discrepancies in model performance when tested against these demographic variations. By releasing DiversityMedQA, we provide a resource for evaluating and mitigating demographic bias in LLM medical diagnoses.
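A minimal sketch of the gender perturbation idea, assuming a simple lexical swap; the benchmark's actual perturbations and the validation filter are more involved, and the helper perturb_gender and its swap table are illustrative assumptions rather than the paper's method.

```python
# Illustrative sketch: swap gendered terms in a MedQA-style vignette.
# The filtering step described above would then verify that the correct
# clinical answer is unchanged before the perturbed item is kept.
import re

GENDER_SWAP = {"man": "woman", "male": "female", "he": "she", "him": "her", "his": "her"}

def perturb_gender(question: str) -> str:
    """Swap male-coded terms for female-coded ones, preserving capitalization."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = GENDER_SWAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    pattern = r"\b(" + "|".join(GENDER_SWAP) + r")\b"
    return re.sub(pattern, repl, question, flags=re.IGNORECASE)

print(perturb_gender("A 45-year-old man presents with chest pain. He reports ..."))
```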
Co-authors
- Hudson McBride 2
- Dhruv Alamuri 1
- Sean O'Brien 1
- Abhay Gupta 1
- Faisal Mahmood 1