Evaluating LLM-as-a-Judge for Medical Term Simplification

Ioana Buhnila; Aman Sinha; Rohit Agarwal; Dilip K. Prasad; Mathieu Constant

Abstract

Highly technical medical terms are difficult for patients to understand during fast-paced hospital consultations, leading them to rely on Large Language Models (LLMs) for simplified explanations. However, LLMs can produce inaccurate or false information. Since expert evaluation is costly and time-consuming, the LLM-as-a-Judge (LaaJ) approach is increasingly adopted to assess the quality of LLM-generated text. In this paper, we investigate the reliability and robustness of LaaJ for specialized medical knowledge by evaluating the judgment capabilities of six LLMs on three dimensions: correctness, readability, and completeness. We use three judgment setups (Vanilla, Epistemic, and Bias) to probe robustness, and compare the resulting judgments against human expert annotations to measure alignment. To address the lack of specialized medical benchmarks, we introduce BrainCancerDB, an English dataset of 219 brain cancer terms with 23,652 annotations. Our findings indicate that while LLM-Judges and humans display similar trends in ranking simplified explanations, LLM-Judges tend to be more lenient on correctness, which may have serious implications in medical settings. Additionally, we observe that hallucinations in LaaJ setups can be mitigated by epistemic markers.

Anthology ID: 2026.bionlp-1.55
Volume: BioNLP 2026
Month: July
Year: 2026
Address: San Diego, California
Editors: Dina Demner-Fushman, Sophia Ananiadou, Kirk Roberts, Junichi Tsujii
Venues: BioNLP | WS
Publisher: Association for Computational Linguistics
Pages: 687–694
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.55/
Cite (ACL): Ioana Buhnila, Aman Sinha, Rohit Agarwal, Dilip K. Prasad, and Mathieu Constant. 2026. Evaluating LLM-as-a-Judge for Medical Term Simplification. In BioNLP 2026, pages 687–694, San Diego, California. Association for Computational Linguistics.
Cite (Informal): Evaluating LLM-as-a-Judge for Medical Term Simplification (Buhnila et al., BioNLP 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.55.pdf
