MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering

Wen-wai Yim, Asma Ben Abacha, Zixuan Yu, Robert Doerning, Fei Xia, Meliha Yetisgen


Abstract
Evaluating natural language generation (NLG) systems in the medical domain presents unique challenges due to the critical demands for accuracy, relevance, and domain-specific expertise. Traditional automatic evaluation metrics, such as BLEU, ROUGE, and BERTScore, often fall short in distinguishing between high-quality outputs, especially given the open-ended nature of medical question answering (QA) tasks where multiple valid responses exist. In this work, we introduce MORQA (Medical Open-Response QA), a new multilingual benchmark designed to assess the effectiveness of NLG evaluation metrics across three medical visual and text-based QA datasets in English and Chinese. Unlike prior resources, our datasets feature 2-4+ gold-standard answers authored by medical professionals, along with expert human ratings for three English and Chinese subsets. We benchmark both traditional metrics and large language model (LLM)-based evaluators, such as GPT-4 and Gemini, finding that LLM-based approaches significantly outperform traditional metrics in correlating with expert judgments. We further analyze the factors driving this improvement, including LLMs’ sensitivity to semantic nuances and robustness to variability among reference answers. Our results provide the first comprehensive, multilingual qualitative study of NLG evaluation in the medical domain, highlighting the need for human-aligned evaluation methods. We release our code and annotations to support future research.
Anthology ID:
2026.lrec-main.396
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
5035–5054
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.396/
Cite (ACL):
Wen-wai Yim, Asma Ben Abacha, Zixuan Yu, Robert Doerning, Fei Xia, and Meliha Yetisgen. 2026. MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 5035–5054, Palma de Mallorca, Spain. ELRA Language Resource Association.
Cite (Informal):
MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering (Yim et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.396.pdf