Towards Unified Factuality Evaluation for Biomedical QA and Summarization: Aligning Metrics with Clinical Use-Cases

Mahule Roy, Subhas Roy


Abstract
Large language models achieve strong performance on biomedical question answering and summarization benchmarks, yet traditional evaluation metrics often fail to detect clinically significant factual errors. We introduce a unified evaluation framework that combines reference-based measures with evidence-grounded factuality verification to assess biomedical text generation. Evaluating four open-source models across three benchmarks (BioASQ, PubMedQA, MedLFQA), we find that 13.4?24.7% of generated claims are contradicted and 23?41% are unsupported, despite high lexical overlap scores. Our proposed Fact-Aligned Score (FAS) correlates strongly with claim-level verifiability (rho=0.68), substantially outperforming ROUGE-L (rho=0.41). We release an open-source toolkit with model outputs and analysis scripts to support reproducible factuality evaluation and safer deployment of biomedical LLMs.
Anthology ID:
2026.bionlp-1.2
Volume:
BioNLP 2026
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Dina Demner-Fushman, Sophia Ananiadou, Kirk Roberts, Junichi Tsujii
Venues:
BioNLP | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
15–19
Language:
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.2/
DOI:
Bibkey:
Cite (ACL):
Mahule Roy and Subhas Roy. 2026. Towards Unified Factuality Evaluation for Biomedical QA and Summarization: Aligning Metrics with Clinical Use-Cases. In BioNLP 2026, pages 15–19, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
Towards Unified Factuality Evaluation for Biomedical QA and Summarization: Aligning Metrics with Clinical Use-Cases (Roy & Roy, BioNLP 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.2.pdf