When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

Yikun Han, Mengfei Lan, Halil Kilicoglu


Abstract
Biomedical retrieval-augmented LLMs are often evaluated under helpful retrieved context, but in practice the evidence can also be misleading or internally conflicting. This paper studies uncertainty under those harder settings using the HealthContradict benchmark and six open-weight models. We evaluate five controlled evidence conditions: no context, correct-only context, incorrect-only context, and two mixed conditions that contain the same correct and contradictory documents in opposite orders. Correct evidence improves both accuracy and calibration, while incorrect evidence substantially degrades both. Under conflicting evidence, document order also matters: reversing the order of the same two documents changes 11.4%–25.2% of predictions and consistently reduces performance when the incorrect document appears first. We further evaluate a conflict-aware abstention score that combines model confidence with a detector of evidence conflict. In the two hardest conditions, incorrect-only and incorrect-first conflict, this score improves selective accuracy over confidence-only abstention, with mean gains of 7.2–33.4 and 3.6–14.4 points across 75%, 50%, and 25% coverage. These results show that biomedical RAG systems should be evaluated not only under helpful retrieval, but also under misleading and conflicting evidence.
Anthology ID:
2026.bionlp-1.50
Volume:
BioNLP 2026
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Dina Demner-Fushman, Sophia Ananiadou, Kirk Roberts, Junichi Tsujii
Venues:
BioNLP | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
630–643
Language:
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.50/
DOI:
Bibkey:
Cite (ACL):
Yikun Han, Mengfei Lan, and Halil Kilicoglu. 2026. When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering. In BioNLP 2026, pages 630–643, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering (Han et al., BioNLP 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.50.pdf