Brandon Colelough
2026
Relations of Linguistic Features and Medical Text Preferences are Nontrivial
Davis Bartels | Brandon Colelough | Dina Demner-Fushman
BioNLP 2026
We study how simple linguistic features relate to reader preferences in medical question answering. Our dataset contains answers to medical questions ranked in order of quality. We examine eight interpretable features of the answer text: length in words, average words per sentence, percentage of polysyllabic words, medical named entity density, perplexity, coherence, and dependency distance. We find substantial variation across annotators in both the strength and direction of these relationships. Answer length shows some of the strongest associations and predictive signals, but preferences are not consistent across annotators: some favor longer answers and others favor shorter ones. A leave-one-out ablation study shows the relative impact of each feature on the predictive accuracy of our models. Overall, these results suggest that linguistic form can influence reader preference in medical text, but that these effects vary across readers and may be more complex than simple linear correlations.
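To make the surface features named in the abstract concrete, here is a minimal sketch of how three of them (length in words, average words per sentence, and percentage of polysyllabic words) might be computed. It is not the authors' implementation: the function names, the regex-based tokenization, the vowel-group syllable heuristic, and the three-syllable threshold for "polysyllabic" are all assumptions made for illustration.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def surface_features(text: str) -> dict:
    """Compute three simple surface features of an answer text.

    Sentence splitting and word tokenization use plain regular
    expressions; a real pipeline would likely use a proper tokenizer.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)
    # Assumed threshold: a word is "polysyllabic" at three or more syllables.
    n_poly = sum(1 for w in words if count_syllables(w) >= 3)
    return {
        "length_words": n_words,
        "avg_words_per_sentence": n_words / len(sentences) if sentences else 0.0,
        "pct_polysyllabic": 100.0 * n_poly / n_words if n_words else 0.0,
    }


if __name__ == "__main__":
    print(surface_features("Take the medication daily. Call a doctor."))
```

The remaining features in the abstract (entity density, perplexity, coherence, dependency distance) depend on external models or parsers and are not sketched here.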
Towards Grounded Hallucination Definitions for Biomedical Question Answering with Reproducible Examples from ClinIQLink
Brandon Colelough | Davis Bartels | Madeline Bittner | Dina Demner-Fushman
BioNLP 2026
Hallucinations in biomedical question answering are hard to define and compare because the literature uses overlapping and inconsistent terms. There is currently no grounded set of definitions for biomedical QA that is illustrated with real examples from open-source LLMs. We introduce a layered definition of hallucinations for biomedical QA, hierarchically structured from the overarching idea of hallucination in relation to generated model content, to source and consistency orientations, and finally to subtypes. We ground our definition taxonomy in source-attributed literature definitions and reproducible examples from REMOVED FOR REVIEW, where cases can be traced to the question, source passage, generated answer, and annotation record. We provide a framework for annotation, comparison, and error analysis that offers a clearer reference for evidence-grounded biomedical QA. We aim for this example-grounded taxonomy to support automated detection of hallucinations and their potential harmfulness.
2025
Overview of the ClinIQLink 2025 Shared Task on Medical Question-Answering
Brandon Colelough | Davis Bartels | Dina Demner-Fushman
Proceedings of the 24th Workshop on Biomedical Language Processing
In this paper, we present an overview of ClinIQLink, a shared task co-located with the 24th BioNLP workshop at ACL 2025 and designed to stress-test large language models (LLMs) on medically oriented question answering aimed at the level of a General Practitioner. The challenge supplies 4,978 expert-verified, medical source-grounded question–answer pairs that cover seven formats: true/false, multiple choice, unordered list, short answer, short-inverse, multi-hop, and multi-hop-inverse. Participating systems, bundled in Docker or Apptainer images, are executed on the CodaBench platform or the University of Maryland’s Zaratan cluster. An automated harness (Task 1) scores closed-ended items by exact match and open-ended items with a three-tier embedding metric. A subsequent physician panel (Task 2) audits the top model responses.
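As a rough illustration of the exact-match scoring mentioned for closed-ended items, the sketch below compares predictions to gold answers after light normalization. It is not the shared task's harness: the function names are hypothetical, and the normalization (lowercasing and whitespace collapsing) is an assumption, since the abstract does not say how answers are normalized. The three-tier embedding metric for open-ended items is not specified in the abstract and is not sketched.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Assumed normalization: trim, lowercase, collapse internal whitespace."""
    return " ".join(answer.strip().lower().split())


def exact_match(prediction: str, gold: str) -> int:
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))


def exact_match_accuracy(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of items whose prediction exactly matches the gold answer."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


if __name__ == "__main__":
    # Hypothetical true/false and multiple-choice items.
    print(exact_match_accuracy(["true", "B"], ["True", "C"]))
```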