Davis Bartels


2026

We study how simple linguistic features relate to reader preferences in medical question answering. Our dataset contains answers to medical questions ranked in order of quality. We examine eight interpretable features of the answer text: length in words, average words per sentence, percentage of polysyllabic words, medical named entity density, perplexity, coherence, and dependency distance. We find substantial variation across annotators in both the strength and direction of these relationships. Answer length shows some of the strongest associations and predictive signals, but preferences are not consistent across annotators, with some favoring longer answers and others favoring shorter ones. A leave-one-out ablation study shows the relative impact on the predictive accuracy of our models. Overall, these results suggest that linguistic form can influence reader preference in medical text, but that these effects vary across readers and may be more complex than simple linear correlations.
Hallucinations in biomedical question answering are hard to define and compare because the literature uses overlapping and inconsistent terms. There is currently no grounded definition set that works for biomedical QA, with real examples from open-source LLMs. We introduce a layered definition of hallucinations for biomedical QA, hierarchically structured from the overarching idea of Hallucination in relation to generated model content, to source and consistency orientations, and finally to subtypes. We ground our definition taxonomy in source-attributed literature definitions and reproducible examples from REMOVED FOR REVIEW, where cases can be traced to the question, source passage, generated answer, and annotation record. We provide a framework with annotation, comparison, and error analysis to provide a clearer reference for evidence-grounded biomedical QA. We aim for this example-grounded taxonomy to support automated detection of hallucinations and their potential harmfulness.

2025

In this paper, we present an overview of CLINIQLINK a shared task, collocated with the 24th BioNLP workshop at ACL 2025, designed to stress-test large language models (LLMs) on medically-oriented question answering aimed at the level of a General Practitioner. The challenge supplies 4 978 expert-verified, medical source-grounded question–answer pairs that cover seven formats - true/false, multiple choice, unordered list, short answer, short-inverse, multi-hop, and multi-hop-inverse. Participating systems, bundled in Docker or Apptainer images, are executed on the CodaBench platform or the University of Maryland’s Zaratan cluster. An automated harness (Task 1) scores closed-ended items by exact match and open-ended items with a three-tier embedding metric. A subsequent physician panel (Task 2) audits the top model responses.
The evaluation of text generated by LLMs remains a challenge for question answering, retrieval augmented generation (RAG), summarization, and many other natural language processing tasks. Evaluating the factuality of LLM generated responses is particularly important in medical question answering, where the stakes are high. One method of evaluating the factuality of text is through the use of information nuggets (answer keys). Nuggets are text representing atomic facts that may be used by an assessor to make a binary decision as to whether the fact represented by said nugget is contained in an answer. Although manual nugget extraction is expensive and time-consuming, recent RAG shared task evaluations have explored automating the nuggetization of text with LLMs. In this work, we explore several approaches to nugget generation for medical question answering and evaluate their alignment with expert human nugget generation. We find providing an example and extracting nuggets from an answer to be the best approach to nuggetization. While, overall, we found the capabilities of LLMs to distill atomic facts limited, Llama 3.3 performed the best out of the models we tested.