Madeline Bittner


2026

Hallucinations in biomedical question answering (QA) are hard to define and compare because the literature uses overlapping and inconsistent terms. There is currently no grounded set of definitions for biomedical QA that is illustrated with real examples from open-source LLMs. We introduce a layered definition of hallucinations for biomedical QA, structured hierarchically from the overarching notion of hallucination in generated model content, through source and consistency orientations, down to subtypes. We ground our taxonomy in source-attributed definitions from the literature and in reproducible examples from REMOVED FOR REVIEW, where each case can be traced to the question, source passage, generated answer, and annotation record. We provide a framework covering annotation, comparison, and error analysis that offers a clearer reference for evidence-grounded biomedical QA. We intend this example-grounded taxonomy to support automated detection of hallucinations and assessment of their potential harmfulness.
Much of our knowledge about anatomy and physiology is found as text in research papers and medical textbooks. For an information system to access this knowledge, it is advantageous to extract it and translate it into a computable format that can be stored in an ontology or knowledge graph. Unfortunately, existing text mining corpora, which are needed to train and evaluate data mining models, are dated and consist almost entirely of research papers, which rarely contain the complete information needed to capture complex normal physiological processes and, subsequently, to understand the pathophysiology of a disease. As a first step toward filling this gap, we have developed a guide for annotating medical textbooks for physiological events and the entities involved in these events. In addition to providing our guidelines and describing the guideline development process, we analyze the coverage of normal physiology in existing ontologies.
Patient portal messages often embed clinical questions inside long, emotionally nuanced narratives, requiring clinicians to infer the underlying information need. We study the task of rewriting verbose patient-authored narratives into concise, clinician-interpreted questions framed as if querying an electronic health record (EHR) system. We evaluate a lightweight LLM-based rewrite pipeline that constrains outputs to 10-15 words and uses rule-based validation with regeneration. We test the approach on 140 distinct patient questions drawn from the ArchEHR-QA dataset and shared task. Each system output is annotated by two annotators for quality (Good/Ok/Bad) and error types (Generic, Malformed, Tangential, Hallucination). Results show that while models follow output constraints, they often produce overly generic or tangential questions, and occasional hallucinations introduce unsupported clinical details. Across both the clinician-question and patient-narrative comparison settings, automatic metric scores overlap substantially across human quality labels; in pairwise meta-evaluation, BERTScore is the strongest proxy for human preferences. We release our code and annotations to support future work.
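
As an illustration of the validate-and-regenerate idea described in the abstract above, the following is a minimal sketch and not the authors' released code. Only the 10-15 word limit comes from the abstract; the question-mark rule, the retry budget, and the names `is_valid`, `rewrite_with_regeneration`, and `generate` are hypothetical, and `generate` stands in for whatever LLM call a real pipeline would use.

```python
from typing import Callable, Optional

MIN_WORDS, MAX_WORDS = 10, 15  # length constraint stated in the abstract


def is_valid(question: str) -> bool:
    """Hypothetical rule set: word count within bounds and a trailing '?'."""
    text = question.strip()
    n_words = len(text.split())
    return MIN_WORDS <= n_words <= MAX_WORDS and text.endswith("?")


def rewrite_with_regeneration(
    narrative: str,
    generate: Callable[[str, int], str],  # assumed interface: (narrative, attempt) -> text
    max_attempts: int = 3,                # assumed retry budget
) -> Optional[str]:
    """Call the generator until a candidate passes the rules or attempts run out."""
    for attempt in range(max_attempts):
        candidate = generate(narrative, attempt).strip()
        if is_valid(candidate):
            return candidate
    return None  # caller decides how to handle a persistent failure


def stub_generate(narrative: str, attempt: int) -> str:
    """Stand-in for an LLM call: first output is too short, second is valid."""
    outputs = [
        "Too short?",
        "What is the expected recovery time after the patient's "
        "cardiac catheterization procedure?",
    ]
    return outputs[min(attempt, len(outputs) - 1)]


result = rewrite_with_regeneration("long patient narrative ...", stub_generate)
```

With the stub generator, the first candidate fails the length rule and the second is accepted, so `result` holds the 12-word question. Returning `None` after the final attempt keeps the failure explicit rather than silently passing through an invalid rewrite.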