BioNLP 2026 - ACL Anthology


BioNLP 2026

Dina Demner-Fushman, Sophia Ananiadou, Kirk Roberts, Junichi Tsujii (Editors)

Anthology ID: 2026.bionlp-1
Month: July
Year: 2026
Address: San Diego, California
Venues: BioNLP | WS
Events: Annual Meeting of the Association for Computational Linguistics (2026) | Biomedical Natural Language Processing Workshop (2026) | Other Workshops and Events (2026)
SIG:
Publisher: Association for Computational Linguistics
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1/
DOI:
ISBN: 979-8-89176-434-7
Bib Export formats: BibTeX
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.pdf

The Divergence Hypothesis: Unmasking Lexical Interference and Label Bias in Mental Health NLP
Moustafa Hassan

Computational mental health (CMH) classifiers often degrade under distribution shift because human annotators and distant-supervision pipelines reward different linguistic signals. We introduce TSS (Triple-Stream Stress probe), a multi-channel diagnostic framework that decomposes text into (A) lexical character n-grams, (B) a small, mostly content-free morpho-syntactic channel, and (C) a 154-feature psycholinguistic style channel. Across four English datasets (N = 12,906), TSS reveals a lexical interference effect: adding lexical features to the style channel reduces Macro-F1 on human-labeled data (mean drop 0.072, p 10??) but not on auto-labeled data. We propose Degree of Divergence (DoD), a difference-in-differences statistic adapted from econometrics for label-source auditing, with instance-level bootstrap inference; the headline estimate is DoD(BC?A) = 0.0374, 95% CI [0.0097, 0.0651], p = 0.0032. A platform-stratified Twitter-only DoD (which removes the Reddit vs. Twitter contrast) reproduces the pattern with bootstrap inference: DoD??,BC?A = +0.096 (p < 0.001) and DoD??,AC?A = −0.089 (p < 0.001). Interventional masking (pos_only) retains approximately 95–99% of Channel C’s performance after destroying content words on human datasets, indicating that the style channel does not rely primarily on lexical surface form. TSS is positioned as a diagnostic audit framework, not a clinical screening tool: it flags label-source-specific shortcut learning before generalization claims are made.

Towards Unified Factuality Evaluation for Biomedical QA and Summarization: Aligning Metrics with Clinical Use-Cases
Mahule Roy | Subhas Roy

Large language models achieve strong performance on biomedical question answering and summarization benchmarks, yet traditional evaluation metrics often fail to detect clinically significant factual errors. We introduce a unified evaluation framework that combines reference-based measures with evidence-grounded factuality verification to assess biomedical text generation. Evaluating four open-source models across three benchmarks (BioASQ, PubMedQA, MedLFQA), we find that 13.4–24.7% of generated claims are contradicted and 23–41% are unsupported, despite high lexical overlap scores. Our proposed Fact-Aligned Score (FAS) correlates strongly with claim-level verifiability (rho=0.68), substantially outperforming ROUGE-L (rho=0.41). We release an open-source toolkit with model outputs and analysis scripts to support reproducible factuality evaluation and safer deployment of biomedical LLMs.

Using Synthetic Records to Improve Automated Identification of Seizure Freedom in Clinical Text about People with Epilepsy
Stephen Barlow | Yujian Gan | Joe Davies | Joel Winston | James Teo | Mark Richardson | Ben Holgate

Seizure freedom is a key clinical outcome for people with epilepsy (PWE), yet in the United Kingdom it is primarily recorded in free-text notes and letters, making it difficult to aggregate and track at scale. This paper introduces a generative LLM-based pipeline boosted by synthetic data to identify a PWE’s seizure freedom status in clinicians’ records. We fine-tuned seven different LLMs with between 4 and 14 billion parameters using LoRA to compare models trained on synthetic records against those trained on expert-annotated records. The best-performing configuration, based on Qwen-2.5-14B, was trained entirely on synthetic records and used chain-of-thought (CoT) reasoning (both generated by GPT-5). This achieved an F1 score of 0.90±0.02 on double-annotated test data and outperformed the equivalent model trained on authentic clinician records, which achieved 0.87±0.04. The synthetically trained models also have the benefit of outputting their CoT reasoning process for greater decision-making transparency, and they allow the unused supervised training data to serve as a substantially larger set of test examples. This work has implications for monitoring a key treatment outcome for PWE automatically and at scale.

Analyzing Prompt Design Choices in Biomedical Information Extraction for Low-Resource Languages
Ayesha Khatun | Kadir Bulut Ozler | Steven Bethard | Egoitz Laparra

This paper studies how to improve biomedical named entity recognition (NER) using large language models (LLMs), especially for low-resource languages like Bangla and Basque. The main goal is to understand how different prompt styles and output formats affect model performance. The study finds that the way we design prompts is very important. Among all methods, question-style prompting works best across all languages. It helps the model understand the biomedical task more clearly and improves accuracy. In fact, improvements are much greater in Bangla and Basque compared to high-resource languages like English and Spanish. Another key finding is about the output format. Traditional BIO tagging (labeling each word) performs poorly with LLMs because it is strict and sensitive to small errors. Instead, span-based extraction (directly extracting text phrases) works much better and gives higher F1 scores. This is because LLMs naturally generate text spans rather than token-level labels. The paper also analyzes errors. Common problems include hallucination, missing entities, and boundary mistakes. Translation-based prompts can reduce hallucination, while question-style prompts reduce empty outputs in biomedical NER. Overall, the study shows that choosing the right prompt and output format is very important, especially for low-resource high-vocabulary languages. It provides useful guidance for building better multilingual medical information extraction systems.

Hierarchy-Aware Hyperbolic and Semantic Reranking for Ontology-Based Phenotype Linking
Thomas Labbe | Moussa Baddour | Axel Bonesteve | Paul Rollier | Marie De Tayrac | Olivier Dameron

Extracting structured knowledge from unstructured text is a fundamental challenge in machine learning, particularly for concepts organized within complex hierarchical ontologies. In genomics, identifying phenotypes from clinical narratives is crucial for diagnostic precision, yet current methods struggle with contextual interpretation and subtle clinical descriptions. We present a hierarchy-aware workflow for ontology-based phenotype linking that combines semantic and hierarchical signals. Our approach integrates Large Language Models for span detection with retrieval and a hybrid reranking strategy using both Euclidean (semantic) and hyperbolic (hierarchical) embeddings trained on the Human Phenotype Ontology. We show that while hyperbolic embeddings alone do not outperform standard semantic retrieval, they provide complementary structural signals that improve performance over strong baselines when combined with Euclidean representations. In particular, the hybrid approach outperforms existing state-of-the-art methods and yields more hierarchically coherent predictions, especially in settings involving implicit phenotype mentions. Experiments on a public benchmark (ID-68) and a newly released clinical dataset (CHU-50), publicly released with code and data, highlight both performance gains and improved alignment with ontology structure. We further introduce a hierarchy-aware evaluation framework that reflects clinical relevance beyond exact-match metrics.

Agentic Feature Selection via LLM for Epileptic Seizure Detection
Aizierjiang Aiersilan | Xiaodong Qu

Automated epileptic seizure detection from electroencephalography (EEG) signals is a clinically important task in which feature selection is typically performed using purely statistical criteria. We investigate whether a small instruction-tuned large language model (LLM) can guide iterative feature selection for binary seizure detection on the Epileptic Seizure Recognition dataset (11,500 samples, 178 features). The LLM agent (Qwen2.5-1.5B-Instruct) receives five complementary statistical summaries and selects a feature subset through multi-round reasoning. The agent achieves 96.5% accuracy and 0.911 F1 with 40 features, compared to 97.9% accuracy and 0.946 F1 for the best full-feature baseline (SVM-RBF on 178 features). Critically, 39 of the agent’s 40 features coincide with the top-39 mutual-information features, and a deterministic Top-39 MI filter, evaluated by the same Random Forest classifier, attains the same 96.5% accuracy and 0.911 F1. We therefore present this work as an empirical baseline: at the 1.5B-parameter scale, the LLM behaves close to a univariate MI ranker. We situate the result against the recent LLM-based feature selection literature and enumerate the ablations and multi-dataset extensions required to determine whether larger or domain-specialized LLMs add value beyond statistical filtering.

Training Biomedical Retrievers From Large-Scale Citation Contexts
Xing David Wang | Duy Le Thanh | Ulf Leser

The MedCPT model has demonstrated that strong biomedical retrievers can be trained using proprietary PubMed search logs. In this work, we study whether freely available citation sentences are sufficient to train similarly effective models. We construct a large-scale training dataset of ~62 million citation sentence-abstract pairs extracted from PubMed Central. We train a lightweight BERT-based retriever-reranker model called CiteRec on this dataset and evaluate it across three benchmark settings: (a) the biomedical subset of BEIR for information retrieval, (b) SciRepEval for generalizable scientific document embeddings, and (c) CitancePlus, a new set of ~90 thousand citation sentence-abstract pairs for PubMed-scale citation recommendation. We show that CiteRec performs competitively with MedCPT on the biomedical BEIR subset and outperforms it on SciRepEval. On CitancePlus, CiteRec achieves strong performance for citation recommendation over the full PubMed corpus, outperforming both MedCPT and a substantially larger Qwen3-Embedding-8B retriever.

Reliable Automated Triage in Spanish Clinical Notes: A Hybrid Framework for Risk-Aware HIV Suspicion Identification
Rodrigo Morales-Sánchez | Soto Montalvo | Raquel Martínez

Standard clinical Natural Language Processing (NLP) benchmarks often yield inflated metrics by forcing deterministic classification on ambiguous instances, thereby obscuring the clinical risks of overconfident predictions. To bridge this gap, we propose a risk-aware hybrid selective classification framework, evaluated on early Human Immunodeficiency Virus suspicion identification in Spanish clinical notes. Our dual-verification approach explicitly decouples aleatoric uncertainty through Mondrian conformal prediction and epistemic uncertainty using a Multi-Centroid Mahalanobis Distance veto. Empirical evaluations reveal that standard uncertainty metrics and baseline classifiers are structurally insufficient for safe medical triage, suffering severe coverage collapse when forced to operate under strict reliability constraints. In contrast, by demanding that clinical narratives pass both probabilistic and geometric safeguards, the proposed framework successfully isolates a highly trustworthy operational domain. The obtained results show that explicit, decoupled uncertainty quantification is essential for translating biomedical NLP into responsible clinical practice.

Gold Label Errors in the SciFact Benchmark: An LLM-Assisted Annotation Audit
Julien Sylvestre

SciFact is a widely used benchmark for scientific claim verification (645 citations, included in the BEIR evaluation suite). We present, to our knowledge, the first systematic annotation audit of its development and training sets, combining automated screening with a small language model ($0.11 in API fees) and exhaustive manual verification against source publications. We identify 11 gold-label errors in the development set (5.3%, 95% CI 2.7–9.2%, of 209 audited claim–document pairs) and 13 in the training set (2.3%, 95% CI 1.2–3.9%, of 564 audited pairs). The dev errors exhibit a directional asymmetry: 9 of 11 mislabel a claim as SUPPORT (one-sided binomial p=0.033, two-sided p=0.065), and they fall into four recurring types. Correcting the dev labels raises binary macro-F1 by 1.7–3.8 points across GPT-5.4 (mini, nano) and Claude Haiku 4.5; gains are larger in 3-way evaluation when mislabeled evidence is recast as NEI (e.g., +9.2 with Haiku 4.5). The binary range is comparable in magnitude to inter-system margins on the SciFact leaderboard. A simple claim-only probe with Haiku 4.5 does not support label memorization as the main explanation for these gains. We release corrected annotations and a blind annotator packet, and recommend that benchmark users prefer the corrected release going forward.

BioRAG: A Systematic Ablation Study of Retrieval Strategies for Biomedical Question Answering
Krushil Bhojani | Mayank Waghmare | Hima Bindu Nandyala

Retrieval strategy selection is a critical but understudied design decision in biomedical RAG systems. Existing evaluations rely on lexical metrics that miss answer grounding, or require proprietary infrastructure that limits reproducibility. We present BioRAG, a head-to-head ablation of seven retrieval strategies on BioASQ-13b, evaluated using four RAGAs metrics with a locally deployed judge at zero monetary cost. Hybrid BM25 plus dense retrieval with Reciprocal Rank Fusion achieves faithfulness of 0.534 and context recall of 0.507, improvements of 50% and 85% over naive dense retrieval, confirmed across three random seed re-samples. HyDE improves faithfulness by 14% but reduces context precision by 52%, a tradeoff not previously documented on BioASQ. No single strategy dominates all four metrics, indicating that strategy selection must be application-driven. Sensitivity analysis across k in {3,5,10} confirms ranking stability. A domain mismatch diagnostic confirms 2% corpus coverage failure. The full pipeline runs on consumer hardware without paid APIs, directly addressing BioNLP 2026’s emphasis on reproducibility and evaluation frameworks for health-related applications.

Post Hoc Agentic Refinement for Improving Precision in Multilingual Clinical Text De-identification
Justin Xu | Alistair Johnson | Thomas Lin | David Eyre | Rodolfo Quispe

De-identification systems prioritize recall to protect privacy, but excessive over-tagging reduces data utility. We propose an agentic refiner that reviews high-recall annotations using lightweight tools (validation functions, adaptive context retrieval, persistent to-do state, and modular review skills) to improve precision while minimizing recall loss. Experiments across three multilingual datasets show that the agent achieves significant improvements to binary precision. To support fine-grained analysis, we further introduce a synthetic error dataset of common and systemic failure modes, on which the agent corrects 99% of injected errors in the medical datasets. Our results suggest that agent-based refinement provides a flexible and effective mechanism for improving de-identification precision as a modular extension to existing high-recall systems.

Do Syntactic Features Help Biomedical Relation Extraction? An Empirical Study of Verb Token and Dependency Graph Augmentation
Mustafa Sikder | Ernest Kwegyir-Afful

We investigate whether explicit syntactic features improve transformer-based biomedical relation extraction when added to typed entity marker pooling. We evaluate two augmentation strategies on top of BiomedBERT: (1) verb token augmentation, which concatenates the hidden state of the dependency root verb to the entity representations, and (2) a two-layer graph convolutional network (GCN) that refines encoder hidden states over the dependency parse before entity pooling. We experimented on three biomedical datasets (ChemProt, DDI, and AIMed) with three random seeds. We found that neither strategy consistently outperformed the entity-only baseline. The GCN yielded modest gains on AIMed (+0.007 F1) and ChemProt (+0.003 F1) but decreased performance on DDI (-0.013 F1). Verb token augmentation helps only on AIMed (+0.004 F1) and underperforms on the other two datasets. A syntactic characterization of the datasets reveals that DDI has substantially higher passive voice usage (50.7% of relation-bearing sentences) than AIMed (27.0%) or ChemProt (30.9%), suggesting that syntactic augmentation is more effective when sentences exhibit active verbal structure with semantically informative predicates. These results suggest that corpus-level syntactic characteristics, particularly passive voice usage, may moderate the utility of explicit syntactic augmentation, though the small magnitude of observed differences warrants caution in interpretation.

Beyond Knowledge Graphs: PubMedBERT Embeddings as a Competitive Standalone Modality for Drug Re-purposing
Rishik Kondadadi | John E. Ortega

Drug repurposing methods rely heavily on knowledge graph (KG) embeddings, but building and curating these graphs takes considerable effort. We present two findings on the Hetionet drug-disease benchmark and an epilepsy ranking task. First, PubMedBERT text embeddings, fed through the same downstream classifiers and identical 10-fold splits as four re-trained KG baselines (TransE, ComplEx, DistMult, RotatE), reach AUROC 0.910, above all four (best: RotatE, 0.854); a Random Forest on the same vectors scores 0.880. The comparison is asymmetric in one important way: PubMedBERT was pretrained on the literature Hetionet was curated from, so the result is best read as “text-with-literature-supervision vs. graph-only,” and a head-to-head with text-augmented KG methods (KG-BERT, TxGNN) is left as follow-up. Second, across all seven combinations of text, molecular (ECFP4), and gene expression (LINCS L1000) features, cross-attention fusion of weaker modalities into text consistently degrades performance, despite a gated mechanism intended to suppress unhelpful modalities; the residual path forces the strong modality to absorb noise. The model also ranks proconvulsants (amoxapine, flumazenil) near the top, because text embeddings encode strength of association with a disease but not its direction.

When Demographic Sensitivity Isn’t What It Seems: Baseline-Aware Counterfactual Audits for Clinical NLP
Hyunwoo Yoo

Clinical NLP systems are increasingly used for triage support, prediction, and decision assistance in EHR-based settings, where demographic fairness is a critical concern. A common evaluation approach is counterfactual demographic perturbation: modifying attributes such as age or sex while holding clinical evidence fixed and measuring output changes. However, we show that such counterfactual audits can be misleading when interpreted in isolation. Across three clinical LLMs, we find that non-demographic control perturbations (e.g., paraphrases) often induce output variability comparable to or greater than demographic edits. This can contribute to overestimation or misinterpretation of demographic bias. To address this, we propose a baseline-aware audit framework that explicitly compares demographic perturbations against control baselines. Our analysis reveals that (i) label-level stability can mask substantial variation in generated rationales and recommendations, and (ii) age-based perturbations generally induce larger effects than sex-based ones in borderline cases. Crucially, we identify a high intrinsic instability ("noise floor"; 0.46–0.71 Jaccard instability) in clinical LLM generations, while additional matched-metric analyses show that demographic perturbations are often comparable to non-demographic baseline variability. These findings highlight a key limitation of existing fairness evaluations: without establishing appropriate baselines, apparent demographic sensitivity may be over- or mis-attributed to bias rather than broader generative instability. We argue that baseline-aware counterfactual audits, which explicitly compare demographic effects against intrinsic model noise, provide a more reliable lens for evaluating clinical NLP systems in high-stakes settings.

CoreELM: An Open-Source Framework for Aligning Large Language Models to Embedding Spaces
Brian Ondov | Chia-Hsuan Chang | Yujia Zhou | Mauro Giuffrè | Hua Xu

Text embeddings have become an essential part of a variety of language applications. However, methods for interpreting, exploring and reversing embedding spaces are limited, reducing transparency and precluding potentially valuable generative use cases. In this work, we develop an open-source, domain-agnostic framework for aligning Large Language Models to embedding spaces using the recently reported Embedding Language Model (ELM) method. We demonstrate our framework by training models to recover, summarize, and compare clinical trial abstracts from embeddings alone. In addition to inverting embeddings back to text more reliably than existing methods, our models can decode novel, interpolated embeddings into new clinical trial abstracts that human experts cannot distinguish from real ones. We further show that these generated abstracts are responsive to moving embeddings along concept vectors for age and sex of study subjects. Our public ELM implementation and experimental results will aid the alignment of Large Language Models to embedding spaces in the biomedical domain and beyond.

Uncertainty-Aware Multi-Label Routing of Clinical Text to Surveillance Pathways
Agathe Zecevic | Sebastian Zeki | Angus Roberts

Clinical decision support systems that operate across multiple downstream care pathways must first determine which pathway or pathways are relevant for a given patient. We study this routing problem in gastrointestinal surveillance, where paired endoscopy and histopathology text reports may indicate multiple concurrent conditions and therefore require multi-label routing. In this context, standard hard-label evaluation can be insufficient: a model may achieve reasonable overall performance while still excluding clinically important pathways when uncertain. We formulate gastrointestinal report routing as a multi-label uncertainty-aware classification task over six pathway labels and compare lightweight lexical baselines, frozen embedding models and a fine-tuned transformer baseline under two complementary uncertainty mechanisms: threshold-based abstention and set-valued conformal prediction. Using 1,773 paired reports from a single NHS trust with disjoint train, calibration and test splits, we evaluate both hard-routing performance and the downstream review burden introduced by uncertainty-aware prediction. The fine-tuned ClinicalBERT model achieved the strongest overall performance (0.811 subset accuracy, 0.861 macro-F1) and the lowest AURC of 0.084 under min-margin abstention. Threshold-based abstention consistently reduced exact-match routing error on accepted reports. For conformal routing at α=0.10, Mondrian calibration achieved high mean positive-label recall coverage across learned baselines (0.883–0.917). The fine-tuned model achieved 0.891 mean recall coverage with a mean prediction set size of 1.70, 0.642 candidate-label precision and 0.61 false-positive labels per report. Compared with a recall-tuned threshold baseline at similar recall, Mondrian CP produced smaller candidate sets, higher candidate-label precision and fewer false-positive pathway suggestions.
These results show that uncertainty-aware evaluation exposes clinically important failure modes missed by aggregate metrics. They also show that high-recall routing is not cost-free: set-valued prediction can reduce missed-pathway risk but must be interpreted as candidate generation for downstream review rather than automated pathway selection.

MedCAT v2: a modular, extensible architecture for clinical named entity recognition and linking under real-world privacy and compute constraints
Mart Ratas | Thomas Searle | Adam Sutton | Richard Dobson

MedCAT is an open-source framework for clinical named entity recognition and linking (NER+L) widely used in research and healthcare settings. We present MedCAT v2, a re-engineered version designed to improve modularity, extensibility, and maintainability while preserving the core functionality and performance of previous releases. The new architecture introduces a registry-based component system and a flexible pipeline that enables easy substitution of components, integration of alternative methods, and future expansion, including support for pre-trained components across the full NER+L and contextualisation workflow. This enables systematic exploration of clinical NER+L design trade-offs by evaluating different components in the pipeline. Evaluation across multiple public datasets shows equivalent or improved performance compared to earlier versions, with reduced integration overhead and improved runtime flexibility. The framework also supports optional extensions such as meta-annotation and relation extraction, providing a unified and reproducible environment for clinical NLP in real-world settings.

Effects of Adaptive Pretraining in Specialized Domains for Named Entity Recognition
Jack Lynam | Sam Henry

Due to the unique concepts, syntactic structure, and vocabulary of specialized domains, it is common to train specialized language models (LMs) for their target domain. For example, BioClinicalBERT is a specialized LM designed for clinical applications. These specialized LMs are typically created starting with a foundation model (such as BERT-base) which has been pretrained for the general English domain, and then adapted to the target domain via additional pretraining. Alternatively, LMs may be pretrained from scratch on data from the target domain. Both techniques are extremely computationally expensive and, as such, these specialized LMs are often publicly released for other researchers. For some domains, such as the biomedical domain, there are many similar models available, which raises a question for a developer: which pretrained LM should I choose? Alternatively, in novel domains for which no specialized LMs exist, it raises different questions: Is it worth the cost to pretrain an LM from scratch? Should I adapt a general English model instead? Should I just use a general English model without adaptive pretraining? These questions are particularly salient under a limited budget, i.e., should I pay for compute time or for annotators to create a larger dataset? In this paper we compare results of nine LMs across nine datasets spanning the clinical, scientific, and biomedical-related social media domains. From these comparisons we draw several conclusions that can simplify the hyperparameter-tuning process and inform researchers and developers in novel domains. Broadly, these are that the effects of adaptive fine-tuning are small. If an adapted model exists in your domain, choose the one most closely related to your task. If no model exists, using a foundation model is likely sufficient.

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA
Ikram Belmadani | Oumaima El Khettari | Carlos Ramisch | Frederic Bechet | Richard Dufour | Benoit Favre

The development of large language models (LLMs) has led to increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling
Ying-Jia Lin | Tzu-Chin Lo | Ping-Chien Li | Chi-Tung Cheng | Chien-Hung Liao | Hung-Yu Kao

Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large-scale annotation for medical imaging research. Existing rule-based labelers struggle with the diverse descriptions in clinical reports, while fine-tuning pre-trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge-enhanced multi-label prompt-tuning approach for radiology report labeling under low-resource settings. PromptRad reformulates multi-label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi-word verbalizer to enrich category representations. By fine-tuning the PLM without additional classification layers, PromptRad requires substantially less labeled data than conventional fine-tuning. Experiments on liver CT (computed tomography) reports show that PromptRad outperforms dictionary-based and fine-tuning baselines with only 32 labeled training examples, and achieves competitive performance with GPT-4 despite using a much smaller model. Further analysis demonstrates that PromptRad captures complex negation patterns more effectively than existing methods, making it a promising solution for report labeling in data-scarce clinical scenarios. Our code is available at https://github.com/ila-lab/PromptRad.

Diagnosing Lower Extremity Arteriovenous Diseases Using Agentic LLMs
Zicen Liao | Yunhao Sun | Matthew Purver

This paper introduces LEA-Dialog, a multi-turn diagnostic dialogue dataset for lower-extremity arteriovenous diseases, together with a carefully developed diagnostic handbook and a process-aligned agentic framework for structured outpatient diagnosis. The dataset provides stage annotations for each turn and guideline-grounded probability trends, enabling evaluation beyond final diagnostic accuracy. Experiments show that the framework improves reasoning stability and reduces drift across both online and offline LLMs, with particularly large gains for smaller offline models.

KGRxn-LLM: Knowledge Graph Enhanced Large Language Models for Molecular Reaction Reasoning
Weichen Liu | Qiyao Xue | Yuyang Wu | Olexandr Isayev | Natasa Miskov-Zivanov

Large language models (LLMs) demonstrate strong general language capabilities but remain limited in chemical reasoning, particularly for tasks requiring structured, mechanistic understanding of molecular reactions. We present Knowledge Graph Reaction LLM (KGRxn-LLM), a framework that augments LLMs with a hierarchical chemical knowledge graph (KG) to ground reasoning in molecular transformations and reaction patterns. Existing benchmarks primarily emphasize reaction or molecular fact recall, providing limited assessment of reaction-level mechanistic reasoning. To address this gap, we introduce KGRxn-Bench, a benchmark of 1,200 questions designed to evaluate LLMs on reaction-centric reasoning tasks, including functional group identification, reaction type classification, and product and reagent prediction. Experimental results show that our approach of grounding LLMs in structured KG substantially improves performance across multiple tasks and model backbones, outperforming domain-specific fine-tuned models on KG-covered splits and most hold-out splits.

MAX-EVAL-11: A Large Scale Benchmark for Evaluating Large Language Models on Full-Spectrum ICD-11 Medical Coding
Ujjwal Singh | Sarthak Deshwal | Nitish Dube | Arjun Sharma

The global transition to the ICD-11 taxonomy demands robust automated medical coding, yet comprehensive benchmarks to evaluate Large Language Models (LLMs) on this task remain absent. We introduce MAX-EVAL-11, the first large-scale benchmark for full-spectrum ICD-11 medical coding. MAX-EVAL-11 comprises 10,000 MIMIC-III discharge summaries with mapped, expert-validated ICD-11 annotations spanning 99.87% of the diagnostic taxonomy. To better reflect clinical utility, we propose a novel hierarchical evaluation framework that assigns partial credit based on ICD-11’s 5-level structure, addressing the brittleness of traditional exact-match metrics. Our evaluation of state-of-the-art LLMs reveals significant performance gaps. The best-performing model (Claude 4 Sonnet) achieves a weighted score of 0.433, outperforming both general-purpose peers and specialized medical models (MedCoder). Crucially, all models exhibit near-zero exact match rates (0–4.8%) and rely primarily on hierarchical credit, underscoring the extreme difficulty of precise ICD-11 code generation. Furthermore, the superiority of general-purpose LLMs over legacy ICD-10 medical models (with ICD-11 codelist) suggests that broad reasoning capabilities currently outweigh domain-specific training for complex taxonomy scaling.
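
The abstract does not spell out how partial credit is computed. As an illustration only, one simple way to score a prediction against a 5-level hierarchy is to credit the fraction of levels shared from the top down. The path representation, the function names, and the placeholder labels below are assumptions made for this sketch, not the benchmark's actual scoring rule or real ICD-11 codes.

```python
from typing import Sequence

def shared_prefix_depth(gold: Sequence[str], pred: Sequence[str]) -> int:
    """Count how many levels match, starting from the top of the hierarchy."""
    depth = 0
    for gold_level, pred_level in zip(gold, pred):
        if gold_level != pred_level:
            break
        depth += 1
    return depth

def hierarchical_credit(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Hypothetical partial credit: shared levels divided by the longer path.

    An exact match scores 1.0; a prediction that diverges at the top scores 0.0.
    """
    if not gold or not pred:
        return 0.0
    return shared_prefix_depth(gold, pred) / max(len(gold), len(pred))

# Placeholder 5-level paths (not real ICD-11 codes).
gold_path = ("chapter_A", "block_1", "category_x", "sub_i", "leaf_p")
near_miss = ("chapter_A", "block_1", "category_x", "sub_j", "leaf_q")
far_miss = ("chapter_B", "block_9", "category_z", "sub_k", "leaf_r")

print(hierarchical_credit(gold_path, gold_path))
print(hierarchical_credit(gold_path, near_miss))
print(hierarchical_credit(gold_path, far_miss))
```

Under this scheme a model with a near-zero exact-match rate can still earn a non-trivial aggregate score, which is consistent with the pattern the abstract reports.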

Trustworthy NLP for Low-Resource Languages: Agent-Based Uncertainty Modeling for Hebrew Radiology Report Structuring
Hadas Ben Atya | Naama Gavrielov | Zvi Badash | Gili Focht | Ruth Cytter-Kuint | Talar Hagopian | Dan Turner | Moti Freiman

Reliable extraction of structured information from radiology reports using Large Language Models (LLMs) remains a significant challenge, particularly for complex, non-English texts such as Hebrew. This study proposes an agent-based, uncertainty-aware framework to enhance the reliability and interpretability of LLM predictions in clinical contexts. A total of 9,683 Hebrew radiology reports from Crohn’s disease patients (2010–2023) across three medical centers were analyzed. Of these, 512 reports were manually annotated for six gastrointestinal organs and 15 pathological findings, while the remainder were automatically labeled using HSMP-BERT. Structured data extraction was performed with Llama 3.1 (Llama 3-8b-instruct) employing Bayesian Prompt Ensembles (BayesPE), which utilized six semantically equivalent prompts to quantify uncertainty. An Agent-Based Decision Model aggregated prompt outputs into five calibrated confidence levels and was benchmarked against three entropy-based approaches. Model performance was assessed using accuracy, F1 score, precision, recall, and Cohen’s Kappa before and after filtering high-uncertainty cases. The agent-based model outperformed all baselines, achieving an F1 score of 0.3967, recall of 0.6437, and Kappa of 0.3006; after excluding cases with uncertainty = 0.5, the F1 score increased to 0.4787 and Kappa to 0.4258. The proposed framework improves uncertainty calibration and predictive reliability, advancing the safe deployment of LLMs in medical data extraction.

Treating Decoder-Only LLMs as Encoders: A Simple and Effective Fine-tuning Approach for Named Entity Recognition
Ken Yano | Hiroya Takamura

NER requires token-level classification using both left and right context, which makes encoder-only models like BERT naturally well-suited for the task. Decoder-only LLMs, by contrast, use causal masking during training, so their token representations lack right-side context, limiting their effectiveness on structured prediction tasks like NER despite their strong general capabilities. To address this, the authors propose fine-tuning decoder-only LLMs with causal attention replaced by full attention, combined with label-supervised discriminative training. While similar ideas exist in prior work, those studies were limited in scope. This work evaluates seven LLMs across four model families (Gemma, Qwen2.5, Llama3.1, Llama3.2) and compares full fine-tuning against LoRA. Results show that the proposed approach with an appropriate LoRA configuration outperforms encoder baselines (BERT, RoBERTa, DeBERTa), and achieves strong NER performance without auxiliary data or architectural modifications, though it does not reach SOTA on BC5CDR and CoNLL2003.

A Multi-View Framework for Cross-Domain Nutrition Misinformation Detection in Social Media
Vishwaa Shah | Indika Kahanda | Andrea Arikawa | Asal Abbaszadeh | Richard Loftis

Nutrition misinformation on social media often arises from selective interpretation of scientific evidence rather than outright falsehoods, making it difficult to detect. We introduce a curated, expert-annotated Instagram dataset focused on seed oils and omega-6, two domains characterized by contested dietary claims. We evaluate feature-based, embedding-based, and transformer-based models under in-domain and cross-domain settings. Results show strong in-domain performance across all models, with Sentence-BERT achieving the highest AUPRC (up to 0.96). However, performance drops substantially under cross-domain transfer, indicating limited robustness to topic shift. Analysis suggests that while contextual embeddings capture strong in-domain semantic signals, linguistically and psychologically grounded features are more stable under distribution shift. These findings highlight the value of combining semantic and interpretable linguistic signals for robust misinformation detection.

Ontological Validation of Biomedical Topic Models: SNOMED CT Hierarchy Distance as an Automated Evaluation Metric
Ilan Rubinfeld | Sami Zaidi | Milosh Djuric | Loay Kabbani | Mouhammad Halabi | Alex Shepard

Standard coherence metrics for biomedical topic models encode no clinical knowledge and cannot detect clinically implausible topic groupings. We propose SNOMED CT Wu–Palmer hierarchy distance as a post hoc, ontology-grounded diagnostic. On vascular surgery (47,318 articles) and craniofacial surgery (27,493 articles) corpora, the metric flags clinically heterogeneous topics that coherence misses, e.g., abdominal aortic aneurysm repair grouped with deep vein thrombosis (d = 0.600). Diagnostic signals are nearly identical across eight BERTopic embedding strategies including ontology-enhanced models, but diverge across model families: BERTopic alone produces a positive within- vs. cross-topic Cohen’s d, while LDA, NMF, and Top2Vec at matched topic counts score below their own cross-topic baselines (Cohen’s d < 0; Mann–Whitney p 0.99). The score is therefore sensitive to topic-model output choice, not only to embedding choice within a single pipeline. A pre-clustering screening experiment finds near-zero correlation (|?| 0.08) between embedding cosine and SNOMED CT similarity, arguing that ontological validation belongs after clustering rather than as an embedding screen. We additionally describe a two-stage UMLS-CUI stopword filter that preserves high-frequency domain-specific concepts which naive frequency filtering would discard. After one-time concept curation, the diagnostic itself is automated and requires no per-topic expert scoring.
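
For readers unfamiliar with the measure, the standard Wu–Palmer similarity between two concepts is 2·depth(LCA) / (depth(a) + depth(b)), where LCA is their lowest common ancestor, and a distance can be taken as one minus that similarity. The sketch below applies this to a tiny hand-made tree. The hierarchy and concept names are hypothetical, SNOMED CT itself is a polyhierarchy rather than a tree, and the paper's exact distance definition may differ.

```python
# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "vascular_disorder": "root",
    "arterial_disorder": "vascular_disorder",
    "venous_disorder": "vascular_disorder",
    "aortic_aneurysm": "arterial_disorder",
    "deep_vein_thrombosis": "venous_disorder",
}

def path_to_root(node: str) -> list:
    """Return the chain of nodes from `node` up to the root, inclusive."""
    path = []
    current = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path

def depth(node: str) -> int:
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def lowest_common_ancestor(a: str, b: str) -> str:
    """First ancestor of `b` (walking upward) that is also an ancestor of `a`."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("concepts do not share a root")

def wu_palmer_similarity(a: str, b: str) -> float:
    lca = lowest_common_ancestor(a, b)
    return 2 * depth(lca) / (depth(a) + depth(b))

def wu_palmer_distance(a: str, b: str) -> float:
    return 1.0 - wu_palmer_similarity(a, b)

print(wu_palmer_distance("aortic_aneurysm", "deep_vein_thrombosis"))
```

In this toy tree the two concepts meet only at a shallow shared ancestor, so their distance is large; concepts under the same immediate parent would score closer to zero.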

Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale
Jinghui Liu | Sarvesh Soni | Anthony Nguyen

Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect – such as similarity or utility comparisons – even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for tasks like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes – despite their task-agnostic nature – can effectively augment task-specific training for rare ICD codes.

Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction, Especially for Rare Codes
Jinghui Liu | Anthony Nguyen

Clinical coding maps clinical documentation to standardized medical codes, an essential yet time-consuming administrative task that could benefit from automation. Current ICD coding models are typically optimized for codes from a specific ICD version. In practice, however, ICD systems evolve continuously, and different versions are adopted across time periods and regions. Moreover, ICD coding suffers from the long-tail problem, and rare code performance can be a bottleneck for developing implementable models. We examine whether it is viable to train version-independent models by combining data annotated in different ICD versions, which may help address these challenges. We add ICD-9 data to the training of a modified label-wise attention model for ICD-10 prediction, and find that despite the version mismatch, adding ICD-9 yields a 27% increase in micro F1 for 18K rare ICD codes compared to training on ICD-10 alone. On 8K frequent ICD-10 codes, the multi-version training also substantially improves macro metrics, with far fewer model parameters.

EmCellLLM: Human Peri-Implantation Embryonic Cell Annotation Based on Large Language Models
Xiaorui Guo | Zhiwei Liu | Qianqian Xie | Sophia Ananiadou

The advent of single-cell RNA sequencing has enabled unprecedented resolution of cell fate decisions and regulatory mechanisms during peri-implantation human embryogenesis, in which accurate cell type annotation is a fundamental prerequisite and the first step for subsequent fate and mechanism inference. Large language models (LLMs) have demonstrated outstanding performance in various fields. However, current studies mostly rely on traditional methods and have not explored the application of LLMs to human embryonic cell annotation. The main reason is the lack of instruction tuning datasets and evaluation benchmarks. In this paper, we propose EmCellLLM, the first open-source LLMs specialized for the human embryonic cell type prediction task, obtained by fine-tuning Qwen3-8B on EmCell4Instruction, the first embryonic cell type prediction instruction dataset. To support LLM instruction tuning, we also build EmCellBench, the first benchmark for evaluating the human embryonic cell type prediction ability of LLMs. We compare our models with a variety of LLMs on EmCellBench, where our model outperforms all other open-source LLMs as well as DeepSeek.

Randomized Controlled Trials as the Gold-Standard for Evaluating LLMs: A Primer for Biomedical NLP Researchers
Vicente Ivan Sanchez Carmona | Shanshan Jiang | Bin Dong

Large Language Models (LLMs) are no longer mere laboratory objects of study. LLMs have become everyday tools in society across diverse populations and domains. In clinical contexts, LLMs have already been devised as clinical support applications. However, along with benefits, negative or adverse effects might arise, such as LLMs potentially providing psychologically distressing advice to adolescents when used for mental health support. This raises questions about the benefits of LLMs and calls for real-world evaluations: Are LLMs really helpful and effective for the purposes people are using them for, or will use them for? To answer this type of question, we propose to use Randomized Controlled Trials (RCTs). RCTs are considered the strictest experimental design in fields such as Medicine, Psychiatry, and Psychology; however, the use of RCTs in the NLP field is almost negligible. Although the NLP field is the de facto locus of research on LLMs, other fields, prominently Medicine, are leading the RCT evaluations of LLMs. In this primer paper, we present a concise introduction to the principles of RCTs to guide NLP researchers in designing RCT studies for evaluating LLMs.

Citation-Aware Continual Pre-Training for Biomedical Language Models
Masaki Asada | Tomoki Tsujimura | Tatsuya Ishigaki | Shusaku Egami | Ken Fukuda | Hiroya Takamura

The biomedical literature contains rich structured knowledge, including citation links that encode relationships between scientific studies, but such information is typically ignored in standard language model pre-training. We propose a citation-aware continual pre-training method for decoder-only language models that incorporates citation graph information from PubMed into next-token prediction by placing citation-linked abstract pairs within a shared context. We evaluate our method on multiple biomedical QA benchmarks using two model families. Results show that citation-aware continual pre-training achieves higher average accuracy than both the original base models and citation-unaware pre-training across biomedical tasks.

TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Medical Knowledge in Open Large Language Models
Ioana Buhnila | Aman Sinha | Mathieu Constant

While humans can easily produce various types of answers, such as definitions, examples or paraphrases, Large Language Models (LLMs) struggle to provide correct answers to medical questions that require diverse answer formats. In this paper, we introduce TrackList, a fine-grained linguistic and statistical analysis pipeline to investigate the impact of the pre-training data on LLMs’ answers to diverse linguistic queries. We also propose RefoMed-EN, a medical dataset consisting of 6,170 human-annotated medical terms alongside their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We investigated whether the high or low frequency of a concept (head or tail knowledge) impacts the language model’s performance in answering medical questions. We evaluated the quality of the LLM’s output using syntactic and semantic similarity metrics, statistical correlations and embeddings. Results showed that the LLM’s answer quality is highest for definition-type questions and lowest for exemplification-type questions. Additionally, we showed that for definition-type medical questions ("What is multiple sclerosis?"), LLMs are prone to paraphrase more for popular medical concepts, and less for more specialized medical knowledge.

Discharge Instructions are not One Task: Grounding Differences Between Surgical and Non-Surgical Admissions
Mayank Jobanputra | Justin Xu | Samarth Oza | Hulma Naseer | Yifan Wang | Blerta Veseli | Chandralekha Kona | Haochen Cui | David Eyre | Vera Demberg

Discharge instructions are patient-facing, safety-critical documents that guide medication use, follow-up care, and recovery after hospitalization. Because they must synthesize information across the clinical record and often include post-discharge guidance not stated verbatim in the EHR, they are a difficult target for clinical text generation. In this work, we study discharge instructions in MIMIC-IV through a grounding-first lens. Using two LLMs, we decompose each discharge instruction into medically relevant statements and verify them against the Electronic Health Record (EHR). We find that discharge instructions for Surgical admissions are much longer, averaging roughly 24–25 statements per admission versus 11–12 in Non-Surgical cases, while supported content remains similar in absolute count. The additional Surgical content is dominated by statements that are not directly stated in the record or require clinically plausible extrapolation. Through this analysis, we advocate for better grounding and completeness evaluations at a fine-grained level, establishing a foundational step toward safer and more reliable discharge-instruction generation.

PrionNER: A Named Entity Recognition Dataset for Prion Disease Biomedical Literature
An Dao | Nhan Ly | Thao Tran | Yuji Matsumoto | Akiko Aizawa

Prion diseases are rare, rapidly progressive, and fatal neurodegenerative disorders that remain difficult to diagnose, particularly in their early stages because of nonspecific clinical presentations. However, to our knowledge, there is no publicly available prion-disease-focused dataset designed to capture a broad range of clinically relevant entities from the biomedical literature. We introduce PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information in PubMed abstracts. The current release comprises 317 abstracts, 2,943 sentences, and 6,955 text-bound entity annotations spanning 15 coarse-grained and 31 fine-grained clinically oriented entity types covering diseases, symptoms, diagnostics, findings, anatomy, treatments, and temporal and statistical evidence. Inter-annotator agreement reaches 81.78 exact-match F1, indicating strong annotation consistency. We benchmark supervised BERT baselines, W2NER, and zero-shot extractors on PrionNER. W2NER is the strongest supervised model, and Gemma-4-31B is the strongest zero-shot model, but the benchmark remains challenging, especially for structurally complex mentions and fine-grained clinically adjacent label distinctions. PrionNER provides a clinically grounded benchmark for prion-disease information extraction and supports research on rare-disease biomedical NLP under low-resource, fine-grained, and non-flat extraction conditions. The dataset, annotation guidelines, and evaluation scripts are available at https://github.com/daotuanan/PrionNER/

Evaluation of Multilingual Text Simplification for the Mental Health Domain: Exploring Small Language Models
Olga Pelloni | Sandra Just | Lars Bongo

Individuals with particular mental health disorders may find it difficult to learn about their own condition. Therefore, efforts have been made to create materials that explain complex medical information in simpler words, which are also beneficial for caregivers and others. However, text simplification is commonly done in English and only sporadically in other languages. In this study, we explore potential ways for language-agnostic medical text simplification for the mental health domain. Our approach is to simplify the ICD-11 articles on primary psychotic disorders in English, German and French, using small LMs and various metrics for evaluating different aspects of the texts: lexical complexity and readability. Our results show that acceptable texts were produced only in English, and that a joint analysis of Measure of Textual Lexical Diversity (MTLD) and Flesch Reading Ease (FRE) provides the most insight, capturing both the best outcomes and signaling different types of issue. The study is preliminary and requires further investigation.

BioTopicXplor: A Web Tool for Interactive Exploration of PubMed Literature through Reproducible Topics.
Lana Yeganova | Donald Comeau | Won Kim | Natalie Xie | Shubo Tian | W John Wilbur | Zhiyong Lu

The rapid growth of biomedical literature presents a major challenge for organizing knowledge and identifying emerging research trends. While PubMed provides effective access to relevant articles, it does not support understanding the conceptual structure of document collections. Existing tools rely on predefined features, ontologies, or parameter-sensitive clustering methods, limiting their ability to uncover fine-grained, data-driven topics in a reproducible manner. We present BioTopicXplor, an on-demand web server for interactive exploration of biomedical literature derived from arbitrary PubMed queries. The system integrates ConvexTopics, a convex optimization–based topic modeling framework that guarantees convergence to a global optimum and eliminates the need for predefined parameters. This enables the generation of reproducible and fine-grained topic structures across large document collections. Given a PubMed query, BioTopicXplor retrieves relevant articles, performs topic discovery, and organizes the resulting subtopics into a hierarchical structure of higher-level themes. To enhance interpretability, the system incorporates large language models to generate concise, literature-grounded summaries and descriptive titles for each topic, with links to supporting evidence. We demonstrate the utility of BioTopicXplor through a case study on anti-aging research, where the system reveals meaningful thematic structures and supports knowledge discovery.

Reading Between the Lines: Toward Translating Verbose Patient-authored Messages into Clinician-Formulated Questions
Sarvesh Soni | Madeline Bittner | Dina Demner-Fushman

Patient portal messages often embed clinical questions inside long, emotionally nuanced narratives, requiring clinicians to infer the underlying information need. We study the task of rewriting verbose patient-authored narratives into concise, clinician-interpreted questions framed as if querying an electronic health record (EHR) system. We evaluate a lightweight LLM-based rewrite pipeline that constrains outputs to 10–15 words and uses rule-based validation with regeneration. We test the approach on 140 distinct patient questions drawn from the ArchEHR-QA dataset and shared task. Each system output is annotated by two annotators for quality (Good/Ok/Bad) and error types (Generic, Malformed, Tangential, Hallucination). Results show that while models follow output constraints, they often produce overly generic or tangential questions, and occasional hallucinations introduce unsupported clinical details. Across both clinician-question and patient-narrative comparison settings, automatic metrics show substantial overlap across human quality labels; in pairwise meta-evaluation, BERTScore is the strongest proxy for human preferences. We release our code and annotations to support future work.
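
A length-constrained rewrite step with rule-based validation and regeneration can be pictured as a small retry loop. The sketch below is an assumption-laden illustration, not the authors' pipeline: the question-mark rule, the retry budget, and the `stub_generate` function standing in for an LLM call are all hypothetical; only the 10 to 15 word limit comes from the abstract.

```python
from typing import Callable, Optional

MIN_WORDS, MAX_WORDS = 10, 15  # length constraint stated in the abstract

def is_valid_question(text: str) -> bool:
    """Rule-based check: word count in range and ends with '?' (assumed rule)."""
    word_count = len(text.split())
    return MIN_WORDS <= word_count <= MAX_WORDS and text.strip().endswith("?")

def rewrite_with_validation(
    narrative: str,
    generate: Callable[[str, int], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Call `generate` until an output passes validation or attempts run out."""
    for attempt in range(max_attempts):
        candidate = generate(narrative, attempt)
        if is_valid_question(candidate):
            return candidate
    return None  # caller decides how to handle a failed rewrite

def stub_generate(narrative: str, attempt: int) -> str:
    """Hypothetical stand-in for an LLM call; the first attempt is too short."""
    if attempt == 0:
        return "Is the medication safe?"
    return ("Is the patient's current blood pressure medication "
            "safe to continue after the surgery?")

result = rewrite_with_validation("long patient message ...", stub_generate)
print(result)
```

Returning `None` after the retry budget is spent keeps validation failures visible instead of silently passing through a malformed question.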

Investigating Stigmatizing Language in Clinical Documentation with Open-Source Large Language Models
Rajashree Dahal | Pardis Hosseinpour | Pranithi Kamishetty | Satwik Pamulaparthy | Saeid Tizpaz-Niari | Natalie Parde

Clinical documentation is essential for patient care, billing, and medical research, but it is subject to entrenched bias. While manual chart reviews can identify such bias, they are labor-intensive and expert-dependent. We introduce and evaluate StigMAD, a Multi-Agent Debate framework leveraging open-source Large Language Models (LLMs) to detect stigmatizing language in clinical documentation. We investigate reasoning (multi-agent debate), self-reflection, and self-consistency within this framework. Extensive experiments on clinical notes and patient summaries demonstrate that our framework provides significant advantages over rule-based and supervised baselines. A domain-specific LLM (MedGemma) achieved its highest performance using the StigMAD reasoning framework, while a general-purpose LLM (Llama) showed superior results with the self-consistency framework. These findings suggest that open-source LLMs, steered by structured prompting and reflective reasoning, can effectively support the scalable auditing of stigmatizing language, marking a critical step toward more equitable clinical NLP systems.

Learning to Combine AI Annotations for Improved Biomedical Relevance Labeling
Won Gyu Kim | Lana Yeganova | Shubo Tian | Donald Comeau | W John Wilbur | Zhiyong Lu

Accurate labeling of relevance between biomedical abstracts is essential for improving information retrieval, semantic similarity modeling, training of ranking systems and other Natural Language Processing tasks. However, manual annotations are time-consuming, labor-intensive and costly. Studies show that large language models (LLMs) can facilitate automated annotation, but their performance still falls short of human expert-level accuracy, especially in domain-specific tasks. It has been shown that combining annotations from multiple non-expert annotators can achieve performance comparable to, or even exceeding, that of trained experts. Based on this evidence, we treat AI-generated annotations as contributions from non-expert annotators and combine them using a Learning to Rank framework. Our results show significant improvement in overall annotation quality. The proposed method shows promise for reducing reliance on human annotation while maintaining reliable performance for large-scale biomedical applications.

When Does Retrieval Beat Direct LLM Diagnosis in Rare Disease? An Empirical Study of Ontology Coverage
Mohamed Elmofty | Ulf Leser

Recent high-complexity agentic systems such as DeepRare perform strongly on rare disease diagnosis benchmarks, but it remains unclear when gains come from structured knowledge access and when they come from parametric LLM knowledge. We compare phenotype-based retrieval, LLM reranking, and unrestricted LLM diagnosis across seven benchmarks covering 10,382 cases. We find a clear performance crossover driven by retrieval coverage (the fraction of cases whose true diagnosis is within the retriever’s top-50): on high-coverage datasets, ontology-based retrieval dominates; on low-coverage datasets, open-ended LLM diagnosis takes the lead. Building on this, adding an LLM reranker over retrieved candidates further improves accuracy across our patient-case benchmarks, closing most of the remaining gap to agentic systems (within 2 pp on MME and LIRICAL). We trace the crossover to two structural failure modes of ontology-based retrieval, annotation sparsity and phenotypic homogeneity, and show that aggregate scores across mixed benchmarks can hide these qualitatively different diagnostic settings. These findings motivate per-dataset evaluation and hybrid diagnostic systems that combine retrieval, reranking, and parametric LLM generation based on case characteristics.
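
Retrieval coverage as defined in the abstract (the share of cases whose true diagnosis appears in the retriever's top-50) is straightforward to compute per dataset. The following is a minimal sketch; the data layout, function name, and placeholder case and disease identifiers are assumptions for illustration, and the toy example uses k=2 only to keep the lists short.

```python
from typing import Dict, List

def retrieval_coverage(
    ranked_candidates: Dict[str, List[str]],
    true_diagnoses: Dict[str, str],
    k: int = 50,
) -> float:
    """Fraction of cases whose true diagnosis is among the top-k candidates."""
    if not true_diagnoses:
        raise ValueError("no cases to evaluate")
    hits = 0
    for case_id, truth in true_diagnoses.items():
        top_k = ranked_candidates.get(case_id, [])[:k]
        if truth in top_k:
            hits += 1
    return hits / len(true_diagnoses)

# Placeholder identifiers, not real benchmark data.
ranked = {
    "case_1": ["disease_A", "disease_B", "disease_C"],
    "case_2": ["disease_D", "disease_E", "disease_F"],
    "case_3": ["disease_G", "disease_H"],
    "case_4": [],
}
truth = {
    "case_1": "disease_B",   # inside top-2
    "case_2": "disease_F",   # ranked third, outside top-2
    "case_3": "disease_G",   # inside top-2
    "case_4": "disease_Z",   # retriever returned nothing
}

print(retrieval_coverage(ranked, truth, k=2))
```

Reporting this number per dataset, rather than pooled, matches the abstract's point that aggregate scores can hide qualitatively different diagnostic settings.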

BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs
Nourah Salem | Elizabeth White | Michael Bada | Lawrence Hunter

Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs’ performance with four prompting experiments that vary in their use of local, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries.

A Multi-Agent Open-Source LLM for Structured Cancer Registry Information Extraction from Pathology and Medical Reports
Abdulrahman Aal Abdulsalam | Adhari Al Zaabi | Riham Jeeballah | Habiba El Keraby

Extracting structured cancer registry information from pathology and medical reports is challenging due to heterogeneous reporting styles and implicit clinical reasoning. We propose a modular multi-agent framework that decomposes registry abstraction into semantic chunking, retrieval, field-specific extraction, validation, evaluation, and aggregation stages. The dataset includes 818 annotated cancer cases from Sultan Qaboos University Hospital. Evaluation in this study focuses on breast (n=454) and colorectal (n=174) reports across grade, morphology, TNM staging, and laterality extraction tasks. The framework is compared against prompt-based LLaMA 3.3 baselines using accuracy and weighted/macro F1-score metrics. The proposed framework improved performance in context-dependent tasks, particularly grade extraction, where weighted F1-score increased from 0.71 to 0.78 for breast cancer and from 0.56 to 0.67 for colorectal cancer. Improvements were also observed for colorectal laterality extraction. For other extraction tasks, particularly highly structured tasks such as TNM staging and morphology extraction, the multi-agent framework achieved performance comparable to direct prompting. Although the baseline achieved slightly higher average weighted F1-scores overall, the proposed framework provides improved modularity, traceability, and pipeline-level interpretability through explicit intermediate reasoning stages, supporting error analysis and future clinician-guided refinement.

BioConflict: A Benchmark for Evaluating Large Language Models in Biomedical Contradiction Detection and Consensus Synthesis
Ashwin Kirubakaran | Henry Gagnier

Resolving contradictions in biomedical literature requires more than factual recall; it demands identifying the hidden variables that explain divergent findings. Existing NLI benchmarks such as MedNLI operate at the sentence level and fail to capture document-level conflicts driven by differences in dosage, cell type, or study design. We introduce BioConflict, a benchmark of 250 expert-annotated paper pairs (500 abstracts) across ten biomedical topics, formalizing three tasks: conflict detection, contextual variable extraction, and consensus synthesis. We evaluate five general-purpose large language models and two domain-specific baselines, finding that general-purpose large language models achieve strong conflict detection (F1 up to 0.89) but exhibit brittle reasoning in synthesis, while domain-specific models lag significantly on all generative tasks. These findings highlight the need for context-aware biomedical AI capable of resolving, not merely retrieving, conflicting scientific evidence.

Tokenization Granularity and Medical Term Representations in Language Models
Vojtech Lanz | Pavel Pecina

We investigate how tokenization granularity affects the representation of medical terminology in language models. Prior work links tokenization granularity to downstream performance under contextualized settings for specifically pretrained and fine-tuned models. We instead ask whether this relationship already emerges at the level of isolated term representations across existing pretrained models. We introduce an intrinsic definition retrieval task using UMLS term-definition pairs, with comparison to WordNet. We show that despite substantially heavier fragmentation of medical terminology, the models remain relatively robust in maintaining semantic alignment between medical terms and their definitions. At the same time, tokenization granularity still correlates with retrieval performance, indicating that effects previously observed in downstream biomedical tasks are already reflected at the level of isolated term representations. Encoder models benefit primarily from whole-token preservation, while for decoder LLMs, tokenization effects emerge mainly at deeper retrieval ranks.

CAP: A Source-Grounded Proposition Scaffold for Faithful Clinical Dialogue-to-Note Generation
Hyunkyung Lee | Jisoo Jung | Jeonguk Lee | Jaehyo Yoo | Wooseok Han | Minkyu Kim | Gibaeg Kim

Clinical dialogue-to-note generation is challenging because clinically salient evidence is noisy, distributed across turns, and often revised later in the encounter. Direct transcript-only prompting and coarse intermediate scaffolds can therefore suffer from omissions, section leakage, unsupported fill-in, and brittle final-state tracking. We propose Clinical Atomic Propositions (CAPs), a dialogue-aware intermediate representation for faithful clinical note generation. CAPs extract source-grounded clinical assertions while preserving modifiers such as verification status, temporality, speaker/source, and action type. We also study an optional event consolidation layer that groups CAPs into problem-oriented care bundles before note rendering. We evaluate five methods on a 197-case ACI-Bench cohort: a transcript-only baseline, prompt-based reimplementations of Cluster2Sent and MEDSUM-ENT, CAP, and CAP+Event. The main task uses a sectioned-note template, with SOAP-template rendering and transcript-free rendering reported as ablations. We use MEDSUM-ENT-style GPT-R/P/F1 metrics and a proposition-grounded semCAP-R/P/F1 audit to measure concept-level and source-grounded faithfulness, complemented by case-level win/tie/loss analysis and clinician deep review. Results show that CAP improves preservation of transcript-grounded clinical propositions while remaining competitive on concept-level GPT metrics. CAP+Event is not uniformly better than CAP, but qualitative and boundary analyses show when problem-oriented consolidation can improve organization and when compression can introduce omissions. We release code, prompts, intermediate representations, generated notes, and evaluation artifacts at a public repository.

Segmentation Matters: Exploring LLM-Based Strategies for Temporal Clinical Event Identification in Oncology Reports
Cristiano Bellucci | Francesco Madeddu | Chiara Iacomini | Carlotta Masciocchi | Stefano Patarnello | Massimo Bernaschi | Mario Santoro | Livia Lilli

Processing unstructured clinical narratives remains a major challenge in medical Natural Language Processing (NLP), particularly when critical information is embedded within lengthy and heterogeneous reports. Clinical notes often describe key diagnostic and therapeutic events through a verbose narrative, making automatic event identification difficult. In this work, we frame the identification of clinical events as a text segmentation task. We conduct a comparative study of three segmentation strategies applied to oncology reports: (i) a fully regex-based approach, (ii) a cascaded regex–LLM pipeline, and (iii) the same cascade architecture augmented with a recovery mechanism to mitigate LLM rephrasing. Segmentation quality is evaluated using complementary structural metrics (Pk, WindowDiff, Boundary Similarity, Segment Count Accuracy, and Text Overlap IoU), and its impact is also observed on downstream segment tagging, performed to identify the corresponding event type (e.g., surgery, biopsy, imaging, treatment, laboratory). The results demonstrate the high potential of LLM-based approaches, particularly in preserving semantic coherence within segments and generalizing to new data sources. However, regex-based segmentation achieves higher performance according to structural segmentation metrics, also leading to better downstream clinical event identification. In general, these results highlight the critical role of context-adaptive high-quality segmentation strategies in the structuring of verbose clinical narratives and in the accurate identification of key patient events.
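As an illustration of two of the structural metrics named above, the following is a minimal sketch of Pk and WindowDiff in their commonly used textbook form. It is not the authors' evaluation code: the input format (one segment id per text unit), the function names, and the default window size (half the mean reference segment length) are assumptions made for this example.

```python
def count_boundaries(labels, start, end):
    """Number of segment boundaries between units start..end (end exclusive)."""
    return sum(labels[i] != labels[i + 1] for i in range(start, end))


def default_window(reference):
    """Assumed default: half the mean reference segment length, at least 1."""
    n_segments = 1 + count_boundaries(reference, 0, len(reference) - 1)
    return max(1, int(len(reference) / n_segments / 2))


def pk(reference, hypothesis, k=None):
    """Pk: share of windows where the two segmentations disagree on whether
    the window's two end units fall in the same segment."""
    k = k or default_window(reference)
    n_windows = len(reference) - k
    errors = 0
    for i in range(n_windows):
        same_ref = count_boundaries(reference, i, i + k) == 0
        same_hyp = count_boundaries(hypothesis, i, i + k) == 0
        errors += same_ref != same_hyp
    return errors / n_windows


def window_diff(reference, hypothesis, k=None):
    """WindowDiff: share of windows where the boundary counts differ."""
    k = k or default_window(reference)
    n_windows = len(reference) - k
    errors = 0
    for i in range(n_windows):
        ref_count = count_boundaries(reference, i, i + k)
        hyp_count = count_boundaries(hypothesis, i, i + k)
        errors += ref_count != hyp_count
    return errors / n_windows


# Hypothetical toy data: six text units, one segment id per unit.
reference = [0, 0, 0, 1, 1, 1]
shifted = [0, 0, 0, 0, 1, 1]        # boundary placed one unit too late
oversegmented = [0, 0, 1, 2, 2, 2]  # an extra boundary next to the true one
```

Both scores are error rates, so lower is better. WindowDiff penalizes the extra boundary in `oversegmented` more heavily than Pk does, which is one reason the two are usually reported together.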

Operation-Mechanism Alignment for Reliable Clinical Reasoning over Electronic Health Records
Guanyu Tao | Siyao Wang | Yong Xue | Ashwani Tanwar | Yuting Ji | Kai Sun | Monica Mok | Marzana Chowdhury | Deepa Gupta | Ashok Gupta | Jingqing Zhang | Vibhor Gupta | Yike Guo

Clinical reasoning over electronic health records (EHRs) involves heterogeneous operations, including text interpretation, numerical computation, temporal filtering, and guideline-based aggregation. However, many existing LLM-based approaches still cast these heterogeneous operations as a single end-to-end generation process, obscuring their different reliability requirements and making intermediate failures difficult to inspect. We therefore propose a framework based on operation-mechanism alignment that represents clinical reasoning as a directed acyclic graph of typed operations, where each node is assigned to the execution mechanism best suited to its reliability requirements. The framework also preserves structured evidence provenance for intermediate results. Across six clinician-annotated binary decision tasks, the framework outperforms direct prompting, single-step retrieval-augmented prompting, and chain-of-thought baselines, supporting operation-mechanism alignment as a practical design principle for reliable clinical reasoning over EHRs.

MeSHClass-ES and AnatEM-ES: Open Resources for Spanish Biomedical NLP
Santiago Martinez Novoa | Lina Gomez Mesa | Juan Prieto | Ruben Manrique

Despite Spanish being one of the most widely spoken languages in the world, biomedical NLP resources and systematic evaluations remain limited relative to English. We address this gap by constructing and releasing two Spanish biomedical corpora: (1) **MeSHClass-ES**, a 29,063-abstract bilingual corpus translated from PubMed with Opus-MT, and (2) **AnatEM-ES**, the AnatEM anatomical entity corpus translated with a chunk-level LLM-based pipeline that jointly preserves BIO annotations across 13,849 entity mentions. Both corpora achieve a mean COMET score of 0.73 despite using different translation systems. We benchmark nine encoder models spanning general-domain Spanish, domain-specific, and multilingual architectures for both tasks. RigoBERTa-2.0 leads both tasks (micro-F1 classification 0.69, tied with SciBETO-large; NER F1 0.66). Both domain pretraining and model capacity drive performance, with the gap slightly more pronounced for NER (4-point spread) than classification (3-point spread). XLM-RoBERTa-large emerges as a competitive multilingual baseline. A parallel evaluation of four open-weight decoders (7–9B) reveals a task-dependent encoder-decoder gap: QLoRA-adapted Gemma-2-9B reaches 88% of the best encoder on classification (micro-F1 0.61 vs. 0.69), but for NER every decoder configuration we tested stays at or below 40% of the best encoder F1. We release both corpora on the Hugging Face Hub, and the translation pipelines and evaluation code on GitHub.

When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering
Yikun Han | Mengfei Lan | Halil Kilicoglu

Biomedical retrieval-augmented LLMs are often evaluated under helpful retrieved context, but in practice the evidence can also be misleading or internally conflicting. This paper studies uncertainty under those harder settings using the HealthContradict benchmark and six open-weight models. We evaluate five controlled evidence conditions: no context, correct-only context, incorrect-only context, and two mixed conditions that contain the same correct and contradictory documents in opposite orders. Correct evidence improves both accuracy and calibration, while incorrect evidence substantially degrades both. Under conflicting evidence, document order also matters: reversing the order of the same two documents changes 11.4%–25.2% of predictions and consistently reduces performance when the incorrect document appears first. We further evaluate a conflict-aware abstention score that combines model confidence with a detector of evidence conflict. In the two hardest conditions, incorrect-only and incorrect-first conflict, this score improves selective accuracy over confidence-only abstention, with mean gains of 7.2–33.4 and 3.6–14.4 points across 75%, 50%, and 25% coverage. These results show that biomedical RAG systems should be evaluated not only under helpful retrieval, but also under misleading and conflicting evidence.
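The two evaluation quantities described above, selective accuracy at a fixed coverage and the share of predictions that change when document order is reversed, can be sketched as follows. This is an illustrative sketch only: the abstract does not specify how confidence and the conflict detector are combined, so `conflict_aware_score` below is a hypothetical combination chosen for the example, and all data values are invented.

```python
def selective_accuracy(scores, correct, coverage):
    """Accuracy over the `coverage` fraction of examples with the highest
    scores; the remaining examples are treated as abstentions."""
    n_keep = max(1, round(coverage * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = ranked[:n_keep]
    return sum(correct[i] for i in kept) / n_keep


def conflict_aware_score(confidence, conflict_prob):
    """HYPOTHETICAL combination (not taken from the paper): discount the
    model's confidence by the detector's estimated conflict probability."""
    return confidence * (1.0 - conflict_prob)


def flip_rate(preds_first_order, preds_reversed_order):
    """Fraction of questions whose prediction changes when the same two
    documents are presented in the opposite order."""
    pairs = list(zip(preds_first_order, preds_reversed_order))
    return sum(a != b for a, b in pairs) / len(pairs)


# Invented toy data: the first answer is confidently wrong under conflict.
confidence = [0.9, 0.8, 0.6, 0.4]
conflict = [0.9, 0.1, 0.1, 0.1]
correct = [0, 1, 1, 1]

aware = [conflict_aware_score(c, p) for c, p in zip(confidence, conflict)]
baseline_acc = selective_accuracy(confidence, correct, coverage=0.5)
aware_acc = selective_accuracy(aware, correct, coverage=0.5)
flips = flip_rate(["yes", "no", "yes", "no"], ["yes", "yes", "yes", "no"])
```

In this toy case the confidence-only ranking keeps the confidently wrong answer, while the discounted score abstains on it. The size of the gap depends entirely on the invented numbers and says nothing about the reported results.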

A Comparative Analysis of In-Context Learning and Fine-Tuning for Biomedical Information Retrieval and Sentence Extraction Using Research Domain Criteria
Athlene Jones | Khanh Lieu | Indika Kahanda

Research Domain Criteria (RDoC) is a National Institute of Mental Health framework for studying mental disorders by integrating information across genetics, circuits, and behavior. Manually curating biomedical abstracts relevant to RDoC is a significant challenge due to semantically overlapping construct definitions (e.g., "Acute Threat," "Potential Threat," and "Sustained Threat") and the exponential growth of biomedical literature. This study compares two modeling strategies, domain-adapted fine-tuning and in-context prompting, across two RDoC-related subtasks from the official BioNLP-OST 2019 RDoC shared task. For Task 1, unlabeled PubMed abstracts are retrieved and ranked by relevance to eight of the RDoC constructs. We compare a TF-IDF baseline against ModernBERT and Llama (zero-shot and five-shot) using Mean Average Precision (MAP). For Task 2, the objective is to identify the single most relevant sentence from an abstract for a given construct, evaluated using per-construct accuracy. The fine-tuning track performs end-to-end fine-tuning of BioBERT, PubMedBERT, ModernBERT, and RoBERTa using a cross-encoder input format and per-construct grid search. These are compared against the in-context learning of several open-source language models. Both our approaches are competitive against the best-performing team’s score from the BioNLP-OST 2019 RDoC shared task. Taken together, these findings suggest that five-shot prompted LLMs and domain-adapted fine-tuned transformers are viable tools for semi-automating the expert annotation in RDoC curation.
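For readers unfamiliar with the Task 1 metric, Mean Average Precision can be sketched as below. This is the standard definition, not the shared task's official scorer. The sketch assumes every relevant abstract appears somewhere in the ranked list, and the relevance judgments shown for the two constructs are invented.

```python
def average_precision(ranked_relevance):
    """Average of precision@rank taken at each rank holding a relevant item.
    Assumes all relevant items for the query appear in the ranked list."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-query (here: per-construct) average precision."""
    return sum(average_precision(r) for r in rankings.values()) / len(rankings)


# Invented relevance judgments (1 = relevant) for ranked abstracts.
rankings = {
    "Acute Threat": [1, 0, 1],
    "Potential Threat": [0, 1],
}
map_score = mean_average_precision(rankings)
```

Because precision is accumulated only at ranks holding a relevant abstract, moving a relevant abstract up the list raises the score while reordering irrelevant ones does not.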

Clinical Evidence and Patient Reviews: A Linked Dataset for Antidepressant Side Effects
Steven Au

Clinical sources and patient-authored reviews often describe antidepressant side effects in different ways, but these differences are rarely measured directly. We present ClinPeer-AE, a linked dataset for comparing side-effect evidence from PubMed, ClinicalTrials.gov, WebMD, and Drugs.com while preserving source identity. Across five widely prescribed antidepressants, we find low overlap between clinical and peer sources, large differences in relative emphasis, and evidence that many peer-only effects also appear in U.S. Food and Drug Administration Adverse Event Reporting System (FAERS) reports. These findings suggest that patient reviews provide useful context about recurring medication experiences and offer a complementary view of how side effects are described outside formal clinical settings.

A Deterministic Multi-Stage Retrieval Pipeline for Longitudinal EHR Question Answering
Shubham Agarwal | Thomas Searle | Richard Dobson | Ninoslav Majkic | Niko Moller-Grell

Retrieval-augmented generation (RAG) holds promise for clinical question answering over electronic health records (EHRs), but existing systems treat retrieval as an opaque subroutine, limiting auditability and reliability in patient care workflows. We introduce a deterministic multi-stage retrieval pipeline for longitudinal EHR question answering that decomposes retrieval into four distinct, ablated stages where each stage is instrumented with diagnostic metrics, making the flow of clinical evidence measurable and auditable at every step. Evaluated on a broad LLM-annotated cohort and an expert-annotated cardiovascular benchmark developed alongside clinicians from real ICU records, the full pipeline achieves 22-23% relative recall gain over a strong dense retrieval baseline across both cohorts, with consistent improvements in downstream answer quality. The pipeline’s deterministic and transparent design addresses a critical gap in clinical NLP: retrieval systems that clinicians and researchers can not only rely on, but inspect, audit, and build upon for real-world deployment.

Interpretable ICD Code Classification with Faithful Sentence Extraction
Yichen Wang | Lian Hong | Masato Mizogaki | Shunnosuke Umeda | Toshimune Kenmotsu | Akihiro Tamura | Daniel Andrade

Transformer-based models such as PLM-CA achieve strong performance for automatic ICD coding, but their attention weights do not provide faithful explanations of their predictions. This is a major limitation for electronic medical records, where users often need concise and trustworthy evidence for each assigned code. To address this issue, we jointly train a sentence extractor and an ICD code classifier such that predictions are based only on the extracted sentences. As a result, the extracted sentences serve as faithful rationales for each predicted code and substantially reduce the effort required to inspect long medical records. Experiments on MIMIC-III show that our method approaches the performance of a transformer baseline that processes the full record while using only a small fraction of the document.

Evaluating LLM-as-a-Judge for Medical Term Simplification
Ioana Buhnila | Aman Sinha | Rohit Agarwal | Dilip K. Prasad | Mathieu Constant

Highly technical medical terms are difficult for patients to understand during fast-paced hospital consultations, leading them to rely on Large Language Models (LLMs) for simplified explanations. However, LLMs can produce inaccurate or false information. Since expert evaluation is costly and time-consuming, the LLM-as-a-Judge (LaaJ) approach is increasingly adopted to assess the quality of LLM-generated text. In this paper, we investigate the reliability and robustness of LaaJ for specialized medical knowledge by evaluating six LLMs for their judgment capabilities on three dimensions: correctness, readability, and completeness. We use three judgment setups (Vanilla, Epistemic, and Bias) to probe robustness, and assess them against human expert annotations to measure alignment. To address the lack of specialized medical benchmarks, we introduce BrainCancerDB, an English dataset of 219 brain cancer terms with 23,652 annotations. Our findings indicate that while LLM-Judges and humans display similar trends in ranking simplified explanations, LLM-Judges tend to be more lenient on correctness, which may have serious implications in medical settings. Additionally, we observe that hallucinations in LaaJ setups can be mitigated by epistemic markers.

FACT: Functional Group Alignment and Consistency in Token Space for Structure-aware Molecular Representation Learning
Hyeonyeong Nam | Woojae Choi | Deok-Joong Lee | Young-Han Son | Sangwoon Lee | Bogyeong Kang | Eunjung Jo | Tae-Eui Kam

Molecular representation learning aims to capture chemically meaningful structures for various downstream tasks such as accurate molecular property prediction. However, incorporating functional group (FG) information into SMILES-based models remains challenging. The absence of explicit alignment between graph-defined FG atom sets and tokens in sequence prevents complete substructure masking, while multiple valid SMILES forms of the same molecule lead to inconsistent FG representations in token space. To address these challenges, we propose FACT (Functional Group Alignment and Consistency in Token Space), an end-to-end framework for structure-aware SMILES-based representation learning. FACT introduces an atom–token alignment module for complete FG span masking during pre-training and enforces FG consistency across different SMILES forms during fine-tuning. Experiments on MoleculeNet benchmarks show that FACT achieves state-of-the-art or competitive performance on eight tasks, demonstrating the effectiveness of alignment and consistency learning for molecular representation.

Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization
Gaurav Kumar

Large Language Models such as GPT-4o and GPT-5 achieve strong zero-shot performance on biomedical claim verification, but cost and opacity limit scalable use. We fine-tune three small LLMs, Phi-3-mini (3.8B), Qwen2.5-3B, and Mistral-7B, via QLoRA on SciFact and HealthVer, providing the first study of QLoRA models against GPT-4o and fine-tuned BioLinkBERT encoders. Mistral-7B QLoRA achieves higher F1 than both GPT-4o and GPT-5 (up to 12% gain) at 44.5x lower cost using just 1,008 training examples, representing a compelling cost-quality trade-off. We conduct extensive in-domain and cross-domain evaluation: models trained on SciFact are tested on HealthVer and vice versa, at matched sizes to isolate dataset structure from data quantity. We identify a previously unreported structural artifact in SciFact that inflates in-domain scores, and show through bidirectional out-of-domain evaluation that training on structurally sound data enables robust cross-domain transfer. We plan to release all code and adapter checkpoints.

Diagnosable ColBERT: Debugging Late-Interaction Retrieval Models Using a Learned Latent Space as Reference
François Remy

Reliable biomedical and clinical retrieval requires more than strong ranking performance: it requires a practical way to find systematic model failures and curate the training evidence needed to correct them. Late-interaction models such as ColBERT provide a first solution thanks to the interpretable token-level interaction scores they expose between document and query tokens. Yet this interpretability is shallow: it explains a particular document–query pairwise score, but does not reveal whether the model has learned a clinical concept in a stable, reusable, and context-sensitive way across diverse expressions. As a result, these scores provide limited support for diagnosing misunderstandings, identifying unreasonably distant biomedical concepts, or deciding what additional data or feedback is needed to address them. In this short position paper, we propose Diagnosable ColBERT, a framework that aligns ColBERT token embeddings to a reference latent space grounded in clinical knowledge and expert-provided conceptual similarity constraints. This alignment turns document encodings into inspectable evidence of what the model appears to understand, enabling more direct error diagnosis and more principled data curation without relying on large batteries of diagnostic queries.

Developing Literature Annotation Guidelines for Representing Normal Physiology in Biolink-Compatible Knowledge Graphs
Madeline Bittner | Willie Rogers | Dina Demner-Fushman | Richard Scheuermann | Matthew Diller

Much of our knowledge about anatomy and physiology is found in text format in research papers and medical textbooks. For an information system to have access to this knowledge, extracting and translating it into a computable format that can be stored in an ontology or knowledge graph is advantageous. Unfortunately, existing text mining corpora, which are needed to train and evaluate data mining models, are old and consist almost entirely of research papers, which rarely contain complete information needed to capture complex normal physiological processes and, subsequently, understand the pathophysiology of a disease. As a first step to filling in this gap, we have developed a guide for annotating medical textbooks for physiological events and entities involved in these events. In addition to providing our guidelines and describing the guideline development process, we analyze the coverage of normal physiology in existing ontologies.

CENT: Context Engineering Framework for Normalization of Clinical Trial Procedures
Sanya Taneja | Ziqing Ji | Hans Verstraete | Ali Samadani

Clinical Concept Normalization is essential for clinical research applications involving trial protocols, such as patient-trial matching. Existing approaches focus heavily on specific domains and need large, annotated datasets. To address these challenges, we propose CENT, a context engineering framework that combines semantic matching for candidate retrieval and Large Language Model (LLM) prompting for disambiguation. We applied CENT to a publicly available dataset of procedures normalized to Current Procedural Terminology (CPT) concepts and evaluated the framework using both binary metrics and hierarchical metrics that take into account the hierarchical characteristics of predicted codes. CENT achieves superior performance on clinical procedure normalization in both binary and hierarchical metrics compared to semantic matching or LLM-only approaches, without requiring fine-tuning. Advanced prompt strategies, including Chain-of-Thought and Tree-of-Thoughts, achieve high performance at practical cost. We further applied CENT to predict codes in two clinical protocol-derived datasets to validate its performance on noisy procedure texts. CENT is scalable and adaptable for normalization across diverse clinical vocabularies in real-world clinical applications.

Agentic AI Architectures for SOAP Note Generation
Keno Hanken

Clinical documentation places significant time demands on medical professionals, consumes institutional resources, and is prone to errors that may compromise patient care. Recent advances in LLMs offer promising approaches for automating clinical note generation; however, the impact of different AI architectural designs remains underexplored, particularly for agentic AI systems. This study compares three architectures (single-LLM, multi-agentic, and swarm-agentic) for automated SOAP (Subjective, Objective, Assessment, Plan) note generation from doctor–patient dialogues. All approaches employ QLoRA-finetuned Ministral 3 models (3B and 8B parameters) trained on the MedSynth dataset, comprising 10,030 dialogue–note pairs across 2,006 ICD-10 code classes. Performance is evaluated using ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore against a lexical-overlap baseline (dialogue vs. ground-truth SOAP, no inference). Results show that all finetuned models substantially outperform the baseline, while differences between architectural variants remain marginal. The single-LLM setup achieves the strongest performance across all metrics; 3B and 8B variants perform nearly identically on semantic similarity (BERTScore), while ROUGE differences are small but statistically significant. Qualitative inspection further reveals that residual differences across architectures are driven primarily by shared dataset priors rather than by architectural reasoning capacity. The results are based on synthetic data without human evaluation and reflect architectural behavior only.

VERICITE: Evaluating Sentence-Level Citation Faithfulness in Retrieval-Augmented Medical Question Answering
Yixian Ma | Bohao Chu | Norbert Fuhr

Retrieval-augmented generation (RAG) reduces hallucination in large language models by grounding outputs in retrieved evidence, but it does not guarantee that the resulting citations support the associated claims. We present VERICITE, a framework for evaluating citation faithfulness in retrieval-augmented medical QA. Our system retrieves PubMed abstracts via the NCBI E-Utilities API, prompts LLMs to generate answers with inline citations, and verifies each citation at the sentence level using a DeBERTa-v3-large NLI model. We evaluate four LLMs on 500 BioASQ questions at retrieval depths of 3 and 5, with extended experiments up to k = 15 and an oracle setting with gold standard documents. Only 27–41% of citation pairs are supported at the sentence level at retrieval depths of 3 and 5, with support rates declining further at larger k. Under the oracle condition, answer quality improves, but citation faithfulness does not substantially improve, suggesting that generation-side citation behavior contributes substantially to unfaithful citations.

Overview of the Medical Decision Extraction, Analysis, and Classification Task (MedExACT) of BioNLP 2026
Mohamed Elgaar | Jiali Cheng | Nidhi Vakil | Mehrnaz Sadrolashrafi | Mitra Mohtarami | Adrian Wong | Hadi Amiri | Leo Celi

This paper presents an overview of the Medical Decision Extraction, Analysis, and Classification task (MedExACT) of BioNLP 2026. The focus of this task is the extraction and labeling of medical decisions in ICU discharge summaries. The task is built on MedDec, a MIMIC-III-based dataset of 451 expert-annotated summaries, and asks systems to extract and classify spans of text that contain medical decisions according to the decision categories defined in the Decision Identification and Classification Taxonomy for Use in Medicine (DICTUM). The official ranking combines span F1 and token F1 with a worst-group robustness metric computed over sex, race, and English-proficiency subgroups. MedExACT attracted broad international interest, with 130 official submissions from 36 teams comprising about 60–100 participants, and has improved information extraction performance by nearly 15% over the previous state of the art. The submitted systems predominantly use long-context encoder models, ensemble decoding, boundary-refinement modules, and robustness-aware training or model selection, with the best submitted run reaching a final fairness-based F1 of 0.596.

Divide-Prompt-Refine: a Training-Free, Structure-Aware Framework for Biomedical Abstract Generation
Sylvey Lin | Joseph Menke | Shufan Ming | Dongin Nam | Neil Smalheiser | Halil Kilicoglu

Biomedical abstracts play a critical role in downstream NLP applications, such as information retrieval, biocuration, and biomedical knowledge discovery. However, a non-trivial number of biomedical articles do not have abstracts, diminishing the utility of these articles for downstream tasks. We propose DPR-BAG (Divide, Prompt, and Refine for Biomedical Abstract Generation), a training-free, zero-shot framework that generates coherent and factually grounded abstracts for biomedical articles with full text but no abstract. DPR-BAG decomposes full-text documents into structured rhetorical facets following the Background-Objective-Methods-Results-Conclusions (BOMRC) schema, performs parallel LLM-based summarization for each facet, and applies a final refinement stage to restore global discourse coherence. On PMC-MAD, a distribution-aligned dataset of 46,309 biomedical articles, DPR-BAG improves abstractive novelty over strong extractive and fine-tuned baselines, while maintaining factual consistency. Our ablation study reveals a counterintuitive finding: increasing prompt complexity or explicitly injecting entity-level guidance can degrade factual alignment, highlighting the importance of controlled prompting strategies. These findings underscore the potential of training-free, structure-aware frameworks for scalable biomedical abstract generation in low-resource settings. Our data and code are available at https://huggingface.co/datasets/pmc-mad/PMC-MAD and https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/DPR-BAG.

AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction
Fabien Maury | Solène Grosdidier | Maud De Dieuleveult | Adrien Coulet

Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity where the main entities of interest are autoimmune diseases, autoantibodies (i.e., molecules that may mark or cause these diseases), their molecular targets, their location in the body, and the associated clinical signs. Herein, we present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed that we manually annotated for those entities and their relationships. First, AAbAAC was used to evaluate several methods on the task of named entity recognition (NER), and second, to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity, showing expected improvement in NER performance after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at: https://github.com/f-maury/AAbAAC.

What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework
Robert Leaman | Rezarta Islamaj | Zhiyong Lu

Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.

Towards Grounded Hallucination Definitions for Biomedical Question Answering with Reproducible Examples from ClinIQLink
Brandon Colelough | Davis Bartels | Madeline Bittner | Dina Demner-Fushman

Hallucinations in biomedical question answering are hard to define and compare because the literature uses overlapping and inconsistent terms. There is currently no grounded definition set that works for biomedical QA, with real examples from open-source LLMs. We introduce a layered definition of hallucinations for biomedical QA, hierarchically structured from the overarching idea of Hallucination in relation to generated model content, to source and consistency orientations, and finally to subtypes. We ground our definition taxonomy in source-attributed literature definitions and reproducible examples from ClinIQLink, where cases can be traced to the question, source passage, generated answer, and annotation record. We provide a framework with annotation, comparison, and error analysis to provide a clearer reference for evidence-grounded biomedical QA. We aim for this example-grounded taxonomy to support automated detection of hallucinations and their potential harmfulness.

Can NLP Models Detect When One Publication Outweighs Twenty? Predicting Systematic Review Conclusion Changes
Ebrahim Alharbi | Mark Stevenson

Systematic reviews underpin evidence-based medicine but can become outdated quickly when new evidence appears. We formulate a novel prediction task: given a review and new studies that have appeared since its publication, predict whether the review’s conclusions will change. A dataset of 3,326 Cochrane review-update pairs is constructed, and a range of approaches is explored, including feature-based baselines, zero- and few-shot LLMs, and parameter-efficient fine-tuning. Fine-tuning Qwen2.5 14B achieves the highest AUC-ROC (70.4%).

VaxScope: Document-Level Structured Evidence Extraction from Immunization Systematic Reviews
Bahar Ilgen | Ebenezer Awotoro | Georges Hattab

Systematic reviews are fundamental to evidence-based medicine, but the clinical evidence they contain is primarily expressed in unstructured text, making large-scale extraction and reuse difficult. Existing biomedical NLP methods have achieved strong performance on span-level extraction from clinical trials and abstracts; however, these approaches are insufficient for systematic reviews, where evidence is often distributed across multiple studies, sentences, and sections and must be aggregated into normalized document-level attributes. We introduce VaxScope, a benchmark dataset for document-level structured evidence extraction from immunization-related systematic reviews. VaxScope is constructed through an expert-guided semi-automatic annotation pipeline that combines automatic candidate generation with domain expert validation to ensure consistency and annotation quality. We formalize the task as document-level structured extraction, where target labels are defined at the review level and require aggregating evidence beyond isolated textual spans. We further establish baselines for document-level structured extraction using abstract-level input representations and evaluate how access to evidence-grounded contextual input improves performance over abstract-only settings. Baseline experiments show that PubMedBERT achieves the best overall performance (Avg F1: 0.850), with evidence-grounded input improving performance particularly for fields requiring distributed contextual reasoning.

Medical Context Variation: A source of impairment for Event classification
Aman Sinha | Marianne Clausel | Mathieu Constant | Xavier Coubez

The variation in writing style encapsulates nuanced characteristics, which are often exploited for author or demographic identification. In the medical domain, language models are frequently deployed to capture relevant information from unstructured or complex data, such as clinical notes that often include patients’ medical histories. Such data is largely free-form and unstructured, obtained through diverse clinician–patient interactions. In this work, we present a case study investigating whether variations in clinicians’ writing styles can lead to differences in medical context understanding capabilities for pre-trained language models (PLMs) on downstream tasks, such as medical event classification. Our findings indicate that variation in writing style, characterized by linguistic features, can indeed lead to suboptimal performance in deployed systems. Furthermore, we explore linguistically guided counterfactual reasoning to mitigate the impact of writing style variation; the results suggest that LLM-based stylistic normalization is effective for this purpose.

KALIMBA: Knowledge-Assisted Literature Mining for Biological Interaction Analysis
Niloofar Arazkhani | Maciej Kotecki | Brent Cochran | Natasa Miskov-Zivanov

The exponential growth of biomedical literature has made manual curation of biological interaction networks increasingly difficult. Existing automated biological interaction extraction systems address the scaling challenge but treat extraction as a final step, delivering structured output with limited or no integrated support for biologists to interactively verify, correct and contextually interrogate extracted interactions against their source evidence within the same environment. We present Knowledge-Assisted Literature Mining for Biological Interaction Analysis (KALIMBA), an end-to-end, human-in-the-loop platform that integrates three complementary extraction methods (NLP-only, LLM-only, and hybrid) alongside expert annotation and evidence-grounded conversational querying through a retrieval-augmented generation (RAG) chat module driven by a dual-context prompt, within a single unified workflow. Evaluation on a corpus of 40 signaling-focused papers demonstrates that the LLM-only back-end recovers substantially more interactions than the NLP-only approach. RAG chat evaluation by a domain expert confirms that the conversational module provides scientifically grounded responses that support curation decisions beyond what the structured interaction table alone conveys.

When Retrieval Doesn’t Help: A Large-Scale Study of Biomedical RAG
Erfan Nourbakhsh | Rocky Slavin | Ke Yang | Anthony Rios

Medical question answering is a high-stakes setting where factual errors can have serious consequences. Retrieval-augmented generation (RAG) is widely viewed as a promising solution, and prior work has reported substantial gains for large medical QA models. We revisit this assumption across a broad range of open-weight instruction-tuned models spanning 7B to 72B parameters. Across five models, ten biomedical QA datasets, four retrieval methods, and four retrieval corpora, we find that retrieval yields only small and inconsistent improvements over a no-retrieval baseline, typically within 1–2 points. In contrast, the choice of backbone model has a much larger effect than the choice of retriever or corpus, and expert and layman retrieval sources perform similarly in most settings. These results suggest that the main bottleneck is not retrieval quality alone, but the model’s limited ability to use retrieved evidence effectively.

CrossDDI: Cross-Source Evidence-Grounded Drug-Drug Interaction Verification
Bohao Chu | Norbert Fuhr

LLM-based drug–drug interaction (DDI) assessment remains difficult to audit when predictions are not explicitly tied to evidence. While retrieval-augmented generation (RAG) improves grounding, predictions are not guaranteed to be entailed by retrieved items. We present CrossDDI, a verification-first framework that separates LLM-based evidence extraction from deterministic, LLM-free arbitration over DrugBank and PubMed, requiring positive predictions to be linked to explicit supporting evidence. Evaluated on 1,000 DDInter 2.0 pairs under a positive–unlabeled setting, CrossDDI achieves recall of 0.576–0.593 over confirmed positives with interaction prediction rates comparable to RAG, while reducing cross-backbone variation (0.018 vs. 0.066). Analysis identifies literature evidence acquisition and attribution as the primary bottleneck: PubMed retrieval covers only 40.5% of confirmed positives, and Path B-only evidence is substantially less reliable than structured evidence. These results suggest that verification-first architectures can improve traceability and backbone consistency, while broader and more reliable literature evidence is needed to extend coverage beyond structured sources.

GRAFT: Gated Retrieval-Augmented Fine-Tuning for Relation Extraction
Yuhang Jiang | Ramakanth Kavuluru

Even in the era of large language models (LLMs), biomedical relation extraction (RE) still plays a major role in timely creation of knowledge graphs that further guide biomedical knowledge discovery. The main task in RE is to extract a relation "as expressed" in an input text. At times, crucial definitional information or other auxiliary information about the entities involved may be missing from the input text. Augmenting it from other external textual sources appears helpful on the surface but can be harmful too, as these sources can overwhelm the signal in the original input, leading to false positives or false negatives. To counter this, we leverage a pre-trained biomedical text retriever to augment original inputs with additional instance-specific snippets. This is done through a gating mechanism that allows the retrieved snippets to enhance but not overwhelm the signal from the original input. We evaluate our approach on three standard biomedical relation extraction datasets (CDR, BioRED, and ChemProt) and show consistent improvements (up to 10 F1 points) compared with strong supervised baselines involving both encoder and decoder models. All our code and the datasets used are available for reuse: https://github.com/bionlproc/GRAFT-RE.

Overview of the PsyDefDetect Shared Task at BioNLP 2026: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations
Hongbin Na | Zimu Wang | Zhaoming Chen | Yining Hua | Rena Gao | Kailai Yang | Ling Chen | Wei Wang | Shaoxiong Ji | John Torous | Sophia Ananiadou

We present an overview of PsyDefDetect, the shared task on detecting levels of psychological defense mechanisms in emotional support dialogues, co-located with BioNLP@ACL 2026. Grounded in the clinically validated Defense Mechanism Rating Scales (DMRS) framework, the task asks systems to classify a target seeker utterance, given its preceding dialogue context, into one of nine categories: seven hierarchical DMRS levels plus two auxiliary labels. Participants worked on PsyDefConv, a newly released corpus of 200 dialogues and 2336 help-seeker utterances annotated under DMRS with substantial inter-annotator agreement. The task attracted 172 participants on CodaBench who produced 563 submissions, with 21 teams officially registering their results for the final ranking. The best system achieved a macro F1-score of 0.420, surpassing the strongest fine-tuned baseline reported in the dataset paper by a notable margin, yet leaving clear headroom. Our analysis highlights (i) a persistent tendency to over-predict the majority High-Adaptive class, (ii) a widening gap between accuracy and macro-F1 that reveals class-imbalance sensitivity, and (iii) the value of theory-aware and LLM-based approaches for fine-grained defensive-function classification. We release all task materials and invite the community to continue work on this novel intersection of clinical psychology and NLP.

SCoPE: Planning for Hybrid Querying over Clinical Trial Data
Suparno Chowdhury | Manan Choudhury | Tejas Anvekar | Muhammed Khan | Kaneez Khakwani | Mohamad Sonbol | Irbaz Riaz | Vivek Gupta

Systematic reviews of clinical trials require analysts to extract attributes that are rarely stored as ready-made columns, for example the drug class of an immunotherapy named in a regimen, the additional agents combined with it, or whether a listed endpoint is a primary or secondary outcome. These attributes must be inferred from the visible content of other fields through normalization, classification, or structured extraction, and existing approaches such as direct LLM prompting, text-to-SQL, and agentic pipelines leave this reasoning implicit in a single generation step or pay a heavy execution cost for limited accuracy gains. We propose SCOPE (Structured Clinical hybrid Planning for Evidence retrieval in clinical trials), a multi-LLM planner-based framework that decomposes the task into row selection, structured planning, and execution. The planner makes the source field, reasoning rules, and output constraints explicit before answer generation, reducing ambiguity relative to direct prompting. We evaluate SCOPE on 1,500 hybrid reasoning questions over oncology clinical-trial tables against zero-shot, few-shot, chain-of-thought, TableGPT2, BlendSQL, and EHRAgent. Results show that explicit multi-LLM planning improves accuracy for reasoning-based questions while offering a stronger accuracy-efficiency tradeoff than heavier agentic baselines. Our findings position clinical trial reasoning as a distinct table understanding problem and highlight hybrid planner-based decomposition as an effective solution.

Expert-Guided Schema-Based Structured Extraction from CONSORT Diagrams Using Vision-Language Models
Damian Stachura | Bartosz Przechera | Monika Opałek | Ewelina Sadowska | Ewa Borowiack | Artur Nowak

Visual-language models (VLMs) are rapidly advancing on tasks that require visual understanding of text, tables, plots, and diagrams. Yet extracting structured information from text-heavy scientific diagrams remains challenging, as it requires not only OCR but also recovery of layout, grouping, and flow relationships. We study this problem in the context of CONSORT flow diagrams, which summarize participant screening, randomization, follow-up, and analysis in randomized controlled trials. We introduce a 200-example benchmark of PubMed Central diagrams, annotated by a biomedical team specializing in systematic literature reviews and clinical evidence extraction, and evaluate schema-constrained CONSORT extraction across proprietary and open-weight model families. Using structure-aware metrics, we compare single-pass and stepwise extraction strategies. Expert-guided single-pass extraction performs best for proprietary frontier models, with Gemini 3 Pro achieving the strongest overall results, whereas stepwise prompting improves less capable open-weight models on challenging arm-level extraction. These results offer practical deployment guidance and suggest that high-quality schema-constrained extraction is feasible, but not yet solved.

From Rules to Predictions: Federated Tabular Learning with LLM Reasoning
Afsaneh Mahanipour | Hana Khamfroush

Tabular data is widely used in important areas such as healthcare and finance, but building accurate models in real-world settings faces three main challenges: protecting data privacy, handling distributed data, and maintaining strong performance. Existing methods do not solve these issues together. Converting tabular data into text for Large Language Models (LLMs) can expose sensitive information, struggle with anonymized features and exact numerical values, and require expensive training while often not outperforming traditional tree-based models. In addition, many real-world datasets are spread across different institutions, making centralized training impossible. We propose a federated framework that connects distributed tabular data with LLM reasoning using decision tree rules as privacy-preserving intermediaries. Each client trains a local Random Forest and shares only extracted rules (feature comparisons and thresholds), without revealing raw data. These rules are combined into a global pool, allowing an LLM to generate a better partitioning rule without accessing any original data, adding an extra layer of privacy. Using this rule, each client learns local gradient-based corrections, which are then aggregated. We also show that this process reduces prediction error. Experiments on 12 datasets, including seven medical tasks, show that our method consistently outperforms federated baselines and achieves results close to centralized models.

MedBench: Deliberative Evaluation of Medical Language Models
Pratik Jalan | Mukul Joshi | Akhilesh Magotra | Kshitij Jadhav

We introduce MedBench, a benchmark for evaluating medical language models as deliberating agents rather than isolated predictors. MedBench evaluates eight models (4B–32B) on 19,625 questions from six medical QA datasets using Consensus-Aware Model Panel (CAMP), a two-tier protocol in which five 4B–8B models answer independently, revise after observing peer reasoning, and escalate persistent disagreements to larger 20B–32B models. Compared with zero-shot, few-shot, and chain-of-thought baselines, CAMP shows that deliberation is not uniformly accuracy-improving, but reveals interaction-driven behaviors hidden by single-model evaluation. On PubMedQA without external context, the 4B–8B panel outperforms the evaluated 20B–32B individual zero-shot models (54.1% vs. 33.9%), and achieves the best evaluated result with context (75.7%), suggesting that structured interaction can sometimes complement scale. Across five datasets, initial inter-model agreement is positively associated with correctness and serves as a useful difficulty signal. However, on MedXpertQA, unanimous agreement yields only 6.6% accuracy despite 14.4% overall accuracy, suggesting correlated ignorance, where shared biases make consensus misleading. Error analysis shows that most failures are debate-insufficient cases, where incorrect majorities persist despite interaction (93–97%), while debate-harmful cases account for 3–7%. MedBench positions deliberative evaluation as a complement to accuracy-centric benchmarking, measuring when model interaction corrects errors, reinforces shared mistakes, or signals the need for stronger evidence and human review.

Fast, Accurate, and Local Conversion of MIMIC-IV to OMOP with DBT
Adam Sutton | Niko Moller-Grell | Thomas Searle | Richard Dobson

dbt mimic omop is a free, open-source resource that converts the MIMIC-IV dataset to the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) format on consumer-level hardware. CDM approaches are increasingly adopted in both industry and academia due to the need for interoperability and reproducibility, including in clinical NLP tasks such as cohort selection, information extraction, and retrieval-augmented generation. The MIMIC-IV database is among the most widely used critical care research datasets, yet existing pipelines to transform it to OMOP depend on enterprise database infrastructure and complex orchestration, limiting accessibility for practitioners and resource-constrained researchers. We further integrate free-text clinical notes (195.6M clinical annotations) and chest radiographs into the OMOP note_nlp and imaging extension tables, making all MIMIC-IV modalities (structured data, free-text, and imaging) accessible through a common data model. This resource generates a more comprehensive dataset than existing alternatives and is intended to be used to aid in system development, testing, and evaluation.

Exploring Novel Drug Research Area using Large Language Models Based on Research Trends in Biomedical Literature
Afnan Afnan | Michael Van Supranes | Tomohiro Nishiyama | Shoko Wakamiya | Eiji Aramaki

The rapid expansion of biomedical literature makes manual identification of novel drug-disease relationships increasingly difficult. Existing approaches have leveraged LLMs to mine abstracts or construct knowledge graphs for drug repurposing. They have two key limitations: finite context windows hinder the capture of macro-level research trends, and single-pass black-box pipelines make outputs difficult to verify. This paper proposes a pipeline for discovering new drug targets by combining disease and drug research trends using Large Language Models (LLMs). Our method extracts PICO components from PubMed abstracts, normalizing the Population and Intervention components to ICD and ATC codes, respectively. A temporal frequency delta matrix is constructed to capture publication count shifts from 2013 to 2022, then used to discover novel drug areas. Compared with the abstract-based baseline, our approach showed qualitative signs of generating combinations that were more closely aligned with observed research trends and, in some cases, more clinically plausible. These findings suggest the potential usefulness of structured trend information for LLM-based exploration, although the differences between the two methods were limited and the results remain preliminary. Future work will focus on validating the consistency and reliability of these candidates.

FHexchange: Resources for Family Health History Extraction and Normalization From Consumer Dialog Sources
Michelle Nguyen | Nidhi Soley | Ayah Zirikly | João Sedoc | Casey Taylor

Family health history (FHx) offers insight into a person’s health and disease risk, but it is largely held within free-text clinical formats that require processing for maximal utility of the data. The rapid deployment of ambient AI scribes and conversational agents in clinical settings necessitates evaluation on dynamic patient-clinician and patient-agent dialogs. To address this gap, we introduce two new datasets of patient FHx dialog documents designed to benchmark information extraction and entity linking. Distinct from clinician-entered datasets, patient-reported dialog data has its own semantic and content characteristics, which need to be studied for more patient-centered healthcare. We contribute a publicly available resource called FHexchange, with new annotations for family members, clinical observations, related entities, and standardized UMLS CUIs, offering the clinical NLP community a robust evaluation bed for emerging generative AI tools.

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech
Rez Samantha Floresca | Edric Castel Hao | Hannah Grachiella Buñales | Chelsea Dominique Temprosa | Georgianna Reyes | Kervin Gabriel Chua

Dementia detection from spontaneous speech offers a scalable approach to cognitive screening, yet NLP systems remain predominantly English-centric. This limitation is especially acute in the Philippines, where Filipino–English code-switching is pervasive and no prior work has addressed NLP-based dementia detection. We present the first systematic evaluation of transformer-based dementia detection in Filipino speech and the first assessment of NeoBERT in a clinical NLP setting. To separate language from domain effects, we construct a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts, with Filipino translations produced manually to preserve discourse-level markers of cognitive decline. We evaluate five model families, TF-IDF + LogReg, BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog, under monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. We find that in-domain performance does not transfer across languages, with English-trained BERT dropping to Macro-F1 = 0.455 on Filipino, and that architectural modernization alone does not improve robustness. Bilingual fine-tuning, however, eliminates cross-lingual degradation across all transformer models, converging to Macro-F1 = 0.969–0.973. These results suggest that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than model scale or architecture.

IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
Shubham Nigam | Suparnojit Sarkar | Piyush Patel

We present IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages (Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu). The dataset extends the MDDial corpus with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors introduced during automatic translation. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation (LoRA) of a quantized small language model, incorporating an optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate IndicMedLM against zero-shot multilingual baselines across ten languages and conduct systematic error analysis, identifying five failure modes: Instruction Drift, Label Collapse, Cross-Domain Confusion, Tokenization Failure, and Paraphrase-over-Label Generation. Results show strong post-processed diagnostic accuracy in Hindi, Marathi, and Bengali, while Assamese, Tamil, and Telugu remain in an extreme failure tier attributable to base-model tokenizer gaps, a finding with direct patient safety implications. Medical expert evaluation confirms the clinical plausibility and safety of the generated consultations.

Towards a Radiologist Imitation Framework for 3D CT Diagnosis with Multimodal LLMs
Kaidi Zhang | Zhiyuan Yan | Gao Cheng | Zhenyang Cai

Three-dimensional Computed Tomography (3D CT) is a cornerstone of precision medicine. Most AI diagnostic models analyze large numbers of CT slices uniformly, treating all slices as equally important. While this has partly accelerated radiologists’ workflows, it overlooks that clinically relevant information is often sparsely distributed throughout a volume. Without targeted or weighted processing, fine-grained cues may be missed and substantial computation wasted on diagnostically uninformative slices. We propose a radiologist-simulating framework for selective and efficient 3D CT interpretation. Evaluated on a 3D CT dataset covering eight thoracic lesion types, it was compared with state-of-the-art multimodal large language models such as GPT-4o and supervised visual backbones including ViT and ResNet-50. Using accuracy, F1-score, AUC, and blind radiologist assessment, Screen-CLIP achieved an AUC of 0.87 and F1-score of 0.82, surpassing ViT Base (AUC: 0.84). For report generation, our method outperformed M3D across all metrics, reaching a BLEU-Avg of 29.03, and achieved the highest average Doctors’ Score (6.16/10) in a preliminary human evaluation.

Probing and Steering Uncertainty in Biomedical Language Models: Representational Structure and Behavioral Limits
Debmalya Pal

Biomedical language models can generate overly confident clinical statements despite incomplete or ambiguous evidence. We study whether linguistic uncertainty (the hedged epistemic stance expressed in phrases such as "consistent with" or "cannot exclude") is encoded in model representations and can be controlled without retraining. Across six biomedical language models spanning two architectures (causal decoders and bidirectional encoders), we show that uncertainty is captured by robust low-dimensional linear structure in hidden states. We then apply activation steering to manipulate this representation directly, increasing hedged generation in decoder models and inducing targeted uncertainty-related shifts in encoder representations. Together, these results show that epistemic stance is not merely a surface linguistic phenomenon but an interpretable and controllable feature of biomedical language model representations, with implications for safer and more calibrated clinical text generation.

Relations of Linguistic Features and Medical Text Preferences are Nontrivial
Davis Bartels | Brandon Colelough | Dina Demner-Fushman

We study how simple linguistic features relate to reader preferences in medical question answering. Our dataset contains answers to medical questions ranked in order of quality. We examine eight interpretable features of the answer text: length in words, average words per sentence, percentage of polysyllabic words, medical named entity density, perplexity, coherence, and dependency distance. We find substantial variation across annotators in both the strength and direction of these relationships. Answer length shows some of the strongest associations and predictive signals, but preferences are not consistent across annotators, with some favoring longer answers and others favoring shorter ones. A leave-one-out ablation study shows the relative impact on the predictive accuracy of our models. Overall, these results suggest that linguistic form can influence reader preference in medical text, but that these effects vary across readers and may be more complex than simple linear correlations.

Overview of the MedGenVidQA 2026 Shared Task on Medical Generative Video Question Answering
Deepak Gupta | Collin Campbell | Pedram Golnari | Dina Demner-Fushman

This paper presents an overview of the MedGenVidQA 2026 shared task on medical video question answering, collocated with the 25th BioNLP workshop at ACL 2026. The shared task addressed three related sub-tasks of the medical multimodal (textual and video) question answering: (i) multimodal retrieval tasks, (ii) multimodal answer generation with citations, and (iii) a visual answer localization task. The key theme of the stated task is to develop reliable multimodal question answering systems for consumers and medical professionals by leveraging generative models. A total of nine teams participated in the shared task challenges and submitted a total of forty-three submissions across all tasks. We performed both automated and human assessments to evaluate the submissions. This paper describes the tasks, datasets, evaluation metrics, participation, and baseline systems for all three tasks. Additionally, we summarize the techniques and results of the evaluation of the various approaches explored by the participating teams. Finally, we discuss the key findings and implications for the development of multimodal medical question answering.

Overview of the ClinicalSkillQA 2026 Shared Task on Continuous Perception and Procedural Reasoning in Clinical Skill Assessment
Xiyang Huang | Renxiong Wei | Yihuai Xu | Zhiyuan Chen | Keying Wu | Jiayi Xiang | Buzhou Tang | Yanqing Ye | Jinyu Chen | Cheng Zeng | Min Peng | Qianqian Xie | Sophia Ananiadou

This paper presents an overview of the ClinicalSkillQA 2026 shared task, which was organized with the BioNLP Workshop at ACL 2026. The goal of this shared task is to evaluate continuous perception and procedural reasoning in clinical skill assessment by requiring systems to reconstruct the correct temporal order of shuffled clinical key frames and generate rationales grounded in clinical workflow knowledge. The benchmark contains 200 test-only instances sampled from clinical skill videos, covering three emergency-care procedures. Each instance is annotated with the ground-truth temporal order and an expert-verified rationale. A total of seven teams participated in the task, collectively making 90 submissions, with four teams providing system description papers. Systems are evaluated using Task Accuracy, Pairwise Accuracy, and BERTScore, which measure exact sequence reconstruction, local temporal consistency, and rationale quality, respectively. In this paper, we describe the task setup, dataset construction, and evaluation criteria. We further summarize the methodologies adopted by participating teams and present a comprehensive analysis of the submitted systems. The official results suggest that current models still struggle with continuous perception and procedural reasoning, especially when they must integrate visual evidence, temporal structure, and clinical workflow knowledge.