Biomedical Natural Language Processing Workshop (2026)


BioNLP 2026

Computational mental health (CMH) classifiers often degrade under distribution shift because human annotators and distant-supervision pipelines reward different linguistic signals. We introduce TSS (Triple-Stream Stress probe), a multi-channel diagnostic framework that decomposes text into (A) lexical character n-grams, (B) a small, mostly content-free morpho-syntactic channel, and (C) a 154-feature psycholinguistic style channel. Across four English datasets (N = 12,906), TSS reveals a lexical interference effect: adding lexical features to the style channel reduces Macro-F1 on human-labeled data (mean drop 0.072, p 10??) but not on auto-labeled data. We propose Degree of Divergence (DoD), a difference-in-differences statistic adapted from econometrics for label-source auditing, with instance-level bootstrap inference; the headline estimate is DoD(BC?A) = 0.0374, 95% CI [0.0097, 0.0651], p = 0.0032. A platform-stratified Twitter-only DoD (which removes the Reddit vs. Twitter contrast) reproduces the pattern with bootstrap inference: DoD??,BC?A = +0.096 (p < 0.001) and DoD??,AC?A = −0.089 (p < 0.001). Interventional masking (pos_only) retains approximately 95–99% of Channel C’s performance after destroying content words on human datasets, indicating that the style channel does not rely primarily on lexical surface form. TSS is positioned as a diagnostic audit framework, not a clinical screening tool: it flags label-source-specific shortcut learning before generalization claims are made.
Large language models achieve strong performance on biomedical question answering and summarization benchmarks, yet traditional evaluation metrics often fail to detect clinically significant factual errors. We introduce a unified evaluation framework that combines reference-based measures with evidence-grounded factuality verification to assess biomedical text generation. Evaluating four open-source models across three benchmarks (BioASQ, PubMedQA, MedLFQA), we find that 13.4–24.7% of generated claims are contradicted and 23–41% are unsupported, despite high lexical overlap scores. Our proposed Fact-Aligned Score (FAS) correlates strongly with claim-level verifiability (rho=0.68), substantially outperforming ROUGE-L (rho=0.41). We release an open-source toolkit with model outputs and analysis scripts to support reproducible factuality evaluation and safer deployment of biomedical LLMs.
Seizure freedom is a key clinical outcome for people with epilepsy (PWE), yet in the United Kingdom it is primarily recorded in free-text notes and letters, making it difficult to aggregate and track at scale. This paper introduces a generative LLM-based pipeline boosted by synthetic data to identify a PWE’s seizure freedom status in clinicians’ records. We fine-tuned seven different LLMs with between 4 and 14 billion parameters using LoRA to compare models trained on synthetic records against those trained on expert-annotated records. The best-performing configuration, based on Qwen-2.5-14B, was trained entirely on synthetic records and used chain-of-thought (CoT) reasoning (both generated by GPT-5). This achieved an F1 score of 0.90±0.02 on double-annotated test data and outperformed the equivalent model trained on authentic clinician records, which achieved 0.87±0.04. The synthetically trained models also output their CoT reasoning for greater decision-making transparency, and they leave the otherwise unused supervised training data available as a substantially larger set of test examples. This work has implications for monitoring a key treatment outcome for PWE automatically and at scale.
This paper studies how to improve biomedical named entity recognition (NER) using large language models (LLMs), especially for low-resource languages like Bangla and Basque. The main goal is to understand how different prompt styles and output formats affect model performance. The study finds that the way we design prompts is very important. Among all methods, question-style prompting works best across all languages. It helps the model understand the biomedical task more clearly and improves accuracy. In fact, improvements are much greater in Bangla and Basque compared to high-resource languages like English and Spanish. Another key finding is about the output format. Traditional BIO tagging (labeling each word) performs poorly with LLMs because it is strict and sensitive to small errors. Instead, span-based extraction (directly extracting text phrases) works much better and gives higher F1 scores. This is because LLMs naturally generate text spans rather than token-level labels. The paper also analyzes errors. Common problems include hallucination, missing entities, and boundary mistakes. Translation-based prompts can reduce hallucination, while question-style prompts reduce empty outputs in biomedical NER. Overall, the study shows that choosing the right prompt and output format is very important, especially for low-resource high-vocabulary languages. It provides useful guidance for building better multilingual medical information extraction systems.
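The contrast between BIO tagging and span-based extraction in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: the function name `bio_to_spans`, the example sentence, and the `DISEASE` label are hypothetical. It shows how a single wrong token-level tag changes the recovered entity, which is one reason strict BIO output is sensitive to small errors.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) spans from token-level BIO tags."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any span that is still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical example sentence and labels.
tokens = ["Patient", "has", "type", "2", "diabetes", "and", "asthma"]
gold = ["O", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE", "O", "B-DISEASE"]
# One tag flipped to "O": the first entity is truncated to "type".
noisy = ["O", "O", "B-DISEASE", "O", "I-DISEASE", "O", "B-DISEASE"]

gold_spans = bio_to_spans(tokens, gold)
noisy_spans = bio_to_spans(tokens, noisy)
print(gold_spans)
print(noisy_spans)
```

A span-based output format asks the model for the phrases directly, so it has no per-token alignment to get wrong.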
Extracting structured knowledge from unstructured text is a fundamental challenge in machine learning, particularly for concepts organized within complex hierarchical ontologies. In genomics, identifying phenotypes from clinical narratives is crucial for diagnostic precision, yet current methods struggle with contextual interpretation and subtle clinical descriptions. We present a hierarchy-aware workflow for ontology-based phenotype linking that combines semantic and hierarchical signals. Our approach integrates Large Language Models for span detection with retrieval and a hybrid reranking strategy using both Euclidean (semantic) and hyperbolic (hierarchical) embeddings trained on the Human Phenotype Ontology. We show that while hyperbolic embeddings alone do not outperform standard semantic retrieval, they provide complementary structural signals that improve performance over strong baselines when combined with Euclidean representations. In particular, the hybrid approach outperforms existing state-of-the-art methods and yields more hierarchically coherent predictions, especially in settings involving implicit phenotype mentions. Experiments on a public benchmark (ID-68) and a new clinical dataset (CHU-50), which we release publicly with code and data, highlight both performance gains and improved alignment with ontology structure. We further introduce a hierarchy-aware evaluation framework that reflects clinical relevance beyond exact-match metrics.
Automated epileptic seizure detection from electroencephalography (EEG) signals is a clinically important task in which feature selection is typically performed using purely statistical criteria. We investigate whether a small instruction-tuned large language model (LLM) can guide iterative feature selection for binary seizure detection on the Epileptic Seizure Recognition dataset (11,500 samples, 178 features). The LLM agent (Qwen2.5-1.5B-Instruct) receives five complementary statistical summaries and selects a feature subset through multi-round reasoning. The agent achieves 96.5% accuracy and 0.911 F1 with 40 features, compared to 97.9% accuracy and 0.946 F1 for the best full-feature baseline (SVM-RBF on 178 features). Critically, 39 of the agent’s 40 features coincide with the top-39 mutual-information features, and a deterministic Top-39 MI filter, evaluated by the same Random Forest classifier, attains the same 96.5% accuracy and 0.911 F1. We therefore present this work as an empirical baseline: at the 1.5B-parameter scale, the LLM behaves close to a univariate MI ranker. We situate the result against the recent LLM-based feature selection literature and enumerate the ablations and multi-dataset extensions required to determine whether larger or domain-specialized LLMs add value beyond statistical filtering.
The MedCPT model has demonstrated that strong biomedical retrievers can be trained using proprietary PubMed search logs. In this work, we study whether freely available citation sentences are sufficient to train similarly effective models. We construct a large-scale training dataset of ~62 million citation sentence-abstract pairs extracted from PubMed Central. We train a lightweight BERT-based retriever-reranker model called CiteRec on this dataset and evaluate it across three benchmark settings: (a) the biomedical subset of BEIR for information retrieval, (b) SciRepEval for generalizable scientific document embeddings, and (c) CitancePlus, a new set of ~90 thousand citation sentence-abstract pairs for PubMed-scale citation recommendation. We show that CiteRec performs competitively with MedCPT on the biomedical BEIR subset and outperforms it on SciRepEval. On CitancePlus, CiteRec achieves strong performance for citation recommendation over the full PubMed corpus, outperforming both MedCPT and a substantially larger Qwen3-Embedding-8B retriever.
Standard clinical Natural Language Processing (NLP) benchmarks often yield inflated metrics by forcing deterministic classification on ambiguous instances, thereby obscuring the clinical risks of overconfident predictions. To bridge this gap, we propose a risk-aware hybrid selective classification framework, evaluated on early Human Immunodeficiency Virus suspicion identification in Spanish clinical notes. Our dual-verification approach explicitly decouples aleatoric uncertainty through Mondrian conformal prediction and epistemic uncertainty using a Multi-Centroid Mahalanobis Distance veto. Empirical evaluations reveal that standard uncertainty metrics and baseline classifiers are structurally insufficient for safe medical triage, suffering severe coverage collapse when forced to operate under strict reliability constraints. In contrast, by demanding that clinical narratives pass both probabilistic and geometric safeguards, the proposed framework successfully isolates a highly trustworthy operational domain. The obtained results show that explicit, decoupled uncertainty quantification is essential for translating biomedical NLP into responsible clinical practice.
SciFact is a widely used benchmark for scientific claim verification (645 citations, included in the BEIR evaluation suite). We present, to our knowledge, the first systematic annotation audit of its development and training sets, combining automated screening with a small language model ($0.11 in API fees) and exhaustive manual verification against source publications. We identify 11 gold-label errors in the development set (5.3%, 95% CI 2.7–9.2%, of 209 audited claim–document pairs) and 13 in the training set (2.3%, 95% CI 1.2–3.9%, of 564 audited pairs). The dev errors exhibit a directional asymmetry (9 of 11 mislabel a claim as SUPPORT; one-sided binomial p=0.033, two-sided p=0.065) and fall into four recurring types. Correcting the dev labels raises binary macro-F1 by 1.7–3.8 points across GPT-5.4 (mini, nano) and Claude Haiku 4.5; gains are larger in 3-way evaluation when mislabeled evidence is recast as NEI (e.g., +9.2 with Haiku 4.5). The binary range is comparable in magnitude to inter-system margins on the SciFact leaderboard. A simple claim-only probe with Haiku 4.5 does not support label memorization as the main explanation for these gains. We release corrected annotations and a blind annotator packet, and recommend that benchmark users prefer the corrected release going forward.
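The directional-asymmetry p-values in the abstract above can be checked with an exact binomial tail computation. The sketch below assumes a null hypothesis in which an error is equally likely to fall in either direction (success probability 0.5); that null and the helper name `binomial_upper_tail` are assumptions, although the resulting values agree with the reported figures.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 9 of the 11 dev-set errors mislabel a claim as SUPPORT.
one_sided = binomial_upper_tail(9, 11)
# With p = 0.5 the distribution is symmetric, so doubling gives the two-sided value.
two_sided = min(1.0, 2 * one_sided)

print(f"one-sided p = {one_sided:.3f}")  # one-sided p = 0.033
print(f"two-sided p = {two_sided:.3f}")  # two-sided p = 0.065
```

The one-sided tail is (55 + 11 + 1) / 2048 = 67/2048, which rounds to 0.033.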
Retrieval strategy selection is a critical but understudied design decision in biomedical RAG systems. Existing evaluations rely on lexical metrics that miss answer grounding, or require proprietary infrastructure that limits reproducibility. We present BioRAG, a head-to-head ablation of seven retrieval strategies on BioASQ-13b, evaluated using four RAGAs metrics with a locally deployed judge at zero monetary cost. Hybrid BM25 plus dense retrieval with Reciprocal Rank Fusion achieves faithfulness of 0.534 and context recall of 0.507, improvements of 50% and 85% over naive dense retrieval, confirmed across three random seed re-samples. HyDE improves faithfulness by 14% but reduces context precision by 52%, a tradeoff not previously documented on BioASQ. No single strategy dominates all four metrics, indicating that strategy selection must be application-driven. Sensitivity analysis across k in {3,5,10} confirms ranking stability. A domain mismatch diagnostic confirms 2% corpus coverage failure. The full pipeline runs on consumer hardware without paid APIs, directly addressing BioNLP 2026’s emphasis on reproducibility and evaluation frameworks for health-related applications.
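For readers unfamiliar with Reciprocal Rank Fusion, the following is a minimal sketch of how two ranked lists (for example, from BM25 and a dense retriever) can be merged. It is not the BioRAG implementation: the function name, the document identifiers, and the smoothing constant `rrf_k = 60` (a commonly used default, unrelated to the retrieval depth k in the abstract) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], rrf_k: int = 60) -> List[str]:
    """Fuse ranked lists: each document scores the sum of 1 / (rrf_k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Sort by descending fused score; break ties by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from two retrievers.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Because only ranks are used, the two retrievers' raw scores never need to be placed on a common scale.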
De-identification systems prioritize recall to protect privacy, but excessive over-tagging reduces data utility. We propose an agentic refiner that reviews high-recall annotations using lightweight tools (validation functions, adaptive context retrieval, persistent to-do state, and modular review skills) to improve precision while minimizing recall loss. Experiments across three multilingual datasets show that the agent achieves significant improvements to binary precision. To support fine-grained analysis, we further introduce a synthetic error dataset of common and systemic failure modes, on which the agent corrects 99% of injected errors in the medical datasets. Our results suggest that agent-based refinement provides a flexible and effective mechanism for improving de-identification precision as a modular extension to existing high-recall systems.
We investigate whether explicit syntactic features improve transformer-based biomedical relation extraction when added to typed entity marker pooling. We evaluate two augmentation strategies on top of BiomedBERT: (1) verb token augmentation, which concatenates the hidden state of the dependency root verb to the entity representations, and (2) a two-layer graph convolutional network (GCN) that refines encoder hidden states over the dependency parse before entity pooling. We experimented on three biomedical datasets: ChemProt, DDI, and AIMed with three random seeds. We found neither strategy consistently outperformed the entity-only baseline. The GCN yielded modest gains on AIMed (+0.007 F1) and ChemProt (+0.003 F1) but decreased performance on DDI (-0.013 F1). Verb token augmentation helps only on AIMed (+0.004 F1) and underperforms on the other two datasets. A syntactic characterization of the datasets reveals that DDI has substantially higher passive voice usage (50.7% of relation-bearing sentences) than AIMed (27.0%) or ChemProt (30.9%), suggesting that syntactic augmentation is more effective when sentences exhibit active verbal structure with semantically informative predicates. These results suggest that corpus-level syntactic characteristics, particularly passive voice usage, may moderate the utility of explicit syntactic augmentation, though the small magnitude of observed differences warrants caution in interpretation.
Drug repurposing methods rely heavily on knowledge graph (KG) embeddings, but building and curating these graphs takes considerable effort. We present two findings on the Hetionet drug-disease benchmark and an epilepsy ranking task. First, PubMedBERT text embeddings, fed through the same downstream classifiers and identical 10-fold splits as four re-trained KG baselines (TransE, ComplEx, DistMult, RotatE), reach AUROC 0.910, above all four (best: RotatE, 0.854); a Random Forest on the same vectors scores 0.880. The comparison is asymmetric in one important way: PubMedBERT was pretrained on the literature Hetionet was curated from, so the result is best read as “text-with-literature-supervision vs. graph-only,” and a head-to-head with text-augmented KG methods (KG-BERT, TxGNN) is left as follow-up. Second, across all seven combinations of text, molecular (ECFP4), and gene expression (LINCS L1000) features, cross-attention fusion of weaker modalities into text consistently degrades performance, despite a gated mechanism intended to suppress unhelpful modalities; the residual path forces the strong modality to absorb noise. The model also ranks proconvulsants (amoxapine, flumazenil) near the top, because text embeddings encode strength of association with a disease but not its direction.
Clinical NLP systems are increasingly used for triage support, prediction, and decision assistance in EHR-based settings, where demographic fairness is a critical concern. A common evaluation approach is counterfactual demographic perturbation: modifying attributes such as age or sex while holding clinical evidence fixed and measuring output changes. However, we show that such counterfactual audits can be misleading when interpreted in isolation. Across three clinical LLMs, we find that non-demographic control perturbations (e.g., paraphrases) often induce output variability comparable to or greater than demographic edits. This can contribute to overestimation or misinterpretation of demographic bias. To address this, we propose a baseline-aware audit framework that explicitly compares demographic perturbations against control baselines. Our analysis reveals that (i) label-level stability can mask substantial variation in generated rationales and recommendations, and (ii) age-based perturbations generally induce larger effects than sex-based ones in borderline cases. Crucially, we identify a high intrinsic instability ("noise floor"; 0.46–0.71 Jaccard instability) in clinical LLM generations, while additional matched-metric analyses show that demographic perturbations are often comparable to non-demographic baseline variability. These findings highlight a key limitation of existing fairness evaluations: without establishing appropriate baselines, apparent demographic sensitivity may be over- or mis-attributed to bias rather than broader generative instability. We argue that baseline-aware counterfactual audits, which explicitly compare demographic effects against intrinsic model noise, provide a more reliable lens for evaluating clinical NLP systems in high-stakes settings.
Text embeddings have become an essential part of a variety of language applications. However, methods for interpreting, exploring and reversing embedding spaces are limited, reducing transparency and precluding potentially valuable generative use cases. In this work, we develop an open-source, domain-agnostic framework for aligning Large Language Models to embedding spaces using the recently reported Embedding Language Model (ELM) method. We demonstrate our framework by training models to recover, summarize, and compare clinical trial abstracts from embeddings alone. In addition to inverting embeddings back to text more reliably than existing methods, our models can decode novel, interpolated embeddings into new clinical trial abstracts that human experts cannot distinguish from real ones. We further show that these generated abstracts are responsive to moving embeddings along concept vectors for age and sex of study subjects. Our public ELM implementation and experimental results will aid the alignment of Large Language Models to embedding spaces in the biomedical domain and beyond.
Clinical decision support systems that operate across multiple downstream care pathways must first determine which pathway or pathways are relevant for a given patient. We study this routing problem in gastrointestinal surveillance, where paired endoscopy and histopathology text reports may indicate multiple concurrent conditions and therefore require multi-label routing. In this context, standard hard-label evaluation can be insufficient: a model may achieve reasonable overall performance while still excluding clinically important pathways when uncertain. We formulate gastrointestinal report routing as a multi-label uncertainty-aware classification task over six pathway labels and compare lightweight lexical baselines, frozen embedding models and a fine-tuned transformer baseline under two complementary uncertainty mechanisms: threshold-based abstention and set-valued conformal prediction. Using 1,773 paired reports from a single NHS trust with disjoint train, calibration and test splits, we evaluate both hard-routing performance and the downstream review burden introduced by uncertainty-aware prediction. The fine-tuned ClinicalBERT model achieved the strongest overall performance (0.811 subset accuracy, 0.861 macro-F1) and the lowest AURC of 0.084 under min-margin abstention. Threshold-based abstention consistently reduced exact-match routing error on accepted reports. For conformal routing at α=0.10, Mondrian calibration achieved high mean positive-label recall coverage across learned baselines (0.883–0.917). The fine-tuned model achieved 0.891 mean recall coverage with a mean prediction set size of 1.70, 0.642 candidate-label precision and 0.61 false-positive labels per report. Compared with a recall-tuned threshold baseline at similar recall, Mondrian CP produced smaller candidate sets, higher candidate-label precision and fewer false-positive pathway suggestions.
These results show that uncertainty-aware evaluation exposes clinically important failure modes missed by aggregate metrics. They also show that high-recall routing is not cost-free: set-valued prediction can reduce missed-pathway risk but must be interpreted as candidate generation for downstream review rather than automated pathway selection.
MedCAT is an open-source framework for clinical named entity recognition and linking (NER+L) widely used in research and healthcare settings. We present MedCAT v2, a re-engineered version designed to improve modularity, extensibility, and maintainability while preserving the core functionality and performance of previous releases. The new architecture introduces a registry-based component system and a flexible pipeline that enables easy substitution of components, integration of alternative methods, and future expansion, including support for pre-trained components across the full NER+L and contextualisation workflow. This enables systematic exploration of clinical NER+L design trade-offs by evaluating different components in the pipeline. Evaluation across multiple public datasets shows equivalent or improved performance compared to earlier versions, with reduced integration overhead and improved runtime flexibility. The framework also supports optional extensions such as meta-annotation and relation extraction, providing a unified and reproducible environment for clinical NLP in real-world settings.
Because specialized domains have unique concepts, syntactic structure, and vocabulary, it is common to train specialized language models (LMs) for a target domain. For example, BioClinicalBERT is a specialized LM designed for clinical applications. These specialized LMs are typically created by starting with a foundation model (such as BERT-base) that has been pretrained on general-domain English and then adapting it to the target domain via additional pretraining. Alternatively, LMs may be pretrained from scratch on data from the target domain. Both techniques are extremely computationally expensive, and as such these specialized LMs are often publicly released for other researchers. For some domains, such as the biomedical domain, many similar models are available, which raises a question for developers: which pretrained LM should I choose? In novel domains for which no specialized LMs exist, different questions arise: Is it worth the cost to pretrain an LM from scratch? Should I adapt a general English model instead? Should I just use a general English model without adaptive pretraining? These questions are particularly salient under a limited budget, i.e., should I pay for compute time or for annotators to create a larger dataset? In this paper we compare results of nine LMs across nine datasets spanning the clinical, scientific, and biomedical-related social media domains. From these comparisons we draw several conclusions that can simplify the hyperparameter-tuning process and inform researchers and developers in novel domains. Broadly, the effects of adaptive fine-tuning are small. If an adapted model exists in your domain, choose the one most closely related to your task. If no model exists, using a foundation model is likely sufficient.
The development of large language models (LLMs) has led to increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.
Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large-scale annotation for medical imaging research. Existing rule-based labelers struggle with the diverse descriptions in clinical reports, while fine-tuning pre-trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge-enhanced multi-label prompt-tuning approach for radiology report labeling under low-resource settings. PromptRad reformulates multi-label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi-word verbalizer to enrich category representations. By fine-tuning the PLM without additional classification layers, PromptRad requires substantially less labeled data than conventional fine-tuning. Experiments on liver CT (computed tomography) reports show that PromptRad outperforms dictionary-based and fine-tuning baselines with only 32 labeled training examples, and achieves competitive performance with GPT-4 despite using a much smaller model. Further analysis demonstrates that PromptRad captures complex negation patterns more effectively than existing methods, making it a promising solution for report labeling in data-scarce clinical scenarios. Our code is available at https://github.com/ila-lab/PromptRad.
This paper introduces LEA-Dialog, a multi-turn diagnostic dialogue dataset for lower-extremity arteriovenous diseases, together with a carefully developed diagnostic handbook and a process-aligned agentic framework for structured outpatient diagnosis. The dataset provides stage annotations for each turn and guideline-grounded probability trends, enabling evaluation beyond final diagnostic accuracy. Experiments show that the framework improves reasoning stability and reduces drift across both online and offline LLMs, with particularly large gains for smaller offline models.
Large language models (LLMs) demonstrate strong general language capabilities but remain limited in chemical reasoning, particularly for tasks requiring structured, mechanistic understanding of molecular reactions. We present Knowledge Graph Reaction LLM (KGRxn-LLM), a framework that augments LLMs with a hierarchical chemical knowledge graph (KG) to ground reasoning in molecular transformations and reaction patterns. Existing benchmarks primarily emphasize reaction or molecular fact recall, providing limited assessment of reaction-level mechanistic reasoning. To address this gap, we introduce KGRxn-Bench, a benchmark of 1,200 questions designed to evaluate LLMs on reaction-centric reasoning tasks, including functional group identification, reaction type classification, and product and reagent prediction. Experimental results show that our approach of grounding LLMs in structured KG substantially improves performance across multiple tasks and model backbones, outperforming domain-specific fine-tuned models on KG-covered splits and most hold-out splits.
The global transition to the ICD-11 taxonomy demands robust automated medical coding, yet comprehensive benchmarks to evaluate Large Language Models (LLMs) on this task remain absent. We introduce MAX-EVAL-11, the first large-scale benchmark for full-spectrum ICD-11 medical coding. MAX-EVAL-11 comprises 10,000 MIMIC-III discharge summaries with mapped, expert-validated ICD-11 annotations spanning 99.87% of the diagnostic taxonomy. To better reflect clinical utility, we propose a novel hierarchical evaluation framework that assigns partial credit based on ICD-11’s 5-level structure, addressing the brittleness of traditional exact-match metrics. Our evaluation of state-of-the-art LLMs reveals significant performance gaps. The best-performing model (Claude 4 Sonnet) achieves a weighted score of 0.433, outperforming both general-purpose peers and specialized medical models (MedCoder). Crucially, all models exhibit near-zero exact match rates (0–4.8%) and rely primarily on hierarchical credit, underscoring the extreme difficulty of precise ICD-11 code generation. Furthermore, the superiority of general-purpose LLMs over legacy ICD-10 medical models (with ICD-11 codelist) suggests that broad reasoning capabilities currently outweigh domain-specific training for complex taxonomy scaling.
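The idea of hierarchical partial credit mentioned in the abstract above can be sketched as follows. The abstract does not specify the weighting, so this example makes a loud assumption: each code is represented as its path of ancestors from the top level down, and credit is the fraction of the gold path matched before the first disagreement. The function name and the placeholder level labels are hypothetical and are not real ICD-11 codes.

```python
from typing import Sequence


def hierarchical_credit(gold_path: Sequence[str], pred_path: Sequence[str]) -> float:
    """Fraction of the gold path matched by the prediction, top level first.

    Assumption: uniform weight per level. The benchmark's actual weights
    are not given in the abstract and may differ.
    """
    if not gold_path:
        raise ValueError("gold_path must contain at least one level")
    shared = 0
    for gold_level, pred_level in zip(gold_path, pred_path):
        if gold_level != pred_level:
            break
        shared += 1
    return shared / len(gold_path)


# Placeholder five-level paths (hypothetical labels, not ICD-11 content).
gold = ("chapter-A", "block-A1", "category-A1a", "sub-A1a1", "leaf-A1a1x")
near_miss = ("chapter-A", "block-A1", "category-A1a", "sub-A1a2", "leaf-A1a2y")
wrong_chapter = ("chapter-B", "block-B1", "category-B1a", "sub-B1a1", "leaf-B1a1x")

print(hierarchical_credit(gold, gold))           # 1.0
print(hierarchical_credit(gold, near_miss))      # 0.6
print(hierarchical_credit(gold, wrong_chapter))  # 0.0
```

Under exact match the near miss would score zero, which is the brittleness that a hierarchical scheme is meant to soften.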
Reliable extraction of structured information from radiology reports using Large Language Models (LLMs) remains a significant challenge, particularly for complex, non-English texts such as Hebrew. This study proposes an agent-based, uncertainty-aware framework to enhance the reliability and interpretability of LLM predictions in clinical contexts. A total of 9,683 Hebrew radiology reports from Crohn’s disease patients (2010–2023) across three medical centers were analyzed. Of these, 512 reports were manually annotated for six gastrointestinal organs and 15 pathological findings, while the remainder were automatically labeled using HSMP-BERT. Structured data extraction was performed with Llama 3.1 (Llama 3-8b-instruct) employing Bayesian Prompt Ensembles (BayesPE), which utilized six semantically equivalent prompts to quantify uncertainty. An Agent-Based Decision Model aggregated prompt outputs into five calibrated confidence levels and was benchmarked against three entropy-based approaches. Model performance was assessed using accuracy, F1 score, precision, recall, and Cohen’s Kappa before and after filtering high-uncertainty cases. The agent-based model outperformed all baselines, achieving an F1 score of 0.3967, recall of 0.6437, and Kappa of 0.3006; after excluding cases with uncertainty = 0.5, the F1 score increased to 0.4787 and Kappa to 0.4258. The proposed framework improves uncertainty calibration and predictive reliability, advancing the safe deployment of LLMs in medical data extraction.
NER requires token-level classification using both left and right context, which makes encoder-only models like BERT naturally well-suited for the task. Decoder-only LLMs, by contrast, use causal masking during training, so their token representations lack right-side context, limiting their effectiveness on structured prediction tasks like NER despite their strong general capabilities. To address this, the authors propose fine-tuning decoder-only LLMs with causal attention replaced by full attention, combined with label-supervised discriminative training. While similar ideas exist in prior work, those studies were limited in scope. This work evaluates seven LLMs across four model families (Gemma, Qwen2.5, Llama3.1, Llama3.2) and compares full fine-tuning against LoRA. Results show that the proposed approach with an appropriate LoRA configuration outperforms encoder baselines (BERT, RoBERTa, DeBERTa), and achieves strong NER performance without auxiliary data or architectural modifications, though it does not reach SOTA on BC5CDR and CoNLL2003.
Nutrition misinformation on social media often arises from selective interpretation of scientific evidence rather than outright falsehoods, making it difficult to detect. We introduce a curated, expert-annotated Instagram dataset focused on seed oils and omega-6, two domains characterized by contested dietary claims. We evaluate feature-based, embedding-based, and transformer-based models under in-domain and cross-domain settings. Results show strong in-domain performance across all models, with Sentence-BERT achieving the highest AUPRC (up to 0.96). However, performance drops substantially under cross-domain transfer, indicating limited robustness to topic shift. Analysis suggests that while contextual embeddings capture strong in-domain semantic signals, linguistically and psychologically grounded features are more stable under distribution shift. These findings highlight the value of combining semantic and interpretable linguistic signals for robust misinformation detection.
Standard coherence metrics for biomedical topic models encode no clinical knowledge and cannot detect clinically implausible topic groupings. We propose SNOMED CT Wu–Palmer hierarchy distance as a post hoc, ontology-grounded diagnostic. On vascular surgery (47,318 articles) and craniofacial surgery (27,493 articles) corpora, the metric flags clinically heterogeneous topics that coherence misses, e.g., abdominal aortic aneurysm repair grouped with deep vein thrombosis (d = 0.600). Diagnostic signals are nearly identical across eight BERTopic embedding strategies including ontology-enhanced models, but diverge across model families: BERTopic alone produces a positive within- vs. cross-topic Cohen’s d, while LDA, NMF, and Top2Vec at matched topic counts score below their own cross-topic baselines (Cohen’s d < 0; Mann–Whitney p > 0.99). The score is therefore sensitive to topic-model output choice, not only to embedding choice within a single pipeline. A pre-clustering screening experiment finds near-zero correlation (|ρ| < 0.08) between embedding cosine and SNOMED CT similarity, arguing that ontological validation belongs after clustering rather than as an embedding screen. We additionally describe a two-stage UMLS-CUI stopword filter that preserves high-frequency domain-specific concepts which naive frequency filtering would discard. After one-time concept curation, the diagnostic itself is automated and requires no per-topic expert scoring.
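For readers unfamiliar with the Wu–Palmer measure, the following sketch computes it on a small invented hierarchy. The tree below is a toy with a single parent per node and is not SNOMED CT (which allows multiple parents); the concept names, the root-depth-of-1 convention, and the definition of distance as one minus similarity are assumptions for illustration, so the result does not reproduce the value reported in the abstract.

```python
# Toy is-a hierarchy: child -> parent (None marks the root). Invented for illustration.
PARENT = {
    "disorder": None,
    "vascular disorder": "disorder",
    "aneurysm": "vascular disorder",
    "thrombosis": "vascular disorder",
    "abdominal aortic aneurysm": "aneurysm",
    "deep vein thrombosis": "thrombosis",
}

def path_to_root(node):
    """Return the chain of concepts from `node` up to the root (node first)."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def wu_palmer_distance(a, b):
    """1 - Wu-Palmer similarity, where similarity = 2*depth(LCS) / (depth(a) + depth(b))."""
    ancestors_of_b = set(path_to_root(b))
    # The first ancestor of `a` that is also an ancestor of `b` is the deepest shared one.
    lcs = next(n for n in path_to_root(a) if n in ancestors_of_b)
    similarity = 2 * depth(lcs) / (depth(a) + depth(b))
    return 1 - similarity

toy_distance = wu_palmer_distance("abdominal aortic aneurysm", "deep vein thrombosis")
```

A topic-level score could then be formed by averaging such pairwise distances over the concepts assigned to a topic, though the aggregation used in the paper is not specified here.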
Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect – such as similarity or utility comparisons – even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for tasks like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes – despite their task-agnostic nature – can effectively augment task-specific training for rare ICD codes.
Clinical coding maps clinical documentation to standardized medical codes, an essential yet time-consuming administrative task that could benefit from automation. Current models on ICD coding are typically optimized for codes from a specific ICD version. However, in reality, ICD systems evolve continuously, and different versions are adopted across time periods and regions. Moreover, ICD coding suffers from the long-tail problem, and rare code performance can be a bottleneck for developing implementable models. We examine whether it is viable to train version-independent models by combining data annotated in different ICD versions, which may help address these challenges. We add ICD-9 data to the training of a modified label-wise attention model for ICD-10 prediction, and find that despite the version mismatch, adding ICD-9 yields a 27% increase in micro F1 for 18K rare ICD codes compared to training on ICD-10 alone. On 8K frequent ICD-10 codes, the multi-version training also substantially improves macro metrics, with far fewer model parameters.
The advent of single-cell RNA sequencing has enabled unprecedented resolution of cell fate decisions and regulatory mechanisms during peri-implantation human embryogenesis, in which accurate cell type annotation is a fundamental prerequisite and the first step for subsequent fate and mechanism inference. Large language models (LLMs) have demonstrated outstanding performance in various fields. However, current studies mostly rely on traditional methods and have not explored the application of LLMs to human embryonic cell annotation, mainly because instruction tuning datasets and evaluation benchmarks are lacking. In this paper, we propose EmCellLLM, the first open-source LLMs specialized for the human embryonic cell type prediction task, obtained by fine-tuning Qwen3-8B on EmCell4Instruction, the first embryonic cell type prediction instruction dataset. To support LLM instruction tuning, we also build EmCellBench, the first benchmark for evaluating the human embryonic cell type prediction ability of LLMs. We compare our models with a variety of LLMs on EmCellBench, where our model outperforms all other open-source LLMs as well as DeepSeek.
Large Language Models (LLMs) are no longer mere laboratory objects of study. LLMs have become everyday tools in society across diverse populations and domains. In clinical contexts, LLMs have already been devised as clinical support applications. However, along with benefits, negative or adverse effects might arise, such as LLMs potentially providing psychologically distressing advice to adolescents when used for mental health support. This raises questions on the benefits of LLMs and calls for real-world evaluations: Are LLMs really helpful and effective for the intended purposes people are using them or will use them for? To answer this type of question we propose to use Randomized Controlled Trials (RCTs). RCTs are considered the most strict experimental design in the fields of Medicine, Psychiatry, Psychology, among others; however, the use of RCTs in the NLP field is almost negligible. In spite of the NLP field being the de facto locus of research on LLMs, other fields, prominently Medicine, are leading the RCT evaluations on LLMs. In this primer paper, we present a concise introduction to the principles of RCTs to guide NLP researchers to design RCT studies for evaluating LLMs.
The biomedical literature contains rich structured knowledge, including citation links that encode relationships between scientific studies, but such information is typically ignored in standard language model pre-training. We propose a citation-aware continual pre-training method for decoder-only language models that incorporates citation graph information from PubMed into next-token prediction by placing citation-linked abstract pairs within a shared context. We evaluate our method on multiple biomedical QA benchmarks using two model families. Results show that citation-aware continual pre-training achieves higher average accuracy than both the original base models and citation-unaware pre-training across biomedical tasks.
While humans can easily produce various types of answers, such as definitions, examples or paraphrases, Large Language Models (LLMs) struggle to provide correct answers to medical questions that require diverse answer formats. In this paper, we introduce TrackList, a fine-grained linguistic and statistical analysis pipeline to investigate the impact of the pre-training data on LLMs’ answers to diverse linguistic queries. We also propose RefoMed-EN, a medical dataset consisting of 6,170 human-annotated medical terms alongside their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We investigated whether the high or low frequency of a concept (head or tail knowledge) impacts the language model’s performance in answering medical questions. We evaluated the quality of the LLM’s output using syntactic and semantic similarity metrics, statistical correlations and embeddings. Results showed that the LLM’s answer quality is highest for definition-type questions and lowest for exemplification-type questions. Additionally, we showed that for definition-type medical questions ("What is multiple sclerosis?"), LLMs are prone to paraphrase more for popular medical concepts, and less for more specialized medical knowledge.
Discharge instructions are patient-facing, safety-critical documents that guide medication use, follow-up care, and recovery after hospitalization. Because they must synthesize information across the clinical record and often include post-discharge guidance not stated verbatim in the EHR, they are a difficult target for clinical text generation. In this work, we study discharge instructions in MIMIC-IV through a grounding-first lens. Using two LLMs, we decompose each discharge instruction into medically relevant statements and verify them against the Electronic Health Record (EHR). We find that discharge instructions for Surgical admissions are much longer, averaging roughly 24–25 statements per admission versus 11–12 in Non-Surgical cases, while supported content remains similar in absolute count. The additional Surgical content is dominated by statements that are not directly stated in the record or require clinically plausible extrapolation. Through this analysis, we advocate for better grounding and completeness evaluations at a fine-grained level, establishing a foundational step toward safer and more reliable discharge-instruction generation.
Prion diseases are rare, rapidly progressive, and fatal neurodegenerative disorders that remain difficult to diagnose, particularly in their early stages because of nonspecific clinical presentations. However, to our knowledge, there is no publicly available prion-disease-focused dataset designed to capture a broad range of clinically relevant entities from the biomedical literature. We introduce PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information in PubMed abstracts. The current release comprises 317 abstracts, 2,943 sentences, and 6,955 text-bound entity annotations spanning 15 coarse-grained and 31 fine-grained clinically oriented entity types covering diseases, symptoms, diagnostics, findings, anatomy, treatments, and temporal and statistical evidence. Inter-annotator agreement reaches 81.78 exact-match F1, indicating strong annotation consistency. We benchmark supervised BERT baselines, W2NER, and zero-shot extractors on PrionNER. W2NER is the strongest supervised model, and Gemma-4-31B is the strongest zero-shot model, but the benchmark remains challenging, especially for structurally complex mentions and fine-grained clinically adjacent label distinctions. PrionNER provides a clinically grounded benchmark for prion-disease information extraction and supports research on rare-disease biomedical NLP under low-resource, fine-grained, and non-flat extraction conditions. The dataset, annotation guidelines, and evaluation scripts are available at https://github.com/daotuanan/PrionNER/
Individuals with particular mental health disorders may find it difficult to learn about their own condition. Therefore, efforts have been made to create materials that explain complex medical information in simpler words, which are also beneficial for caregivers and others. However, text simplification is commonly done in English and only sporadically in other languages. In this study, we explore potential ways for language-agnostic medical text simplification for the mental health domain. Our approach is to simplify the ICD-11 articles on primary psychotic disorders in English, German and French, using small LMs and various metrics for evaluating different aspects of the texts: lexical complexity and readability. Our results show that acceptable texts were produced only in English, and that a joint analysis of Measure of Textual Lexical Diversity (MTLD) and Flesch Reading Ease (FRE) provides the most insight, capturing both the best outcomes and signaling different types of issue. The study is preliminary and requires further investigation.
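For reference, the standard English Flesch Reading Ease formula can be written as a small function. This is a sketch only: the counts are passed in directly because syllable counting is language-specific and heuristic, and the coefficients below are the ones defined for English, so whether the study used language-adapted variants for German and French is not stated in the abstract.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease with the standard English coefficients.

    Higher scores indicate easier text. Counts are supplied by the caller.
    """
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Toy counts (hypothetical): 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
```

Because the score depends only on sentence and word length, it says nothing about vocabulary repetition, which is one reason to read it jointly with a lexical diversity measure such as MTLD.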
The rapid growth of biomedical literature presents a major challenge for organizing knowledge and identifying emerging research trends. While PubMed provides effective access to relevant articles, it does not support understanding the conceptual structure of document collections. Existing tools rely on predefined features, ontologies, or parameter-sensitive clustering methods, limiting their ability to uncover fine-grained, data-driven topics in a reproducible manner. We present BioTopicXplor, an on-demand web server for interactive exploration of biomedical literature derived from arbitrary PubMed queries. The system integrates ConvexTopics, a convex optimization–based topic modeling framework that guarantees convergence to a global optimum and eliminates the need for predefined parameters. This enables the generation of reproducible and fine-grained topic structures across large document collections. Given a PubMed query, BioTopicXplor retrieves relevant articles, performs topic discovery, and organizes the resulting subtopics into a hierarchical structure of higher-level themes. To enhance interpretability, the system incorporates large language models to generate concise, literature-grounded summaries and descriptive titles for each topic, with links to supporting evidence. We demonstrate the utility of BioTopicXplor through a case study on anti-aging research, where the system reveals meaningful thematic structures and supports knowledge discovery.
Patient portal messages often embed clinical questions inside long, emotionally nuanced narratives, requiring clinicians to infer the underlying information need. We study the task of rewriting verbose patient-authored narratives into concise, clinician-interpreted questions framed as if querying an electronic health record (EHR) system. We evaluate a lightweight LLM-based rewrite pipeline that constrains outputs to 10-15 words and uses rule-based validation with regeneration. We test the approach on 140 distinct patient questions drawn from the ArchEHR-QA dataset and shared task. Each system output is double-annotated by two annotators for quality (Good/Ok/Bad) and error types (Generic, Malformed, Tangential, Hallucination). Results show that while models follow output constraints, they often produce overly generic or tangential questions, and occasional hallucinations introduce unsupported clinical details. Across both clinician-question and patient-narrative comparison settings, automatic metrics show substantial overlap across human quality labels; in pairwise meta-evaluation, BERTScore is the strongest proxy for human preferences. We release our code and annotations to support future work.
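The constrain-validate-regenerate loop described above might look roughly like the sketch below. The 10-15 word bound comes from the abstract; everything else is an assumption for illustration, including the function names, the extra rule that a question must end with a question mark, the retry budget, and the `fake_generate` stand-in that replaces a real LLM call.

```python
from typing import Callable, Optional

def is_valid_question(text: str, min_words: int = 10, max_words: int = 15) -> bool:
    """Rule-based check: word count within bounds and (assumed rule) ends with '?'."""
    n_words = len(text.split())
    return min_words <= n_words <= max_words and text.rstrip().endswith("?")

def rewrite_with_regeneration(
    narrative: str,
    generate: Callable[[str, int], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Call the generator until a candidate passes validation or attempts run out."""
    for attempt in range(max_attempts):
        candidate = generate(narrative, attempt)
        if is_valid_question(candidate):
            return candidate
    return None  # caller decides how to handle a failed rewrite

# Hypothetical stand-in for an LLM: the first draft violates the length rule.
DRAFTS = [
    "Too short?",
    "What dose of metoprolol was the patient given after the cardiac procedure?",
]

def fake_generate(narrative: str, attempt: int) -> str:
    return DRAFTS[min(attempt, len(DRAFTS) - 1)]

result = rewrite_with_regeneration("long patient narrative ...", fake_generate)
```

Note that checks of this kind only enforce form; as the abstract reports, outputs can satisfy the constraints and still be generic, tangential, or hallucinated.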
Clinical documentation is essential for patient care, billing, and medical research, but it is subject to entrenched bias. While manual chart reviews can identify such bias, they are labor-intensive and expert-dependent. We introduce and evaluate StigMAD, a Multi-Agent Debate framework leveraging open-source Large Language Models (LLMs) to detect stigmatizing language in clinical documentation. We investigate reasoning (multi-agent debate), self-reflection, and self-consistency within this framework. Extensive experiments on clinical notes and patient summaries demonstrate that our framework provides significant advantages over rule-based and supervised baselines. A domain-specific LLM (MedGemma) achieved its highest performance using the StigMAD reasoning framework, while a general-purpose LLM (Llama) showed superior results with the self-consistency framework. These findings suggest that open-source LLMs, steered by structured prompting and reflective reasoning, can effectively support the scalable auditing of stigmatizing language, marking a critical step toward more equitable clinical NLP systems.
Accurate labeling of relevance between biomedical abstracts is essential for improving information retrieval, semantic similarity modeling, training of ranking systems and other Natural Language Processing tasks. However, manual annotations are time-consuming, labor-intensive and costly. Studies show that large language models (LLMs) can facilitate automated annotation, but their performance still falls short of human expert-level accuracy, especially in domain-specific tasks. It has been shown that combining annotations from multiple non-expert annotators can achieve performance comparable to, or even exceeding, that of trained experts. Based on this evidence, we treat AI-generated annotations as contributions from non-expert annotators and combine them using a Learning to Rank framework. Our results show significant improvement in overall annotation quality. The proposed method shows promise for reducing reliance on human annotation while maintaining reliable performance for large-scale biomedical applications.
Recent high-complexity agentic systems such as DeepRare perform strongly on rare disease diagnosis benchmarks, but it remains unclear when gains come from structured knowledge access and when they come from parametric LLM knowledge. We compare phenotype-based retrieval, LLM reranking, and unrestricted LLM diagnosis across seven benchmarks covering 10,382 cases. We find a clear performance crossover driven by retrieval coverage (the fraction of cases whose true diagnosis is within the retriever’s top-50): on high-coverage datasets, ontology-based retrieval dominates; on low-coverage datasets, open-ended LLM diagnosis takes the lead. Building on this, adding an LLM reranker over retrieved candidates further improves accuracy across our patient-case benchmarks, closing most of the remaining gap to agentic systems (within 2 pp on MME and LIRICAL). We trace the crossover to two structural failure modes of ontology-based retrieval (annotation sparsity and phenotypic homogeneity) and show that aggregate scores across mixed benchmarks can hide these qualitatively different diagnostic settings. These findings motivate per-dataset evaluation and hybrid diagnostic systems that combine retrieval, reranking, and parametric LLM generation based on case characteristics.
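Retrieval coverage as defined above, the fraction of cases whose true diagnosis appears in the retriever's top-k candidates, can be computed as in the following sketch. The function name and the toy identifiers are hypothetical; only the top-50 default follows the abstract.

```python
def retrieval_coverage(ranked_candidates, true_diagnoses, k=50):
    """Fraction of cases whose true diagnosis is among the top-k retrieved candidates.

    `ranked_candidates[i]` is the ranked candidate list for case i and
    `true_diagnoses[i]` is that case's gold diagnosis identifier.
    """
    if len(ranked_candidates) != len(true_diagnoses):
        raise ValueError("one candidate list is required per case")
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_candidates, true_diagnoses)
    )
    return hits / len(true_diagnoses)

# Toy data with invented identifiers: the second case's diagnosis is never retrieved.
ranked = [["d1", "d2"], ["d3", "d4"], ["d5"]]
truths = ["d2", "d9", "d5"]
coverage_at_50 = retrieval_coverage(ranked, truths)
coverage_at_1 = retrieval_coverage(ranked, truths, k=1)
```

Coverage is an upper bound on what any reranker over the retrieved candidates can achieve, which is why it is a natural variable for explaining when retrieval-based pipelines lose to open-ended generation.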
Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs’ performance with four prompting experiments that vary in their use of local, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries.
Extracting structured cancer registry information from pathology and medical reports is challenging due to heterogeneous reporting styles and implicit clinical reasoning. We propose a modular multi-agent framework that decomposes registry abstraction into semantic chunking, retrieval, field-specific extraction, validation, evaluation, and aggregation stages. The dataset includes 818 annotated cancer cases from Sultan Qaboos University Hospital. Evaluation in this study focuses on breast (n=454) and colorectal (n=174) reports across grade, morphology, TNM staging, and laterality extraction tasks. The framework is compared against prompt-based LLaMA 3.3 baselines using accuracy and weighted/macro F1-score metrics. The proposed framework improved performance in context-dependent tasks, particularly grade extraction, where weighted F1-score increased from 0.71 to 0.78 for breast cancer and from 0.56 to 0.67 for colorectal cancer. Improvements were also observed for colorectal laterality extraction. For other extraction tasks, particularly highly structured tasks such as TNM staging and morphology extraction, the multi-agent framework achieved performance comparable to direct prompting. Although the baseline achieved slightly higher average weighted F1-scores overall, the proposed framework provides improved modularity, traceability, and pipeline-level interpretability through explicit intermediate reasoning stages, supporting error analysis and future clinician-guided refinement.
Resolving contradictions in biomedical literature requires more than factual recall; it demands identifying the hidden variables that explain divergent findings. Existing NLI benchmarks such as MedNLI operate at the sentence level and fail to capture document-level conflicts driven by differences in dosage, cell type, or study design. We introduce BioConflict, a benchmark of 250 expert-annotated paper pairs (500 abstracts) across ten biomedical topics, formalizing three tasks: conflict detection, contextual variable extraction, and consensus synthesis. We evaluate five general-purpose large language models and two domain-specific baselines, finding that general-purpose large language models achieve strong conflict detection (F1 up to 0.89) but exhibit brittle reasoning in synthesis, while domain-specific models lag significantly on all generative tasks. These findings highlight the need for context-aware biomedical AI capable of resolving, not merely retrieving, conflicting scientific evidence.
We investigate how tokenization granularity affects the representation of medical terminology in language models. Prior work links tokenization granularity to downstream performance under contextualized settings for specifically pretrained and fine-tuned models. We instead ask whether this relationship already emerges at the level of isolated term representations across existing pretrained models. We introduce an intrinsic definition retrieval task using UMLS term-definition pairs, with comparison to WordNet. We show that despite substantially heavier fragmentation of medical terminology, the models remain relatively robust in maintaining semantic alignment between medical terms and their definitions. At the same time, tokenization granularity still correlates with retrieval performance, indicating that effects previously observed in downstream biomedical tasks are already reflected at the level of isolated term representations. Encoder models benefit primarily from whole-token preservation, while for decoder LLMs, tokenization effects emerge mainly at deeper retrieval ranks.
Clinical dialogue-to-note generation is challenging because clinically salient evidence is noisy, distributed across turns, and often revised later in the encounter. Direct transcript-only prompting and coarse intermediate scaffolds can therefore suffer from omissions, section leakage, unsupported fill-in, and brittle final-state tracking. We propose Clinical Atomic Propositions (CAPs), a dialogue-aware intermediate representation for faithful clinical note generation. CAPs extract source-grounded clinical assertions while preserving modifiers such as verification status, temporality, speaker/source, and action type. We also study an optional event consolidation layer that groups CAPs into problem-oriented care bundles before note rendering. We evaluate five methods on a 197-case ACI-Bench cohort: a transcript-only baseline, prompt-based reimplementations of Cluster2Sent and MEDSUM-ENT, CAP, and CAP+Event. The main task uses a sectioned-note template, with SOAP-template rendering and transcript-free rendering reported as ablations. We use MEDSUM-ENT-style GPT-R/P/F1 metrics and a proposition-grounded semCAP-R/P/F1 audit to measure concept-level and source-grounded faithfulness, complemented by case-level win/tie/loss analysis and clinician deep review. Results show that CAP improves preservation of transcript-grounded clinical propositions while remaining competitive on concept-level GPT metrics. CAP+Event is not uniformly better than CAP, but qualitative and boundary analyses show when problem-oriented consolidation can improve organization and when compression can introduce omissions. We release code, prompts, intermediate representations, generated notes, and evaluation artifacts at a public repository.
Processing unstructured clinical narratives remains a major challenge in medical Natural Language Processing (NLP), particularly when critical information is embedded within lengthy and heterogeneous reports. Clinical notes often describe key diagnostic and therapeutic events through a verbose narrative, making automatic event identification difficult. In this work, we frame the identification of clinical events as a text segmentation task. We conduct a comparative study of three segmentation strategies applied to oncology reports: (i) a fully regex-based approach, (ii) a cascaded regex–LLM pipeline, and (iii) the same cascade architecture augmented with a recovery mechanism to mitigate LLM rephrasing. Segmentation quality is evaluated using complementary structural metrics (Pk, WindowDiff, Boundary Similarity, Segment Count Accuracy, and Text Overlap IoU), and its impact is also observed on downstream segment tagging, performed to identify the corresponding event type (e.g. surgery, biopsy, imaging, treatment, laboratory). The results demonstrate the high potential of LLM-based approaches, particularly in preserving semantic coherence within segments and generalization on new data sources. However, regex-based segmentation achieves higher performance according to structural segmentation metrics, also leading to better downstream clinical event identification. In general, these results highlight the critical role of context-adaptive high-quality segmentation strategies in the structuring of verbose clinical narratives and in the accurate identification of key patient events.
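Of the structural metrics listed above, Pk is the simplest to sketch. The version below assumes each segmentation is given as one segment id per text unit (for example, per sentence) and sets the window size to half the mean reference segment length, which is the usual convention; the representation and function name are choices made for this example rather than details taken from the paper.

```python
def pk(reference, hypothesis, k=None):
    """Pk segmentation error: lower is better, 0 means the windows always agree.

    `reference` and `hypothesis` hold one segment id per text unit.
    """
    n = len(reference)
    if len(hypothesis) != n:
        raise ValueError("segmentations must cover the same units")
    if k is None:
        # Count reference segments via boundaries, then take half the mean length.
        n_segments = 1 + sum(reference[i] != reference[i - 1] for i in range(1, n))
        k = max(1, round(n / n_segments / 2))
    windows = n - k
    if windows <= 0:
        raise ValueError("text is too short for the chosen window size")
    errors = 0
    for i in range(windows):
        same_in_ref = reference[i] == reference[i + k]
        same_in_hyp = hypothesis[i] == hypothesis[i + k]
        errors += same_in_ref != same_in_hyp  # the two segmentations disagree
    return errors / windows

reference = [0, 0, 0, 0, 1, 1, 1, 1]      # one boundary in the middle
no_boundary = [0, 0, 0, 0, 0, 0, 0, 0]    # hypothesis that misses it
pk_perfect = pk(reference, reference)
pk_missed = pk(reference, no_boundary)
```

Pk is known to penalize false negatives and near misses unevenly, which is one reason it is commonly reported together with WindowDiff, as in the abstract.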
Clinical reasoning over electronic health records (EHRs) involves heterogeneous operations, including text interpretation, numerical computation, temporal filtering, and guideline-based aggregation. However, many existing LLM-based approaches still cast these heterogeneous operations as a single end-to-end generation process, obscuring their different reliability requirements and making intermediate failures difficult to inspect. We therefore propose a framework based on operation-mechanism alignment that represents clinical reasoning as a directed acyclic graph of typed operations, where each node is assigned to the execution mechanism best suited to its reliability requirements. The framework also preserves structured evidence provenance for intermediate results. Across six clinician-annotated binary decision tasks, the framework outperforms direct prompting, single-step retrieval-augmented prompting, and chain-of-thought baselines, supporting operation-mechanism alignment as a practical design principle for reliable clinical reasoning over EHRs.
Despite Spanish being one of the most widely spoken languages in the world, biomedical NLP resources and systematic evaluations remain limited relative to English. We address this gap by constructing and releasing two Spanish biomedical corpora: (1) **MeSHClass-ES**, a 29,063 abstract bilingual corpus translated from PubMed with Opus-MT, and (2) **AnatEM-ES**, the AnatEM anatomical entity corpus translated with a chunk-level LLM-based pipeline that jointly preserves BIO annotations across 13,849 entity mentions. Both corpora achieve a mean COMET score of 0.73 despite using different translation systems. We benchmark nine encoder models spanning general-domain Spanish, domain-specific, and multilingual architectures for both tasks. RigoBERTa-2.0 leads both tasks (micro-F1 classification 0.69, tied with SciBETO-large; NER F1 0.66). Both domain pretraining and model capacity drive performance, with the gap slightly more pronounced for NER (4-point spread) than classification (3-point spread). XLM-RoBERTa-large emerges as a competitive multilingual baseline. A parallel evaluation of four open-weight decoders (7–9B) reveals a task-dependent encoder-decoder gap: QLoRA-adapted Gemma-2-9B reaches 88% of the best encoder on classification (micro-F1 .61 vs .69), but for NER every decoder configuration we tested stays at or below 40% of the best encoder F1. We release both corpora on the HuggingFace Hub, and the translation pipelines and evaluation code on GitHub.
Biomedical retrieval-augmented LLMs are often evaluated under helpful retrieved context, but in practice the evidence can also be misleading or internally conflicting. This paper studies uncertainty under those harder settings using the HealthContradict benchmark and six open-weight models. We evaluate five controlled evidence conditions: no context, correct-only context, incorrect-only context, and two mixed conditions that contain the same correct and contradictory documents in opposite orders. Correct evidence improves both accuracy and calibration, while incorrect evidence substantially degrades both. Under conflicting evidence, document order also matters: reversing the order of the same two documents changes 11.4%–25.2% of predictions and consistently reduces performance when the incorrect document appears first. We further evaluate a conflict-aware abstention score that combines model confidence with a detector of evidence conflict. In the two hardest conditions, incorrect-only and incorrect-first conflict, this score improves selective accuracy over confidence-only abstention, with mean gains of 7.2–33.4 and 3.6–14.4 points across 75%, 50%, and 25% coverage. These results show that biomedical RAG systems should be evaluated not only under helpful retrieval, but also under misleading and conflicting evidence.
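Selective accuracy at a fixed coverage, and the idea of folding a conflict signal into the abstention score, can be illustrated as follows. The abstract does not say how confidence and the conflict detector are combined, so the product `confidence * (1 - conflict)` below is purely an assumption, as are the function names and the toy numbers.

```python
def selective_accuracy(scores, correct, coverage):
    """Accuracy on the `coverage` fraction of examples with the highest scores.

    The model abstains on everything else. `correct[i]` is 1 if prediction i is right.
    """
    n_keep = max(1, round(coverage * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = ranked[:n_keep]
    return sum(correct[i] for i in kept) / n_keep

def conflict_aware_scores(confidence, conflict):
    """Assumed combination: down-weight confidence when evidence conflict is detected."""
    return [c * (1 - k) for c, k in zip(confidence, conflict)]

# Toy example: the two most confident-looking errors coincide with detected conflict.
confidence = [0.9, 0.8, 0.7, 0.6]
conflict = [0.9, 0.1, 0.1, 0.8]
correct = [0, 1, 1, 0]

acc_confidence_only = selective_accuracy(confidence, correct, coverage=0.5)
acc_conflict_aware = selective_accuracy(
    conflict_aware_scores(confidence, conflict), correct, coverage=0.5
)
```

In this toy setting the conflict-aware score answers only the two correct cases, while confidence alone keeps one overconfident error; real gains depend on how well the conflict detector is calibrated.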
Research Domain Criteria (RDoC) is a National Institute of Mental Health framework for studying mental disorders by integrating information across genetics, circuits, and behavior. Manually curating biomedical abstracts relevant to RDoC is a significant challenge due to semantically overlapping construct definitions (e.g., "Acute Threat," "Potential Threat," and "Sustained Threat") and the exponential growth of biomedical literature. This study compares two modeling strategies, domain-adapted fine-tuning and in-context prompting, across two RDoC-related subtasks from the official BioNLP-OST 2019 RDoC shared task. For Task 1, unlabeled PubMed abstracts are retrieved and ranked by relevance to eight of the RDoC constructs. We compare a TF-IDF baseline against ModernBERT and Llama (zero-shot and five-shot) using Mean Average Precision (MAP). For Task 2, the objective is to identify the single most relevant sentence from an abstract for a given construct, evaluated using per-construct accuracy. The fine-tuning track performs end-to-end fine-tuning of BioBERT, PubMedBERT, ModernBERT, and RoBERTa using a cross-encoder input format and per-construct grid search. These are compared against the in-context learning of several open-source language models. Both our approaches are competitive against the best-performing team’s score from the BioNLP-OST 2019 RDoC shared task. Taken together, these findings suggest that five-shot prompted LLMs and domain-adapted fine-tuned transformers are viable tools for semi-automating the expert annotation in RDoC curation.
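Mean Average Precision, the Task 1 metric named above, can be sketched in a few lines. The implementation assumes each ranked list contains all relevant abstracts for its construct, so average precision is normalized by the number of relevant items found in the list; the shared task's official scorer may differ in such details.

```python
def average_precision(ranked_relevance):
    """Average precision for one ranked list of 0/1 relevance labels (best rank first)."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant item's rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rankings):
    """Mean of per-query (here: per-construct) average precision."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Toy rankings for two constructs (hypothetical labels).
rankings = [
    [1, 0, 1],  # relevant abstracts at ranks 1 and 3
    [0, 1],     # single relevant abstract at rank 2
]
map_score = mean_average_precision(rankings)
```

Because each construct contributes equally to the mean, constructs with few relevant abstracts weigh as much as those with many.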
Clinical sources and patient-authored reviews often describe antidepressant side effects in different ways, but these differences are rarely measured directly. We present ClinPeer-AE, a linked dataset for comparing side-effect evidence from PubMed, ClinicalTrials.gov, WebMD, and Drugs.com while preserving source identity. Across five widely prescribed antidepressants, we find low overlap between clinical and peer sources, large differences in relative emphasis, and evidence that many peer-only effects also appear in U.S. Food and Drug Administration Adverse Event Reporting System (FAERS) reports. These findings suggest that patient reviews provide useful context about recurring medication experiences and offer a complementary view of how side effects are described outside formal clinical settings.
Retrieval-augmented generation (RAG) holds promise for clinical question answering over electronic health records (EHRs), but existing systems treat retrieval as an opaque subroutine, limiting auditability and reliability in patient care workflows. We introduce a deterministic multi-stage retrieval pipeline for longitudinal EHR question answering that decomposes retrieval into four distinct, ablated stages where each stage is instrumented with diagnostic metrics, making the flow of clinical evidence measurable and auditable at every step. Evaluated on a broad LLM-annotated cohort and an expert-annotated cardiovascular benchmark developed alongside clinicians from real ICU records, the full pipeline achieves 22-23% relative recall gain over a strong dense retrieval baseline across both cohorts, with consistent improvements in downstream answer quality. The pipeline’s deterministic and transparent design addresses a critical gap in clinical NLP: retrieval systems that clinicians and researchers can not only rely on, but inspect, audit, and build upon for real-world deployment.
Transformer-based models such as PLM-CA achieve strong performance for automatic ICD coding, but their attention weights do not provide faithful explanations of their predictions. This is a major limitation for electronic medical records, where users often need concise and trustworthy evidence for each assigned code. To address this issue, we jointly train a sentence extractor and an ICD code classifier such that predictions are based only on the extracted sentences. As a result, the extracted sentences serve as faithful rationales for each predicted code and substantially reduce the effort required to inspect long medical records. Experiments on MIMIC-III show that our method approaches the performance of a transformer baseline that processes the full record while using only a small fraction of the document.
Highly technical medical terms are difficult for patients to understand during fast-paced hospital consultations, leading them to rely on Large Language Models (LLMs) for simplified explanations. However, LLMs can produce inaccurate or false information. Since expert evaluation is costly and time-consuming, the LLM-as-a-Judge (LaaJ) approach is increasingly adopted to assess the quality of LLM-generated text. In this paper, we investigate the reliability and robustness of LaaJ for specialized medical knowledge by evaluating six LLMs for their judgment capabilities on three dimensions: correctness, readability, and completeness. We use three judgment setups (Vanilla, Epistemic, and Bias) to probe robustness, and assess them against human expert annotations to measure alignment. To address the lack of specialized medical benchmarks, we introduce BrainCancerDB, an English dataset of 219 brain cancer terms with 23,652 annotations. Our findings indicate that while LLM-Judges and humans display similar trends in ranking simplified explanations, LLM-Judges tend to be more lenient on correctness, which may have serious implications in medical settings. Additionally, we observe that hallucinations in LaaJ setups can be mitigated by epistemic markers.
Molecular representation learning aims to capture chemically meaningful structures for various downstream tasks such as accurate molecular property prediction. However, incorporating functional group (FG) information into SMILES-based models remains challenging. The absence of explicit alignment between graph-defined FG atom sets and tokens in sequence prevents complete substructure masking, while multiple valid SMILES forms of the same molecule lead to inconsistent FG representations in token space. To address these challenges, we propose FACT (Functional Group Alignment and Consistency in Token Space), an end-to-end framework for structure-aware SMILES-based representation learning. FACT introduces an atom–token alignment module for complete FG span masking during pre-training and enforces FG consistency across different SMILES forms during fine-tuning. Experiments on MoleculeNet benchmarks show that FACT achieves state-of-the-art or competitive performance on eight tasks, demonstrating the effectiveness of alignment and consistency learning for molecular representation.
Large Language Models such as GPT-4o and GPT-5 achieve strong zero-shot performance on biomedical claim verification, but cost and opacity limit scalable use. We fine-tune three small LLMs (Phi-3-mini 3.8B, Qwen2.5-3B, and Mistral-7B) via QLoRA on SciFact and HealthVer, providing the first study of QLoRA models against GPT-4o and fine-tuned BioLinkBERT encoders. Mistral-7B QLoRA achieves higher F1 than both GPT-4o and GPT-5 (up to 12% gain) at 44.5x lower cost using just 1,008 training examples, representing a compelling cost-quality trade-off. We conduct extensive in-domain and cross-domain evaluation: models trained on SciFact tested on HealthVer and vice versa, at matched sizes to isolate dataset structure from data quantity. We identify a previously unreported structural artifact in SciFact that inflates in-domain scores, and show through bidirectional out-of-domain evaluation that training on structurally sound data enables robust cross-domain transfer. We plan to release all code and adapter checkpoints.
Reliable biomedical and clinical retrieval requires more than strong ranking performance: it requires a practical way to find systematic model failures and curate the training evidence needed to correct them. Late-interaction models such as ColBERT provide a first solution thanks to the interpretable token-level interaction scores they expose between document and query tokens. Yet this interpretability is shallow: it explains a particular document–query pairwise score, but does not reveal whether the model has learned a clinical concept in a stable, reusable, and context-sensitive way across diverse expressions. As a result, these scores provide limited support for diagnosing misunderstandings, identifying unreasonably distant biomedical concepts, or deciding what additional data or feedback is needed to address them. In this short position paper, we propose Diagnosable ColBERT, a framework that aligns ColBERT token embeddings to a reference latent space grounded in clinical knowledge and expert-provided conceptual similarity constraints. This alignment turns document encodings into inspectable evidence of what the model appears to understand, enabling more direct error diagnosis and more principled data curation without relying on large batteries of diagnostic queries.
Much of our knowledge about anatomy and physiology is found in text format in research papers and medical textbooks. For an information system to have access to this knowledge, extracting and translating it into a computable format that can be stored in an ontology or knowledge graph is advantageous. Unfortunately, existing text mining corpora, which are needed to train and evaluate data mining models, are old and consist almost entirely of research papers, which rarely contain complete information needed to capture complex normal physiological processes and, subsequently, understand the pathophysiology of a disease. As a first step to filling in this gap, we have developed a guide for annotating medical textbooks for physiological events and entities involved in these events. In addition to providing our guidelines and describing the guideline development process, we analyze the coverage of normal physiology in existing ontologies.
Clinical Concept Normalization is essential for clinical research applications involving trial protocols, such as patient-trial matching. Existing approaches focus heavily on specific domains and need large, annotated datasets. To address these challenges, we propose CENT, a context engineering framework that combines semantic matching for candidate retrieval and Large Language Model (LLM) prompting for disambiguation. We applied CENT to a publicly available dataset of procedures normalized to Current Procedural Terminology (CPT) concepts and evaluated the framework using both binary and hierarchical metrics that take into account hierarchical characteristics of predicted codes. CENT achieves superior performance on clinical procedure normalization in both binary and hierarchical metrics compared to semantic matching or LLM-only approaches, without requiring fine-tuning. Advanced prompt strategies, including Chain-of-Thought and Tree-of-Thoughts, achieve high performance at practical cost. We further applied CENT to predict codes in two clinical protocol-derived datasets to validate its performance on noisy procedure texts. CENT is scalable and adaptable for normalization across diverse clinical vocabularies in real-world clinical applications.
Clinical documentation places significant time demands on medical professionals, consumes institutional resources, and is prone to errors that may compromise patient care. Recent advances in LLMs offer promising approaches for automating clinical note generation; however, the impact of different AI architectural designs remains underexplored, particularly for agentic AI systems. This study compares three architectures (single-LLM, multi-agentic, and swarm-agentic) for automated SOAP (Subjective, Objective, Assessment, Plan) note generation from doctor–patient dialogues. All approaches employ QLoRA-finetuned Ministral 3 models (3B and 8B parameters) trained on the MedSynth dataset, comprising 10,030 dialogue–note pairs across 2,006 ICD-10 code classes. Performance is evaluated using ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore against a lexical-overlap baseline (dialogue vs. ground-truth SOAP, no inference). Results show that all finetuned models substantially outperform the baseline, while differences between architectural variants remain marginal. The single-LLM setup achieves the strongest performance across all metrics; 3B and 8B variants perform nearly identically on semantic similarity (BERTScore), while ROUGE differences are small but statistically significant. Qualitative inspection further reveals that residual differences across architectures are driven primarily by shared dataset priors rather than by architectural reasoning capacity. The results are based on synthetic data without human evaluation and reflect architectural behavior only.
Retrieval-augmented generation (RAG) reduces hallucination in large language models by grounding outputs in retrieved evidence, but it does not guarantee that the resulting citations support the associated claims. We present VERICITE, a framework for evaluating citation faithfulness in retrieval-augmented medical QA. Our system retrieves PubMed abstracts via the NCBI E-Utilities API, prompts LLMs to generate answers with inline citations, and verifies each citation at the sentence level using a DeBERTa-v3-large NLI model. We evaluate four LLMs on 500 BioASQ questions at retrieval depths of 3 and 5, with extended experiments up to k = 15 and an oracle setting with gold standard documents. Only 27–41% of citation pairs are supported at the sentence level at retrieval depths of 3 and 5, with support rates declining further at larger k. Under the oracle condition, answer quality improves, but citation faithfulness does not substantially improve, suggesting that generation-side citation behavior contributes substantially to unfaithful citations.
This paper presents an overview of the Medical Decision Extraction, Analysis, and Classification task (MedExACT) of BioNLP 2026. The focus of this task is the extraction and labeling of medical decisions in ICU discharge summaries. The task is built on MedDec, a MIMIC-III-based dataset of 451 expert-annotated summaries, and asks systems to extract and classify spans of text that contain medical decisions according to the decision categories defined in the Decision Identification and Classification Taxonomy for Use in Medicine (DICTUM). The official ranking combines span F1 and token F1 with a worst-group robustness metric computed over sex, race, and English-proficiency subgroups. MedExACT attracted broad international interest, with 130 official submissions from 36 teams comprising about 60–100 participants, and has improved information extraction performance by nearly 15% over the previous state of the art. The submitted systems predominantly use long-context encoder models, ensemble decoding, boundary-refinement modules, and robustness-aware training or model selection, with the best submitted run reaching a final fairness-based F1 of 0.596.
Biomedical abstracts play a critical role in downstream NLP applications, such as information retrieval, biocuration, and biomedical knowledge discovery. However, a non-trivial number of biomedical articles do not have abstracts, diminishing the utility of these articles for downstream tasks. We propose DPR-BAG (Divide, Prompt, and Refine for Biomedical Abstract Generation), a training-free, zero-shot framework that generates coherent and factually grounded abstracts for biomedical articles with full text but no abstract. DPR-BAG decomposes full-text documents into structured rhetorical facets following the Background-Objective-Methods-Results-Conclusions (BOMRC) schema, performs parallel LLM-based summarization for each facet, and applies a final refinement stage to restore global discourse coherence. On PMC-MAD, a distribution-aligned dataset of 46,309 biomedical articles, DPR-BAG improves abstractive novelty over strong extractive and fine-tuned baselines, while maintaining factual consistency. Our ablation study reveals a counterintuitive finding: increasing prompt complexity or explicitly injecting entity-level guidance can degrade factual alignment, highlighting the importance of controlled prompting strategies. These findings underscore the potential of training-free, structure-aware frameworks for scalable biomedical abstract generation in low-resource settings. Our data and code are available at https://huggingface.co/datasets/pmc-mad/PMC-MAD and https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/DPR-BAG.
Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity where the main entities of interest are autoimmune diseases, autoantibodies (i.e. molecules that may mark or cause these diseases), their molecular targets, their location in the body, and the associated clinical signs. Herein, we present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed that we manually annotated for those entities and their relationships. First, AAbAAC was used to evaluate several methods on the task of named entity recognition (NER), and second, to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity, showing expected improvement in NER performance after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at: https://github.com/f-maury/AAbAAC .
Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train–test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.
Hallucinations in biomedical question answering are hard to define and compare because the literature uses overlapping and inconsistent terms. There is currently no grounded definition set that works for biomedical QA, with real examples from open-source LLMs. We introduce a layered definition of hallucinations for biomedical QA, hierarchically structured from the overarching idea of Hallucination in relation to generated model content, to source and consistency orientations, and finally to subtypes. We ground our definition taxonomy in source-attributed literature definitions and reproducible examples from REMOVED FOR REVIEW, where cases can be traced to the question, source passage, generated answer, and annotation record. We provide a framework with annotation, comparison, and error analysis to provide a clearer reference for evidence-grounded biomedical QA. We aim for this example-grounded taxonomy to support automated detection of hallucinations and their potential harmfulness.
Systematic reviews underpin evidence-based medicine but can become outdated quickly when new evidence appears. We formulate a novel prediction task: given a review and new studies that have appeared since its publication, predict whether the review’s conclusions will change. We construct a dataset of 3,326 Cochrane review-update pairs and explore a range of approaches, including feature-based baselines, zero- and few-shot LLMs, and parameter-efficient fine-tuning. Fine-tuning Qwen2.5 14B achieves the highest AUC-ROC (70.4%).
Systematic reviews are fundamental to evidence-based medicine, but the clinical evidence they contain is primarily expressed in unstructured text, making large-scale extraction and reuse difficult. Existing biomedical NLP methods have achieved strong performance on span-level extraction from clinical trials and abstracts; however, these approaches are insufficient for systematic reviews, where evidence is often distributed across multiple studies, sentences, and sections and must be aggregated into normalized document-level attributes. We introduce VaxScope, a benchmark dataset for document-level structured evidence extraction from immunization-related systematic reviews. VaxScope is constructed through an expert-guided semi-automatic annotation pipeline that combines automatic candidate generation with domain expert validation to ensure consistency and annotation quality. We formalize the task as document-level structured extraction, where target labels are defined at the review level and require aggregating evidence beyond isolated textual spans. We further establish baselines for document-level structured extraction using abstract-level input representations and evaluate how access to evidence-grounded contextual input improves performance over abstract-only settings. Baseline experiments show that PubMedBERT achieves the best overall performance (Avg F1: 0.850), with evidence-grounded input improving performance particularly for fields requiring distributed contextual reasoning.
The variation in writing style encapsulates nuanced characteristics, which are often exploited for author or demographic identification. In the medical domain, language models are frequently deployed to capture relevant information from unstructured or complex data, such as clinical notes that often include patients’ medical histories. Such data is largely free-form and unstructured, obtained through diverse clinician–patient interactions. In this work, we present a case study investigating whether variations in clinicians’ writing styles can lead to differences in medical context understanding capabilities for pre-trained language models (PLMs) on downstream tasks, such as medical event classification. Our findings indicate that variation in writing style, characterized by linguistic features, can indeed lead to suboptimal performance in deployed systems. Furthermore, we explore linguistically guided counterfactual reasoning to mitigate the impact of writing style variation; the results suggest that LLM-based stylistic normalization is effective for this purpose.
The exponential growth of biomedical literature has made manual curation of biological interaction networks increasingly difficult. Existing automated biological interaction extraction systems address the scaling challenge but treat extraction as a final step, delivering structured output with limited or no integrated support for biologists to interactively verify, correct and contextually interrogate extracted interactions against their source evidence within the same environment. We present Knowledge-Assisted Literature Mining for Biological Interaction Analysis (KALIMBA), an end-to-end, human-in-the-loop platform that integrates three complementary extraction methods (NLP-only, LLM-only, and hybrid) alongside expert annotation and evidence-grounded conversational querying through a retrieval-augmented generation (RAG) chat module driven by a dual-context prompt, within a single unified workflow. Evaluation on a corpus of 40 signaling-focused papers demonstrates that the LLM-only back-end recovers substantially more interactions than the NLP-only approach. RAG chat evaluation by a domain expert confirms that the conversational module provides scientifically grounded responses that support curation decisions beyond what the structured interaction table alone conveys.
Medical question answering is a high-stakes setting where factual errors can have serious consequences. Retrieval-augmented generation (RAG) is widely viewed as a promising solution, and prior work has reported substantial gains for large medical QA models. We revisit this assumption across a broad range of open-weight instruction-tuned models spanning 7B to 72B parameters. Across five models, ten biomedical QA datasets, four retrieval methods, and four retrieval corpora, we find that retrieval yields only small and inconsistent improvements over a no-retrieval baseline, typically within 1–2 points. In contrast, the choice of backbone model has a much larger effect than the choice of retriever or corpus, and expert and layman retrieval sources perform similarly in most settings. These results suggest that the main bottleneck is not retrieval quality alone, but the model’s limited ability to use retrieved evidence effectively.
LLM-based drug–drug interaction (DDI) assessment remains difficult to audit when predictions are not explicitly tied to evidence. While retrieval-augmented generation (RAG) improves grounding, predictions are not guaranteed to be entailed by retrieved items. We present CrossDDI, a verification-first framework that separates LLM-based evidence extraction from deterministic, LLM-free arbitration over DrugBank and PubMed, requiring positive predictions to be linked to explicit supporting evidence. Evaluated on 1,000 DDInter 2.0 pairs under a positive–unlabeled setting, CrossDDI achieves recall of 0.576–0.593 over confirmed positives with interaction prediction rates comparable to RAG, while reducing cross-backbone variation (0.018 vs. 0.066). Analysis identifies literature evidence acquisition and attribution as the primary bottleneck: PubMed retrieval covers only 40.5% of confirmed positives, and Path B-only evidence is substantially less reliable than structured evidence. These results suggest that verification-first architectures can improve traceability and backbone consistency, while broader and more reliable literature evidence is needed to extend coverage beyond structured sources.
Even in the era of large language models (LLMs), biomedical relation extraction (RE) still plays a major role in timely creation of knowledge graphs that further guide biomedical knowledge discovery. The main task in RE is to extract a relation "as expressed" in an input text. At times, crucial definitional information or other auxiliary information about the entities involved may be missing from the input text. Augmenting it from other external textual sources appears helpful on the surface but can be harmful too, as these sources can overwhelm the signal in the original input, leading to false positives or false negatives. To counter this, we leverage a pre-trained biomedical text retriever to augment original inputs with additional instance-specific snippets. This is done through a gating mechanism that allows the retrieved snippets to enhance but not overwhelm the signal from the original input. We evaluate our approach on three standard biomedical relation extraction datasets (CDR, BioRED, and ChemProt) and show consistent improvements (up to 10 F1 points) compared with strong supervised baselines involving both encoder and decoder models. All our code and the datasets used are available for reuse: https://github.com/bionlproc/GRAFT-RE.
We present an overview of PsyDefDetect, the shared task on detecting levels of psychological defense mechanisms in emotional support dialogues, co-located with BioNLP@ACL 2026. Grounded in the clinically validated Defense Mechanism Rating Scales (DMRS) framework, the task asks systems to classify a target seeker utterance, given its preceding dialogue context, into one of nine categories: seven hierarchical DMRS levels plus two auxiliary labels. Participants worked on PsyDefConv, a newly released corpus of 200 dialogues and 2336 help-seeker utterances annotated under DMRS with substantial inter-annotator agreement. The task attracted 172 participants on CodaBench who produced 563 submissions, with 21 teams officially registering their results for the final ranking. The best system achieved a macro F1-score of 0.420, surpassing the strongest fine-tuned baseline reported in the dataset paper by a notable margin, yet leaving clear headroom. Our analysis highlights (i) a persistent tendency to over-predict the majority High-Adaptive class, (ii) a widening gap between accuracy and macro-F1 that reveals class-imbalance sensitivity, and (iii) the value of theory-aware and LLM-based approaches for fine-grained defensive-function classification. We release all task materials and invite the community to continue work on this novel intersection of clinical psychology and NLP.
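The gap between accuracy and macro-F1 mentioned in the PsyDefDetect abstract above can be made concrete with a toy example. The sketch below uses two hypothetical labels and made-up counts, not PsyDefConv data: a classifier that always predicts the majority class scores well on accuracy while macro-F1, which averages per-class F1 with equal weight, stays low.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical imbalanced sample: 8 majority-class items, 2 minority-class items.
y_true = ["high_adaptive"] * 8 + ["other"] * 2
# A degenerate system that always predicts the majority class.
y_pred = ["high_adaptive"] * 10

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

Here the majority class reaches an F1 of 8/9 while the minority class scores 0, so macro-F1 is roughly half the accuracy.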
Systematic reviews of clinical trials require analysts to extract attributes that are rarely stored as ready-made columns, such as the drug class of an immunotherapy named in a regimen, the additional agents combined with it, or whether a listed endpoint is a primary or secondary outcome. These attributes must be inferred from the visible content of other fields through normalization, classification, or structured extraction, and existing approaches such as direct LLM prompting, text-to-SQL, and agentic pipelines leave this reasoning implicit in a single generation step or pay a heavy execution cost for limited accuracy gains. We propose SCOPE (Structured Clinical hybrid Planning for Evidence retrieval in clinical trials), a multi-LLM planner-based framework that decomposes the task into row selection, structured planning, and execution. The planner makes the source field, reasoning rules, and output constraints explicit before answer generation, reducing ambiguity relative to direct prompting. We evaluate SCOPE on 1,500 hybrid reasoning questions over oncology clinical-trial tables against zero-shot, few-shot, chain-of-thought, TableGPT2, BlendSQL, and EHRAgent. Results show that explicit multi-LLM planning improves accuracy for reasoning-based questions while offering a stronger accuracy-efficiency tradeoff than heavier agentic baselines. Our findings position clinical trial reasoning as a distinct table understanding problem and highlight hybrid planner-based decomposition as an effective solution.
Visual-language models (VLMs) are rapidly advancing on tasks that require visual understanding of text, tables, plots, and diagrams. Yet extracting structured information from text-heavy scientific diagrams remains challenging, as it requires not only OCR but also recovery of layout, grouping, and flow relationships. We study this problem in the context of CONSORT flow diagrams, which summarize participant screening, randomization, follow-up, and analysis in randomized controlled trials. We introduce a 200-example benchmark of PubMed Central diagrams, annotated by a biomedical team specializing in systematic literature reviews and clinical evidence extraction, and evaluate schema-constrained CONSORT extraction across proprietary and open-weight model families. Using structure-aware metrics, we compare single-pass and stepwise extraction strategies. Expert-guided single-pass extraction performs best for proprietary frontier models, with Gemini 3 Pro achieving the strongest overall results, whereas stepwise prompting improves less capable open-weight models on challenging arm-level extraction. These results offer practical deployment guidance and suggest that high-quality schema-constrained extraction is feasible, but not yet solved.
Tabular data is widely used in important areas such as healthcare and finance, but building accurate models in real-world settings faces three main challenges: protecting data privacy, handling distributed data, and maintaining strong performance. Existing methods do not solve these issues together. Converting tabular data into text for Large Language Models (LLMs) can expose sensitive information, struggle with anonymized features and exact numerical values, and require expensive training while often not outperforming traditional tree-based models. In addition, many real-world datasets are spread across different institutions, making centralized training impossible. We propose a federated framework that connects distributed tabular data with LLM reasoning using decision tree rules as privacy-preserving intermediaries. Each client trains a local Random Forest and shares only extracted rules?feature comparisons and thresholds, without revealing raw data. These rules are combined into a global pool, allowing an LLM to generate a better partitioning rule without accessing any original data, adding an extra layer of privacy. Using this rule, each client learns local gradient-based corrections, which are then aggregated. We also show that this process reduces prediction error. Experiments on 12 datasets, including seven medical tasks, show that our method consistently outperforms federated baselines and achieves results close to centralized models.
We introduce MedBench, a benchmark for evaluating medical language models as deliberating agents rather than isolated predictors. MedBench evaluates eight models (4B–32B) on 19,625 questions from six medical QA datasets using Consensus-Aware Model Panel (CAMP), a two-tier protocol in which five 4B–8B models answer independently, revise after observing peer reasoning, and escalate persistent disagreements to larger 20B–32B models. Compared with zero-shot, few-shot, and chain-of-thought baselines, CAMP shows that deliberation is not uniformly accuracy-improving, but reveals interaction-driven behaviors hidden by single-model evaluation. On PubMedQA without external context, the 4B–8B panel outperforms the evaluated 20B–32B individual zero-shot models (54.1% vs. 33.9%), and achieves the best evaluated result with context (75.7%), suggesting that structured interaction can sometimes complement scale. Across five datasets, initial inter-model agreement is positively associated with correctness and serves as a useful difficulty signal. However, on MedXpertQA, unanimous agreement yields only 6.6% accuracy despite 14.4% overall accuracy, suggesting correlated ignorance, where shared biases make consensus misleading. Error analysis shows that most failures are debate-insufficient cases, where incorrect majorities persist despite interaction (93–97%), while debate-harmful cases account for 3–7%. MedBench positions deliberative evaluation as a complement to accuracy-centric benchmarking, measuring when model interaction corrects errors, reinforces shared mistakes, or signals the need for stronger evidence and human review.
dbt mimic omop is a free, open-source resource that converts the MIMIC-IV dataset to the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) format on consumer-level hardware. CDM approaches are increasingly adopted in both industry and academia due to the need for interoperability and reproducibility, including in clinical NLP tasks such as cohort selection, information extraction, and retrieval-augmented generation. The MIMIC-IV database is among the most widely used critical care research datasets, yet existing pipelines to transform it to OMOP depend on enterprise database infrastructure and complex orchestration, limiting accessibility for practitioners and resource-constrained researchers. We further integrate free-text clinical notes (195.6M clinical annotations) and chest radiographs into the OMOP note_nlp and imaging extension tables, making all MIMIC-IV modalities (structured data, free text, and imaging) accessible through a common data model. This resource generates a more comprehensive dataset than existing alternatives and is intended to aid in system development, testing, and evaluation.
The rapid expansion of biomedical literature makes manual identification of novel drug-disease relationships increasingly difficult. Existing approaches have leveraged LLMs to mine abstracts or construct knowledge graphs for drug repurposing. These approaches have two key limitations: finite context windows restrict their ability to capture macro-level research trends, and single-pass black-box pipelines make outputs difficult to verify. This paper proposes a pipeline for discovering new drug targets by combining disease and drug research trends using Large Language Models (LLMs). Our method extracts PICO components from PubMed abstracts, normalizing the Population and Intervention components to ICD and ATC codes, respectively. A temporal frequency delta matrix is constructed to capture publication count shifts from 2013 to 2022, then used to discover novel drug areas. Compared with the abstract-based baseline, our approach showed qualitative signs of generating combinations that were more closely aligned with observed research trends and, in some cases, more clinically plausible. These findings suggest the potential usefulness of structured trend information for LLM-based exploration, although the differences between the two methods were limited and the results remain preliminary. Future work will focus on validating the consistency and reliability of these candidates.
Family health history (FHx) offers insight into a person’s health and disease risk, but it is largely held within free-text clinical formats that require processing for maximal utility of the data. The rapid deployment of ambient AI scribes and conversational agents in clinical settings necessitates evaluation on dynamic patient-clinician and patient-agent dialogs. To address this gap, we introduce two new datasets of patient FHx dialog documents designed to benchmark information extraction and entity linking. Distinct from clinician-entered datasets, patient-reported dialog data has its own semantic and content characteristics, which need to be studied for more patient-centered healthcare. We contribute a publicly available resource called FHexchange, with new annotations for family members, clinical observations, related entities, and standardized UMLS CUIs, offering the clinical NLP community a robust evaluation bed for emerging generative AI tools.
Dementia detection from spontaneous speech offers a scalable approach to cognitive screening, yet NLP systems remain predominantly English-centric. This limitation is especially acute in the Philippines, where Filipino–English code-switching is pervasive and no prior work has addressed NLP-based dementia detection. We present the first systematic evaluation of transformer-based dementia detection in Filipino speech and the first assessment of NeoBERT in a clinical NLP setting. To separate language from domain effects, we construct a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts, with Filipino translations produced manually to preserve discourse-level markers of cognitive decline. We evaluate five model families, TF-IDF + LogReg, BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog, under monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. We find that in-domain performance does not transfer across languages, with English-trained BERT dropping to Macro-F1 = 0.455 on Filipino, and that architectural modernization alone does not improve robustness. Bilingual fine-tuning, however, eliminates cross-lingual degradation across all transformer models, converging to Macro-F1 = 0.969–0.973. These results suggest that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than model scale or architecture.
We present IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages (Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu). The dataset extends the MDDial corpus with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors introduced during automatic translation. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation (LoRA) of a quantized small language model, incorporating an optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate IndicMedLM against zero-shot multilingual baselines across ten languages and conduct systematic error analysis, identifying five failure modes: Instruction Drift, Label Collapse, Cross-Domain Confusion, Tokenization Failure, and Paraphrase-over-Label Generation. Results show strong post-processed diagnostic accuracy in Hindi, Marathi, and Bengali, while Assamese, Tamil, and Telugu remain in an extreme failure tier attributable to base-model tokenizer gaps, a finding with direct patient safety implications. Medical expert evaluation confirms the clinical plausibility and safety of the generated consultations.
Three-dimensional Computed Tomography (3D CT) is a cornerstone of precision medicine. Most AI diagnostic models analyze large numbers of CT slices uniformly, treating all slices as equally important. While this has partly accelerated radiologists’ workflows, it overlooks that clinically relevant information is often sparsely distributed throughout a volume. Without targeted or weighted processing, fine-grained cues may be missed and substantial computation wasted on diagnostically uninformative slices. We propose a radiologist-simulating framework for selective and efficient 3D CT interpretation. Evaluated on a 3D CT dataset covering eight thoracic lesion types, it was compared with state-of-the-art multimodal large language models such as GPT-4o and supervised visual backbones including ViT and ResNet-50. Using accuracy, F1-score, AUC, and blind radiologist assessment, Screen-CLIP achieved an AUC of 0.87 and F1-score of 0.82, surpassing ViT Base (AUC: 0.84). For report generation, our method outperformed M3D across all metrics, reaching a BLEU-Avg of 29.03, and achieved the highest average Doctors’ Score (6.16/10) in a preliminary human evaluation.
Biomedical language models can generate overly confident clinical statements despite incomplete or ambiguous evidence. We study whether linguistic uncertainty (the hedged epistemic stance expressed in phrases such as "consistent with" or "cannot exclude") is encoded in model representations and can be controlled without retraining. Across six biomedical language models spanning two architectures (causal decoders and bidirectional encoders), we show that uncertainty is captured by robust low-dimensional linear structure in hidden states. We then apply activation steering to manipulate this representation directly, increasing hedged generation in decoder models and inducing targeted uncertainty-related shifts in encoder representations. Together, these results show that epistemic stance is not merely a surface linguistic phenomenon but an interpretable and controllable feature of biomedical language model representations, with implications for safer and more calibrated clinical text generation.
We study how simple linguistic features relate to reader preferences in medical question answering. Our dataset contains answers to medical questions ranked in order of quality. We examine eight interpretable features of the answer text: length in words, average words per sentence, percentage of polysyllabic words, medical named entity density, perplexity, coherence, and dependency distance. We find substantial variation across annotators in both the strength and direction of these relationships. Answer length shows some of the strongest associations and predictive signals, but preferences are not consistent across annotators, with some favoring longer answers and others favoring shorter ones. A leave-one-out ablation study shows the relative impact on the predictive accuracy of our models. Overall, these results suggest that linguistic form can influence reader preference in medical text, but that these effects vary across readers and may be more complex than simple linear correlations.
This paper presents an overview of the MedGenVidQA 2026 shared task on medical video question answering, co-located with the 25th BioNLP workshop at ACL 2026. The shared task addressed three related sub-tasks of medical multimodal (textual and video) question answering: (i) multimodal retrieval, (ii) multimodal answer generation with citations, and (iii) visual answer localization. The key theme of the shared task is to develop reliable multimodal question answering systems for consumers and medical professionals by leveraging generative models. A total of nine teams participated and made forty-three submissions across all tasks. We performed both automated and human assessments to evaluate the submissions. This paper describes the tasks, datasets, evaluation metrics, participation, and baseline systems for all three tasks. Additionally, we summarize the techniques and results of the evaluation of the various approaches explored by the participating teams. Finally, we discuss the key findings and implications for the development of multimodal medical question answering.
This paper presents an overview of the ClinicalSkillQA 2026 shared task, which was organized with the BioNLP Workshop at ACL 2026. The goal of this shared task is to evaluate continuous perception and procedural reasoning in clinical skill assessment by requiring systems to reconstruct the correct temporal order of shuffled clinical key frames and generate rationales grounded in clinical workflow knowledge. The benchmark contains 200 test-only instances sampled from clinical skill videos, covering three emergency-care procedures. Each instance is annotated with the ground-truth temporal order and an expert-verified rationale. A total of seven teams participated in the task, collectively making 90 submissions, with four teams providing system description papers. Systems are evaluated using Task Accuracy, Pairwise Accuracy, and BERTScore, which measure exact sequence reconstruction, local temporal consistency, and rationale quality, respectively. In this paper, we describe the task setup, dataset construction, and evaluation criteria. We further summarize the methodologies adopted by participating teams and present a comprehensive analysis of the submitted systems. The official results suggest that current models still struggle with continuous perception and procedural reasoning, especially when they must integrate visual evidence, temporal structure, and clinical workflow knowledge.
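The two ordering metrics named above can be illustrated with a minimal sketch. The abstract gives only their intent (exact sequence reconstruction and local temporal consistency), so the definitions below are assumptions: Task Accuracy is taken as an exact match of the full sequence, and Pairwise Accuracy as the fraction of frame pairs whose relative order agrees with the ground truth. The function names and the toy frame labels are hypothetical, not the official scorer.

```python
from itertools import combinations


def task_accuracy(pred, gold):
    """Assumed definition: 1.0 if the predicted order equals the gold order, else 0.0."""
    return 1.0 if list(pred) == list(gold) else 0.0


def pairwise_accuracy(pred, gold):
    """Assumed definition: share of frame pairs ordered the same way in pred and gold."""
    position = {frame: i for i, frame in enumerate(pred)}
    # combinations() yields (a, b) with a appearing before b in the gold order.
    pairs = list(combinations(gold, 2))
    if not pairs:
        return 1.0
    agree = sum(1 for a, b in pairs if position[a] < position[b])
    return agree / len(pairs)


# Toy instance with hypothetical frame labels: one adjacent swap.
gold = ["A", "B", "C", "D"]
pred = ["A", "C", "B", "D"]
exact = task_accuracy(pred, gold)
pairwise = pairwise_accuracy(pred, gold)
```

Under these assumed definitions, a single adjacent swap zeroes the exact-match score while leaving five of the six pairs correctly ordered, which is why the two metrics can diverge sharply on the same prediction.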

Proceedings of the BioNLP 2026 (Shared Tasks)

This paper describes our participation in the CRF Filling Shared Task 2026, which aims to automatically populate a predefined Case Report Form (CRF) from clinical notes describing patients with dyspnea. We propose a two-stage pipeline based on large language models (LLMs). In the first stage, a few-shot prompted LLM extracts candidate CRF fields from the clinical note and outputs them in a structured JSON format. In the second stage, a separate LLM verifies each extracted field against the original note and removes predictions that are not supported by explicit textual evidence. This verification step aims to reduce false positives generated during extraction. Experiments on the development set show that the verification stage significantly reduces unsupported predictions while preserving most correct extractions, resulting in improved macro F1. On the official test set, the proposed system achieves a macro F1 score of 0.56 for both English and Italian. These results indicate that separating extraction and verification can balance recall-oriented extraction with precision-oriented validation in CRF population tasks.
This work addresses the temporal ordering task of clinical frames in the Basic Life Support (BLS) subset of ClinSkillQA. A two-stage hybrid pipeline based on Qwen2-VL-2B-Instruct in a zero-shot configuration is proposed. In Stage 1, each image is processed independently to extract factual visual evidence, which is then transformed, using deterministic rules, into a structured representation. In Stage 2, ordering is formulated as an ordinal scoring task over procedural stages, with ties broken using PCA applied to multimodal embeddings. Evaluation followed the official benchmark protocol, considering Task Accuracy, Pairwise Accuracy, and BERTScore. In the test phase, the system achieved Task Accuracy = 0.17, Pairwise Micro Accuracy = 0.60, and BERT F1 = 0.71, with complete coverage in both predictions and rationales. The results demonstrate an interpretable and reproducible foundation, although challenges in fine-grained temporal discrimination remain.
Detecting DMRS defense levels in emotional support dialogues is challenging due to severe class imbalance and fine-grained clinical distinctions between adjacent levels, issues well documented in psychotherapy-oriented NLP surveys (Na et al., 2025). We present zzucs for PsyDefDetect at BioNLP 2026 (Na et al., 2026a), adopting a data–supervision co-design strategy. SCCR applies stratified resampling to balance support across nine defense levels. CoR–QLoRA encodes clinical rubrics, including task contracts, taxonomy definitions, and boundary cues, into static prompts for 8B model fine-tuning. Ablations show SCCR improves macro-F1 by 4.9 points over random oversampling. Our system from team zzucs, submitted on CodaBench under the display name sly_zzu with submission ID 652647, achieves 0.3585 macro-F1 on the official blind-test leaderboard LB1. It ranks 6th of 21 registered teams with official submissions and surpasses all published 8B baselines, exceeding the strongest 8B comparator, Ministral-8B, by 4.4 F1 points. The code has been released at https://github.com/jackssdd/zzucs_psydefdetect_code.
Multimodal Large Language Models (MLLMs) show strong medical visual understanding; however, their capability for continuous perception in procedural clinical workflows remains underexplored. We present Perceive-and-Plan, a decomposed in-context learning paradigm for clinical skill keyframe reordering. The method separates visual perception from temporal planning via two stages: (1) structured visual perception with saliency-guided Picture-in-Picture (PiP) composition that magnifies critical regions (head, chest) as color-coded insets, and (2) temporal reasoning with chain-style self-verification via fresh conversation reset and visual-evidence anchoring (BLS Rules R1-R11). Without parameter updates, our system scores 71.43 overall (2nd place, ClinSkill QA 2026), with 0.86 pairwise accuracy and 1.0 rationale coverage. Structured prompting with visual saliency guidance measurably improves MLLMs’ procedural understanding. Our code is published at https://github.com/NanceTide/clinskillqa-perceive-and-plan.
The ClinSkill QA shared task requires models to recover the temporal order of scrambled clinical keyframes and generate explanations. We propose EvidenceFlow, a structured zero-shot framework based on Qwen2.5-VL that decomposes the task into global overview, local evidence modeling, and ordering decision, with two variants: model-led EvidenceFlow-M and rule-guided EvidenceFlow-R. On the official test set, EvidenceFlow-R achieves better ordering performance, while EvidenceFlow-M produces better explanation quality, revealing a trade-off between ordering stability and rationale generation. EvidenceFlow provides an interpretable zero-shot baseline for clinical keyframe ordering.
This paper describes our system for classifying psychological defense mechanisms in emotional support dialogues using the Defense Mechanism Rating Scales (DMRS), placing second (F1 0.406) among 64 teams. A central insight is that defense mechanisms are defined by what is absent: missing affect, blocked cognition, denied reality. We encode this as an affect-cognition integration spectrum in prompt-level clinical rules, which account for the largest single gain (+11.4pp F1). Our architecture is a multi-phase deliberative council of Gemini 2.5 agents where class-specific advocates rate evidence strength rather than voting, achieving F1 0.382 with no fine-tuning, a top-5 result on its own. We find, however, that the council is confidently wrong about minority classes: 59–80% of stable minority predictions are incorrect, driven by a systematic "L7 attractor" in which emotional content defaults to the majority class. A targeted override ensemble from three fine-tuned Qwen3.5 models applies 16 overrides (+2.4pp), selected by a structured multi-agent system (builder, critic, regression guard) that produced a larger F1 gain in one iteration than 8 prior attempts combined.
We build an ensemble of 10 transformer encoders for the MedExACT 2026 shared task on medical decision span detection. The ensemble is diversified along three training directions: encoder initialization (including domain-adaptive pre-training on clinical text), loss function, and data augmentation with LLM-generated synthetic notes and silver-labeled clinical documents. Greedy forward search selects the combination with the highest validation final score. A BERT-based boundary refiner is applied to the ensemble’s predicted spans to correct offset errors before submission.
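Greedy forward search, as mentioned above, can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' implementation: `score_fn` stands in for whatever validation score the system computes, and the toy scoring function (number of covered validation items minus a per-model penalty) is purely hypothetical.

```python
def greedy_forward_select(candidates, score_fn):
    """Repeatedly add the candidate that yields the best score; stop when no addition improves it."""
    selected = []
    best = float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score every one-model extension of the current ensemble.
        scored = [(score_fn(selected + [m]), m) for m in remaining]
        top_score, top_model = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # No remaining candidate improves the validation score.
        selected.append(top_model)
        remaining.remove(top_model)
        best = top_score
    return selected, best


# Hypothetical toy data: validation items each model labels correctly.
covered = {"m1": {1, 2, 3}, "m2": {3, 4}, "m3": {1, 2}}


def toy_score(models):
    """Hypothetical stand-in for a validation score: coverage minus a size penalty."""
    union = set().union(*(covered[m] for m in models))
    return len(union) - 0.5 * len(models)


chosen, chosen_score = greedy_forward_select(covered, toy_score)
```

The search is greedy, so it evaluates on the order of n² ensembles rather than all 2ⁿ subsets, at the cost of possibly missing a combination whose members only help jointly.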
We describe the Eraserhead system submitted to the PsyDefDetect shared task at BioNLP 2026, which frames psychological defense level detection as a nine-class utterance classification problem over supportive dialogue. Our system is based on Qwen3-14B and combines clinically informed prompt design, per-label oversampling, and careful inference settings for stable prediction. A central challenge of the task is strong class imbalance, with High-Adaptive responses appearing far more often than several minority classes. This makes it easy for models to favor the majority class and achieve reasonable accuracy while performing poorly on rarer categories. To address this, we iteratively adjusted oversampling targets based on error analysis and predicted label distributions across submission rounds. Our final system achieved an official macro F1 of 0.3418 on Leaderboard 1 and 0.3947 on Leaderboard 2, ranking 7th among the 21 registered teams on both leaderboards. We further analyze the main failure modes of the system, especially the difficulty of distinguishing Minor Image Distorting defenses from High-Adaptive responses and the persistent tendency to over-predict the majority class. These findings highlight the broader difficulty of modeling psychological function from text alone.
Detecting levels of psychological defence mechanisms in supportive conversations is inherently ambiguous. In the PsyDefDetect shared task at BioNLP 2026, the eight positive defence categories share surface language and differ only in pragmatic function, and trained raters reach only moderate inter-annotator agreement. On such a task, the decisive lever is not a stronger single model but error independence, since any single representation will waver on the overlapping defence boundaries. We translate this insight into a 9-voter ensemble spanning three orthogonal axes: class granularity (all nine classes for the gatekeeper, only the eight defence classes for the specialists), training method (generative and discriminative), and base model. The system reaches an F1 score of 0.420 on the hidden test set, placing first among 21 registered teams.
We describe our system for the PsyDefDetect shared task at BioNLP 2026, which focuses on classifying help-seeker utterances in multi-turn supportive conversations into nine psychological defense mechanism levels defined by the Defense Mechanism Rating Scales (DMRS). Our approach fine-tunes roberta-base using a composite training objective that combines focal loss, label smoothing, and square-root-dampened class weights to address the severe label imbalance present in the PSYDEFCONV corpus, where the dominant class constitutes 52% of the training data. The input representation is constructed by concatenating up to eight dialogue turns with role-specific tags, separated using RoBERTa’s native </s> tokens, followed by the target utterance marked using a [TARGET] token. Model selection is performed using macro-F1-based early stopping on a stratified 15% validation split, along with cosine learning rate decay for stable optimization. Our best submission achieves an official Leaderboard 1 (positive classes) macro-F1 score of 0.2556, ranking 11th among 21 registered teams.
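A rough sketch of two ingredients of such a composite objective follows. The abstract does not give the exact formulas, so both are assumptions: the class weight is taken as the square root of the usual balanced inverse-frequency weight, N / (K · n_c), and the focal term uses the standard form with a hypothetical focusing parameter `gamma = 2.0`. Label smoothing is omitted, and the toy label counts are invented apart from the 52% majority share.

```python
import math
from collections import Counter


def sqrt_dampened_weights(labels):
    """Assumed formula: sqrt(N / (K * n_c)), a softened inverse-frequency class weight."""
    counts = Counter(labels)
    n_total, n_classes = len(labels), len(counts)
    return {c: math.sqrt(n_total / (n_classes * n_c)) for c, n_c in counts.items()}


def focal_term(p_true, gamma=2.0):
    """Standard focal loss for one example, given the probability of its true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


# Hypothetical toy label distribution with a 52% majority class.
labels = ["majority"] * 52 + ["rare_a"] * 24 + ["rare_b"] * 24
weights = sqrt_dampened_weights(labels)

# Per-example weighted loss for a confidently correct minority prediction.
loss = weights["rare_a"] * focal_term(0.9)
```

The square root narrows the gap between majority and minority weights (here a ratio of about 1.47 instead of 2.17 for plain inverse frequency), which is one common way to avoid over-correcting toward rare classes.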
Extracting medical decisions from discharge summaries is essential for downstream clinical analytics, yet the task remains challenging due to the heterogeneous structure of electronic health records. For the MedExACT track at ACL 2026, we proposed a system that achieved the 4th position. Our approach first applies dynamic section conditioning to capture the contextual dependencies inherent in each document. A transformer backbone is then augmented with category- and section-aware layer mixing, enabling us to fuse global document structure with fine-grained semantic cues. To further improve robustness, we employ an ensemble of instruction-tuned large language models for automatic section extraction, while a fairness-oriented model selection criterion ensures that performance does not degrade on minority demographic subgroups. The resulting system attains a final score of 0.5806 on the held-out test set and demonstrates significant gains over the baseline across all evaluated subpopulations.
Psychological defense mechanisms (PDMs) are unconscious cognitive processes that modulate how individuals perceive and respond to emotional distress. Automatically classifying PDMs from text is clinically valuable but severely hindered by data scarcity and class imbalance, challenges which generative augmentation alone cannot resolve without psychological grounding. In this work, we address these challenges in the PsyDefDetect shared task (BioNLP@ACL 2026) by proposing a context-aware synthetic augmentation framework combined with a hybrid classification model. Our hybrid model integrates contextual language representations with basic clinical features, along with 150 annotated defense items. Experiments demonstrate that definition quality in prompting directly governs generation fidelity and downstream performance. Our method surpasses DMRS Co-Pilot, reaching an accuracy of 58.26% (+40.25%) and a macro-F1 of 24.62% (+15.99%), thereby establishing a strong baseline for psychologically grounded defense mechanism classification in low-resource settings. Source code is available at: https://github.com/htdgv/CASA-PDC.
This paper describes our system for the MedExACT 2026 shared task on extracting and classifying medical decisions from ICU discharge summaries. We frame the task as BIO token classification and train 25 diverse transformer models spanning 13 distinct architectures, including Longformer, DeBERTa, RoBERTa, BioBERT, SciBERT, and others. Each model is trained with category-aware oversampling, focal loss, and demographic-group-aware sampling to address class imbalance and promote fairness across patient subgroups. At inference time, we aggregate predictions via text-normalized majority voting, retaining spans agreed upon by at least 6 of 25 models. Our best submission achieves a final score of 0.5554 on the test set, demonstrating that a simple vote-based ensemble over architecturally diverse models outperforms more complex filtering approaches. We find that architectural diversity is a key driver of ensemble quality and that cross-validation is essential for reliable model selection on small clinical datasets.
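The vote-threshold aggregation described above might look roughly like the sketch below. Only the 6-of-25 threshold comes from the abstract; the span representation (a text and category pair), the normalization rule (lowercasing and whitespace collapsing), and the example spans and category names are all assumptions made for illustration.

```python
from collections import Counter


def normalize(text):
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def vote_spans(model_predictions, min_votes=6):
    """Keep (normalized text, category) spans predicted by at least min_votes models."""
    votes = Counter()
    for spans in model_predictions:
        # A set ensures each model votes at most once per span.
        votes.update({(normalize(text), category) for text, category in spans})
    return sorted(span for span, n in votes.items() if n >= min_votes)


# Hypothetical predictions from 25 models; category names are invented.
predictions = (
    [[("Start  Heparin", "Drug")]] * 6
    + [[("start heparin", "Drug"), ("Discharge home", "Plan")]]
    + [[]] * 18
)
kept = vote_spans(predictions, min_votes=6)
```

In this toy run the heparin span collects seven votes after normalization and is kept, while the single-vote span is dropped. A threshold well below a strict majority favors recall, since a span survives when roughly a quarter of the models agree on it.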
Understanding procedural skills from visual data is a key challenge in medical AI, especially for tasks that require reasoning over temporal sequences. We report on FBK-NLP’s participation at the ClinSkill QA 2026 shared task, which requires models to arrange shuffled key frames into a coherent sequence of clinical actions and provide explanations for the resulting order. We conduct a systematic study of prompting and reasoning strategies using an open and easily deployable vision-language model (VLM). The central finding of our study is that incorporating keypoint-based representations of people’s body parts substantially improves temporal reasoning behind frame ordering. Furthermore, we show that model performance is highly sensitive to prompt design and to seemingly minor factors such as filename ordering and the inclusion of domain information.
Psychological defense mechanisms play a crucial role in shaping human responses during emotionally charged conversations, yet remain underexplored in natural language processing. In this work, we address the PSYDEFCONV shared task, which involves classifying defense mechanisms in multi-turn dialogues using clinically grounded annotations based on the Defense Mechanism Rating Scales (DMRS). We propose a generative supervised fine-tuning framework that reformulates the task as conditional text generation. A pre-trained causal language model (Gemma-2-2B) is adapted using parameter-efficient fine-tuning (PEFT) with 4-bit quantization, enabling efficient training under limited computational resources. To handle class imbalance, we apply random oversampling, and we design a prompt-based input representation to incorporate conversational context effectively. Experimental results demonstrate that our generative approach is competitive with discriminative baselines while offering improved flexibility in modeling subtle and context-dependent defensive behaviors. The findings highlight the potential of generative large language models for psychologically grounded dialogue understanding tasks.
Psychological defense detection is one of the essential present-day challenges in clinical practice. State-of-the-art natural language processing (NLP) tools aim to automate this task. However, their potential and efficiency remain largely unexplored. This manuscript attempts to address this problem from various perspectives: it first explores the efficiency of direct large language model (LLM) prompting. Then, it applies NLP techniques for LLM fine-tuning to the psychological defense classification task. Finally, it attempts to generate states of mind based on the speaker’s psychological state. The results show that the complexity of the task requires further improvement of the software solutions used.
Automating the classification of psychological defense mechanisms is a critical yet challenging frontier in clinical natural language processing. General-purpose Large Language Models (LLMs) struggle to apply fine-grained ordinal frameworks like the Defense Mechanism Rating Scales due to the implicit nature of clinical cues and a fundamental clinical reasoning gap. These models exhibit severe extreme response bias, systematically gravitating toward the scale’s endpoints while failing to resolve nuanced, mid-level defenses. In this paper, we present our third-place system for the PsyDefDetect Shared Task at BioNLP 2026, designed specifically to overcome this failure mode. We propose a hybrid architecture that synergizes label-flattened generative retrieval with an LLM classifier fine-tuned via the distillation of supervised clinical reasoning traces. This dual approach, grounding decisions in rubric criteria while leveraging task-specific supervision, successfully mitigates the observed bias, achieving an accuracy of 67.37% and a macro-F1 of 39.56%. Our work provides empirical evidence that tightly integrating targeted clinical supervision with dynamic rubric-grounded retrieval significantly outperforms the raw parameter scale of un-tuned foundation models.
Detecting psychological defense mechanisms in conversational text remains a challenging clinical NLP problem. For the PsyDefDetect 2026 shared task (9-class utterance classification evaluated via macro F1), our team LinguIUTics achieves a macro F1-score of 0.3917 on the official positive-class leaderboard, ranking 4th out of 21 registered teams and improving over the Ministral-8B task baseline (31.48 macro F1) by +7.7 absolute points (+24.4% relative). BERT-family encoders and zero-shot LLMs proved ineffective on rare classes due to severe class imbalance, leading us to QLoRA fine-tuning of Qwen3-8B. We leverage three key strategies: grouped stratified cross-validation (preventing leakage), minority-class round-robin lexical augmentation, and a post-processing pipeline with logit-bias tuning and ensemble blending. Together, these components close much of the validation–leaderboard gap and substantially improve minority-class recall, driving the critical "Unclear" class (Level 8) from near-zero performance to F1=0.797.
This system paper presents the approach of Team TONI-NLP to the PsyDefDetect 2026 shared task. The objective of the task was to classify utterances from helper–seeker conversations into nine categories: seven labels representing progressively higher levels of defensive maturity, one label indicating the absence of a defense mechanism, and one label for cases requiring additional information. We investigated several modern NLP approaches, including prompt engineering, fine-tuning, hierarchical modeling and classification using text embeddings derived from transformer-based models as well as classical embeddings such as TF-IDF. Our results show that ensemble methods performed best among our submitted systems, achieving a macro-F1 score of 0.320 and ranking 9th in the shared task out of 21 teams.
We present the CanSA system for the MedExACT@ACL 2026 shared task, which requires extracting and classifying clinical decisions from ICU discharge summaries into nine DICTUM categories. We developed three approaches: (1) a training-free system consisting of a preprocessing module that normalizes text and an inference engine combining zero-shot LLMs with a RAG ensemble, (2) a supervised fine-tuning method, and (3) a training-free retrieval-augmented pipeline employing TF–IDF-based lexical retrieval to surface in-context exemplars from the development corpus, combined with section-aware chunking and structured extraction calls to a large language model. Our team’s best submission achieved a Final Score of 0.41, ranking 34th out of 37 on the official test leaderboard.
This paper presents CASPAR, a two-stage approach for the MedExACT shared task on medical decision span extraction and classification from ICU discharge summaries. Stage 1 performs document-level sequence labeling using a sliding-window RoBERTa encoder with BiGRU and CRF to generate candidate spans. Stage 2 applies a lightweight refinement module that revisits each candidate within its surrounding context to revise category assignments and correct span boundaries. The system achieves a final score of 0.5668 on the official leaderboard, substantially outperforming the organizer baseline on span-level F1. In addition to the system description, we provide ablation results, repeated-run validation statistics, and subgroup- and error-level analyses that highlight the challenges of exact boundary recovery and confusion in race-category subgroups in clinical decision extraction.
We present our system for the PsyDefDetect shared task, which focuses on detecting and classifying psychological defense mechanisms in peer emotional support conversations. Our core contribution is a hierarchical classification framework that structures prediction as a coarse-to-fine pipeline over a clinically validated label hierarchy, grounded in the Defense Mechanism Rating Scales (DMRS). Through systematic experimentation with flat fine-tuning, few-shot prompting, and hierarchical classification, we demonstrate that explicitly modelling the structured relationships among defense levels offers a more effective alternative to flat classification, achieving a macro F1 of 0.23 on the official test set.
We propose a hierarchical framework for psychological defense mechanism detection in multi-turn dialogues, integrating large language models, retrieval-augmented generation, and heuristic calibration. Our approach decomposes prediction into coarse-to-fine reasoning stages and incorporates dialogue reconstruction, explanation-enhanced retrieval, and hybrid LLM–supervised filtering to address severe label imbalance and implicit, context-dependent labeling. Experiments on the PsyDefDetect dataset show that LLM-based RAG improves performance on minority and ambiguous classes, achieving a Macro F1 of 0.31, while also revealing persistent challenges in fine-grained discrimination of latent psychological constructs.
Automated extraction of medical decisions from clinical notes is a critical step to constructing more granular patient health trajectories than what is currently obtainable from structured healthcare data. Here we present a system designed for the MedExACT shared task that employs an ensemble of BERT-based classifiers to account for demographic diversity when extracting mentions of medical decisions from MIMIC-III discharge summaries. A simple voting strategy combined with architectural diversity is demonstrated to work best when training data is limited.
This paper presents an ensemble of Qwen3.5-4B language models for extracting medical decisions from discharge summaries in the MedDec dataset. The models were trained to annotate discharge summaries with inline XML-like tags. Three different training strategies were used including dynamic fine-tuning, reinforcement learning, and pseudo-label augmentation. By combining predictions based on inter-model agreement, the system improved performance across evaluation metrics, achieving an overall F1 of 0.5942 and ranking second on the test leaderboard. The results also showed stable performance across demographic groups, suggesting fairness for underrepresented populations.
Detecting psychological defense mechanisms in supportive conversations is essential for assisting mental health practitioners. Natural language processing techniques are increasingly integral to such systems, enabling automated classification of defense levels to better understand help-seeker behavior and resistance patterns. In PsyDefDetect at BioNLP 2026, we address the task of nine-class defense level classification on the PSYDEFCONV corpus. We propose a three-stage pipeline combining LLM-based dialogue summarization, domain-specific transformer fine-tuning, and rule-based ensemble prediction. Additionally, we evaluate three mental health domain-specific transformers (Mental-BERT, Mental-RoBERTa, Mental-XLNet) alongside fine-tuned LLMs (Qwen3-4B, Qwen3-1.7B, Mistral-7B) under different input conditions. Experimental results on the released test-set gold labels show that our ensemble approach achieves the best performance, reaching 34.69% macro F1 and surpassing the baseline by 4.69 percentage points. On the official PsyDefDetect Leaderboard 1 (labels 1–8), the submitted system achieved a Macro-F1 score of 23.46%, ranking 15th out of 21 teams, while on Leaderboard 2 (labels 0–8), it achieved 30.04%, securing 14th place. These findings demonstrate that domain-specific transformers substantially outperform generic LLM fine-tuning on this specialized clinical task.
Extracting structured medical decisions from ICU discharge summaries is hard because of long documents, severe category imbalance across nine DICTUM decision types, and a fairness-aware evaluation that penalizes inconsistent performance across demographic subgroups. We present our system for the MedExACT 2026 shared task (Elgaar et al., 2026), which fine-tunes BiomedBERT with a composite loss combining label-smoothed cross-entropy, a soft token-F1 auxiliary term, and R-Drop regularization. At inference time we apply a deterministic ensemble: half-offset sliding-window augmentation across four window configurations, dual-branch logit aggregation from the same checkpoint, per-category length calibration on the Anchor Branch, and sparse routing of categories 4 and 7 to a context-weighted specialist branch motivated by their unusual span-length distributions. Adding R-Drop improved validation Overall_F1 by 1.24 points over the CE + soft-F1 baseline, with a larger 1.70-point gain on Worst-Group F1. Our best submission achieves Span F1 of 0.4900, Token F1 of 0.6796, and an official Overall_F1 of 0.5724, with the African American subgroup as the Worst-Group bottleneck at Base_Score 0.5601.
Detecting psychological defense mechanisms in therapy dialogue is a clinically valuable but computationally underexplored task. We present our systematic analysis for PsyDefDetect, a shared task at BioNLP@ACL 2026, which frames defense detection as a nine-class utterance-level classification problem based on the Defense Mechanism Rating Scale (DMRS). We systematically evaluate six open-source, instruction-tuned small language models (SLMs, ≤ 9B parameters) in zero-shot and fine-tuning settings, and compare a clinically-grounded prompt against the organizer-provided baseline. Our official submission achieved 59.96% accuracy and 16.28% Macro F1. Post-submission experiments show that fine-tuning combined with 5-fold cross-validation and logit averaging ensemble substantially improves performance, with the best configuration reaching 34.59% Macro F1 and 65.25% accuracy. We find that clinically-grounded prompts outperform bare label definitions, model scale does not consistently improve zero-shot performance, and fine-tuning dramatically recovers even collapsed zero-shot models. Certain defense tiers remain persistently difficult across all settings, pointing to clinical ambiguity at tier boundaries as a more fundamental bottleneck than data imbalance alone.
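The logit-averaging ensemble mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each cross-validation fold yields one logit vector per utterance; the function names `average_logits` and `predict_label` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def average_logits(fold_logits: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the per-fold logit vectors for one utterance."""
    if not fold_logits:
        raise ValueError("need at least one fold")
    n_classes = len(fold_logits[0])
    if any(len(v) != n_classes for v in fold_logits):
        raise ValueError("all folds must score the same number of classes")
    n_folds = len(fold_logits)
    return [sum(v[c] for v in fold_logits) / n_folds for c in range(n_classes)]


def predict_label(fold_logits: Sequence[Sequence[float]]) -> int:
    """Index of the class with the highest averaged logit."""
    avg = average_logits(fold_logits)
    return max(range(len(avg)), key=avg.__getitem__)


# Toy example with three folds and three classes (values are made up).
folds = [
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 0.0],
    [0.5, 0.5, 2.0],
]
label = predict_label(folds)
```

Averaging logits rather than taking a majority vote lets a fold that is confident about one class outweigh folds that are nearly undecided, which is one plausible reason to prefer it when the folds disagree.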
This paper describes the system submitted by team Aurum to the Medical Decision Extraction, Analysis, and Classification Task (MedExACT) at BioNLP 2026. The task requires the extraction and classification of contiguous text spans representing medical decisions from lengthy ICU discharge summaries. To address the dual challenges of long document lengths and severe class imbalance within a limited training set of 350 notes, we propose a two-pronged strategy. First, we employ a tripartite data augmentation pipeline utilizing rule-based entity replacement, LLM-based contextual paraphrasing, and synthetic note generation to expand the training data to over 2,300 notes. Second, we fine-tune a domain-specific Clinical Longformer model equipped with a sliding-window inference mechanism and Focal Loss to handle sequences up to 2,048 tokens while focusing on rare decision categories. Paired with a targeted post-processing module, our system achieved a Final Score of 0.5251, demonstrating high token-level detection (Token F1: 0.6311) and strong stability across patient demographics.
This paper describes the system developed for the Medical Visual Answer Localization (MVAL) task at MedGenVidQA 2026. Accurately locating surgical or instructional steps in medical videos is inherently challenging due to audio-visual asynchrony and the visual homogeneity of surgical scenes. We propose a Cascade Multi-modal Alignment Framework that integrates Large Language Models (LLMs) to bridge the semantic-temporal gap. Our pipeline utilizes WhisperX for word-level speech transcription to ensure precise textual anchoring. We then employ Gemini3 as a high-level semantic ranker to generate multi-scale textual priors. Crucially, we transform these discrete semantic scores into a continuous 1D Gaussian Soft Prior, which is injected as an attention bias into our cross-modal fusion network. This mechanism preserves global temporal context while guiding the model to focus on query-relevant frames. Our system achieves highly competitive performance on the validation leaderboard, particularly under strict evaluation metrics, reaching an IoU@0.7 of 67.5%.
This paper presents an approach to localizing visual answers within continuous medical videos using a multi-step multimodal generation pipeline with the MedGenVidQA dataset. We frame visual answer localization as a multimodal fusion problem, integrating raw video, timestamped ASR transcripts, and VLM-generated scene descriptions into structured contextual blocks, enabling the model to cross-reference spoken commentary against observable physical events. We show that targeted guidance, which forces the model to treat audio transcripts as supplementary hints with observable visual movements, significantly outperforms baseline approaches. It achieves state-of-the-art performance on the test leaderboard, yielding an mIoU of 79.55, alongside IoU@0.3, IoU@0.5, and IoU@0.7 scores of 93.75, 90.00, and 77.50, respectively. Our findings highlight the effectiveness of combining multimodal context fusion with targeted guidance to overcome text bias, establishing a promising approach for achieving the micro-level precision required in the medical domain. We release our code on GitHub at https://github.com/biodatlab/medgenvidqa-lamar.
This paper presents a system for Task A of the MedGenVidQA 2026 shared task, which requires simultaneously retrieving relevant PubMed documents and medical videos for 60 consumer health topics. The core contribution is a unified multi-stage pipeline that treats video and document retrieval as complementary rather than independent problems. For video retrieval, the system fine-tunes a PubMedBERT bi-encoder on 2,710 MedVidQA training samples using BM25-driven hard negative mining. Video transcripts (833 unique videos) are segmented into overlapping 30-second temporal chunks with a 10-second stride, producing 32,489 indexed chunks. At query time, T5-based query expansion generates enriched queries for BM25 sparse retrieval, while the original query drives FAISS dense retrieval. The two ranked lists are fused via weighted Reciprocal Rank Fusion (RRF, dense weight 0.75, sparse weight 0.25), and a cross-encoder (MiniLM-L-6-v2) re-ranks the top-200 fused candidates to produce the final top-10 videos. For document retrieval, the NCBI PubMed ESearch API is queried using a progressive keyword fallback chain with exponential backoff, ensuring full topic coverage. The system achieves a MAP of 0.3898, Recall@10 of 0.8449, and NDCG@10 of 0.1079, with complete 60/60 topic coverage across both retrieval modalities. Key limitations include reliance solely on transcript text for video retrieval (no visual or audio features) and dependence on a live API for document retrieval.
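The overlapping chunking and weighted RRF steps described in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the smoothing constant `k = 60` is a commonly used RRF default that the abstract does not specify, the handling of the final partial window is a guess, and the names `chunk_windows` and `weighted_rrf` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def chunk_windows(duration_s: float, window_s: float = 30.0,
                  stride_s: float = 10.0) -> List[Tuple[float, float]]:
    """Overlapping (start, end) windows covering a video of the given length."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    windows = []
    start = 0.0
    while True:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:  # assumption: stop once the end is covered
            break
        start += stride_s
    return windows


def weighted_rrf(dense: List[str], sparse: List[str],
                 w_dense: float = 0.75, w_sparse: float = 0.25,
                 k: int = 60) -> List[str]:
    """Fuse two ranked lists; each list adds weight / (k + rank) per item."""
    scores: Dict[str, float] = defaultdict(float)
    for weight, ranking in ((w_dense, dense), (w_sparse, sparse)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort by descending score, breaking ties by identifier for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


windows = chunk_windows(50.0)
fused = weighted_rrf(dense=["a", "b", "c"], sparse=["c", "b", "d"])
```

In this toy run, item `b` is ranked second by both retrievers and ends up first after fusion, ahead of `a`, which only the dense list returns.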
This paper describes the Pride-Boiler system submitted to MedGenVidQA 2026 Shared Task A, which asks for retrieving relevant PubMed articles and medical instructional videos in response to consumer health queries. Our approach pairs Pyserini BM25 retrieval with LLM-driven query rewriting and a corrective self-verification loop inspired by the Corrective Retrieval-Augmented Generation (CRAG) paradigm. Given a consumer query, the pipeline first asks Google Gemini to generate two clinically optimized search texts, one targeting PubMed abstracts with MeSH terms and clinical synonyms, and another targeting video subtitles with procedural action language. BM25 retrieves a broad candidate pool, and Gemini then scores each candidate against the original query, blending its relevance judgment with the normalized lexical signal. A quality grader assesses the top results: if they are judged insufficient, the pipeline triggers a corrective cycle with reformulated terminology and retries up to three attempts. The entire workflow is orchestrated as a LangGraph state machine. In the official shared task evaluation, Pride-Boiler ranked first among all participating systems on PubMed article retrieval, achieving an nDCG of 0.6532 and MAP of 0.5550, both exceeding the organizer-provided Text-RR baseline. On video (text) retrieval, our system achieves 0.5304 in MAP and 0.5927 in nDCG, outperforming other systems but falling below the baseline, indicating the structural limitations of lexical matching over noisy subtitle text. We release the pipeline code to support reproducibility on GitHub at https://github.com/basilll007/BioNLP.
Medical visual answer localization requires identifying the temporal span in a video where a medical question is answered or visually explained. We present a simple retrieval-and-selection pipeline for Task C that treats visual answer localization as segment-level answer paragraph selection over timestamped video transcripts. Given a question and a segmented transcript, our system prompts DeepSeek to select a contiguous range of transcript segments rather than directly generating timestamps. The final start and end times are then computed deterministically from the selected segment boundaries, decreasing the risk of hallucinated or malformed temporal outputs. To support long videos, we apply overlapping sliding-window prompting and rank candidate ranges using lexical question. In a 20-sample sanity check on the test dataset, a completeness-biased configuration achieved an mIoU of 0.3217, while a shorter duration-penalized configuration improved performance to 0.4815. These results suggest that constrained LLM-based segment selection, combined with deterministic timestamp extraction, is a practical baseline for medical visual answer localization.
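The deterministic timestamp step in the abstract above lends itself to a short sketch. It assumes the LLM returns two segment indices and that each transcript segment carries its own start and end time; the `Segment` type and the `range_to_timestamps` function are hypothetical names introduced here for illustration.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


def range_to_timestamps(segments: List[Segment], first_idx: int,
                        last_idx: int) -> Tuple[float, float]:
    """Map a selected contiguous segment range to (start, end) in seconds.

    Out-of-range or inverted selections are rejected instead of being
    repaired, so a malformed model answer cannot yield a bogus interval.
    """
    if not (0 <= first_idx <= last_idx < len(segments)):
        raise ValueError("selected range is outside the transcript")
    return segments[first_idx].start, segments[last_idx].end


# Toy transcript; texts and times are made up.
transcript = [
    Segment(0.0, 4.0, "Welcome to the video."),
    Segment(4.0, 9.5, "Place the cuff on the upper arm."),
    Segment(9.5, 15.0, "Inflate until the pulse disappears."),
]
span = range_to_timestamps(transcript, 1, 2)
```

Because the model only chooses indices, every returned time is guaranteed to coincide with an existing segment boundary.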
MedGenVidQA 2026 Task C evaluates visual answer localization in medical videos. The system receives a video and a question, then returns the start and end time of the visual answer. Our framework used timestamped automatic speech recognition (ASR) as a proposal source rather than as a final boundary label. The framework generated transcript tables, phase maps, lexical and dense candidate windows, schema-constrained ranking inputs, selective key-frame checks, and a deterministic validation pass for the final JSON file. The ranker selected among bounded candidate intervals instead of generating arbitrary timestamps over a full transcript. Each output can be traced to segment identifiers, candidate source families, selected anchors, phase labels, and validation flags. Our best run ranked fifth among six participant systems, with 62.50 IoU@0.3, 36.25 IoU@0.5, 22.50 IoU@0.7, and 42.57 mIoU. The threshold pattern suggests that coarse temporal retrieval was more reliable than strict start-end localization.
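For readers unfamiliar with the localization metrics reported in the Task C abstracts above, the sketch below shows one common way to compute temporal IoU, mIoU, and IoU@threshold. It assumes that IoU@t is the percentage of predictions whose IoU with the gold interval is at least t; the official scorer may differ in details such as the comparison operator, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Interval, gold: Interval) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def summarize(preds: List[Interval], golds: List[Interval],
              thresholds: Sequence[float] = (0.3, 0.5, 0.7)) -> Dict[str, float]:
    """Mean IoU and share of predictions at or above each threshold, in percent."""
    if len(preds) != len(golds) or not preds:
        raise ValueError("need equally sized, non-empty prediction and gold lists")
    ious = [temporal_iou(p, g) for p, g in zip(preds, golds)]
    report = {"mIoU": 100.0 * sum(ious) / len(ious)}
    for t in thresholds:
        report[f"IoU@{t}"] = 100.0 * sum(i >= t for i in ious) / len(ious)
    return report


# Toy example: one partially overlapping prediction and one exact match.
report = summarize(preds=[(10.0, 20.0), (0.0, 10.0)],
                   golds=[(15.0, 25.0), (0.0, 10.0)])
```

Under this definition a run can score well at IoU@0.3 and poorly at IoU@0.7 when its predictions overlap the answer but miss its exact boundaries.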