pdf
bib
ACL 2025
Dina Demner-Fushman
|
Sophia Ananiadou
|
Makoto Miwa
|
Junichi Tsujii
pdf
bib
abs
Understanding the Impact of Confidence in Retrieval Augmented Generation: A Case Study in the Medical Domain
Shintaro Ozaki
|
Yuta Kato
|
Siyuan Feng
|
Masayo Tomita
|
Kazuki Hayashi
|
Wataru Hashimoto
|
Ryoma Obara
|
Masafumi Oyamada
|
Katsuhiko Hayashi
|
Hidetaka Kamigaito
|
Taro Watanabe
Retrieval Augmented Generation (RAG) complements the knowledge of Large Language Models (LLMs) by leveraging external information to enhance response accuracy for queries. This approach is widely applied in several fields because of its ability to inject the most up-to-date information, and researchers are focusing on understanding and improving this aspect to unlock the full potential of RAG in high-stakes applications. However, despite the potential of RAG to address these needs, the mechanisms behind the confidence levels of its outputs remain underexplored. Our study focuses on the impact of RAG, specifically examining whether RAG increases the confidence of LLM outputs in the medical domain. We conduct this analysis across various configurations and models. We evaluate confidence by treating the model’s predicted probability as its output and calculating several evaluation metrics, including calibration error, entropy, best probability, and accuracy. Experimental results across multiple datasets confirm that certain models possess the capability to judge for themselves whether an inserted document relates to the correct answer. These results suggest that evaluating models based on their output probabilities can determine whether they are suitable to function as generators in the RAG framework. Our approach makes it possible to evaluate whether models properly handle the retrieved documents.
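As a rough illustration of the confidence metrics listed in this abstract (a sketch under our own assumptions, not the authors’ implementation), the snippet below computes accuracy, mean best probability, mean entropy, and expected calibration error from a model’s predicted class probabilities.

```python
# Illustrative sketch only: confidence metrics over predicted class probabilities.
import numpy as np

def confidence_metrics(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10):
    """probs: (n_samples, n_classes) predicted probabilities; labels: (n_samples,) gold class ids."""
    best_prob = probs.max(axis=1)                     # confidence of the top prediction
    preds = probs.argmax(axis=1)
    correct = (preds == labels).astype(float)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)

    # Expected calibration error: bin by confidence, compare accuracy vs. confidence per bin.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (best_prob > lo) & (best_prob <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - best_prob[in_bin].mean())

    return {
        "accuracy": correct.mean(),
        "best_probability": best_prob.mean(),
        "entropy": entropy.mean(),
        "ece": ece,
    }
```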
pdf
bib
abs
Effect of Multilingual and Domain-adapted Continual Pre-training on Few-shot Promptability
Ken Yano
|
Makoto Miwa
Continual Pre-training (CPT) can help pre-trained large language models (LLMs) effectively adapt to new or under-trained domains or low-resource languages without re-training from scratch. Nevertheless, during CPT, the model’s few-shot transfer ability is known to be affected for emergent tasks. We verified this by comparing performance between the few-shot and fine-tuning settings on the same tasks. We used Llama3-ELAINE-medLLM, which was continually pre-trained on Llama3-8B, targeted at the biomedical domain, and adapted to multiple languages (English, Japanese, and Chinese). We compared the performance of Llama3-ELAINE-medLLM and Llama3-8B on three emergent tasks: two related in-domain tasks, named entity recognition (NER) and machine translation (MT), and one out-of-domain task, summarization (SUM). Our experimental results show that degradation in few-shot transfer ability does not necessarily indicate the model’s underlying potential on the same task after fine-tuning.
pdf
bib
abs
MedSummRAG: Domain-Specific Retrieval for Medical Summarization
Guanting Luo
|
Yuki Arase
Medical text summarization faces significant challenges due to the complexity and domain-specific nature of the language. Although large language models have achieved significant success in general domains, their effectiveness in the medical domain remains limited. This limitation stems from their insufficient understanding of domain-specific terminology and difficulty in interpreting complex medical relationships, which often results in suboptimal summarization quality. To address these challenges, we propose MedSummRAG, a novel retrieval-augmented generation (RAG) framework that integrates external knowledge to enhance summarization. Our approach employs a fine-tuned dense retriever, trained with contrastive learning, to retrieve relevant documents for medical summarization. The retrieved documents are then integrated with the input text to generate high-quality summaries. Experimental results show that MedSummRAG achieves significant improvements in ROUGE scores on both zero/few-shot and fine-tuned language models, outperforming baseline methods. These findings underscore the importance of RAG and domain adaptation of the retriever for medical text summarization. The source code of this paper can be obtained from: https://github.com/guantingluo98/MedSummRAG
pdf
bib
abs
Enhancing Stress Detection on Social Media Through Multi-Modal Fusion of Text and Synthesized Visuals
Efstathia Soufleri
|
Sophia Ananiadou
Social media platforms generate an enormous volume of multi-modal data, yet stress detection research has predominantly relied on text-based analysis. In this work, we propose a novel framework that integrates textual content with synthesized visual cues to enhance stress detection. Using the generative model DALL·E, we synthesize images from social media posts, which are then fused with text through the multi-modal capabilities of a pre-trained CLIP model. Our approach is evaluated on the Dreaddit dataset, where a classifier trained on frozen CLIP features achieves 94.90% accuracy, and full fine-tuning further improves performance to 98.41%. These results underscore that integrating synthesized visuals with textual data not only enhances stress detection but also offers a more robust method than traditional text-only approaches, paving the way for innovative approaches in mental health monitoring and social media analytics.
pdf
bib
abs
Fine-tuning LLMs to Extract Epilepsy Seizure Frequency Data from Health Records
Ben Holgate
|
Joe Davies
|
Shichao Fang
|
Joel Winston
|
James Teo
|
Mark Richardson
We developed a new methodology for extracting the frequency of a patient’s epilepsy seizures from unstructured, free-text outpatient clinic letters by: first, devising a singular unit of measurement for seizure frequency; and second, fine-tuning a generative Large Language Model (LLM) on our bespoke annotated dataset. We measured frequency by the number of seizures per month: one seizure or more is expressed as an integer, and less than one as a decimal. This approach enables us to track whether a patient’s seizures are improving or not over time. We found that fine-tuning improves the F1 score of our best-performing LLM, Ministral-8B-Instruct-2410, by around a factor of three compared to an untrained model. We also found that Ministral demonstrated an impressive ability for mathematical reasoning.
pdf
bib
abs
AdaBioBERT: Adaptive Token Sequence Learning for Biomedical Named Entity Recognition
Sumit Kumar
|
Tanmay Basu
Accurate identification and labeling of biomedical entities, such as diseases, genes, chemicals, and species, within scientific texts is crucial for understanding complex relationships. We propose Adaptive BERT, or AdaBioBERT, a robust named entity recognition (NER) model that builds upon BioBERT (Biomedical Bidirectional Encoder Representations from Transformers) with an adaptive loss function to learn different types of biomedical token sequences. This adaptive loss function combines the standard Cross Entropy (CE) loss and Conditional Random Field (CRF) loss to optimize both token-level accuracy and sequence-level coherence. AdaBioBERT captures rich semantic nuances by leveraging pre-trained contextual embeddings from BioBERT. The CRF loss ensures proper identification of complex multi-token biomedical entities in a sequence, while the CE loss captures simple unigram entities. Empirical analysis on multiple standard biomedical corpora demonstrates that AdaBioBERT outperforms the state of the art on most of the datasets in terms of macro- and micro-averaged F1 score.
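The adaptive loss described above can be pictured as a weighted sum of a token-level cross-entropy term and a sequence-level CRF negative log-likelihood. The sketch below is a minimal, hypothetical rendering of that idea; the mixing weight `alpha` and the use of the `pytorch-crf` package are our assumptions, not details from the paper.

```python
# Illustrative sketch, not the AdaBioBERT implementation.
import torch.nn as nn
from torchcrf import CRF  # pytorch-crf package (assumption)

class AdaptiveNERLoss(nn.Module):
    def __init__(self, num_tags: int, alpha: float = 0.5):
        super().__init__()
        self.crf = CRF(num_tags, batch_first=True)
        self.ce = nn.CrossEntropyLoss()
        self.alpha = alpha  # hypothetical trade-off between CE and CRF terms

    def forward(self, emissions, tags, mask):
        # emissions: (batch, seq_len, num_tags) scores from a BioBERT token classifier
        # tags:      (batch, seq_len) gold label ids; mask: (batch, seq_len) bool
        ce_loss = self.ce(emissions[mask], tags[mask])                       # unigram entities
        crf_loss = -self.crf(emissions, tags, mask=mask, reduction="mean")   # multi-token entities
        return self.alpha * ce_loss + (1.0 - self.alpha) * crf_loss
```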
pdf
bib
abs
Transformer-Based Medical Statement Classification in Doctor-Patient Dialogues
Farnod Bahrololloomi
|
Johannes Luderschmidt
|
Biying Fu
The classification of medical statements in German doctor-patient interactions presents significant challenges for automated medical information extraction, particularly due to complex domain-specific terminology and the limited availability of specialized training data. To address this, we introduce a manually annotated dataset specifically designed for distinguishing medical from non-medical statements. This dataset incorporates the nuances of German medical terminology and provides a valuable foundation for further research in this domain. We systematically evaluate Transformer-based models and multimodal embedding techniques, comparing them against traditional embedding-based machine learning (ML) approaches and domain-specific models such as medBERT.de. Our empirical results show that Transformer-based architectures, such as the Sentence-BERT model combined with a support vector machine (SVM), achieve the highest accuracy of 79.58% and a weighted F1-score of 78.81%, demonstrating an average performance improvement of up to 10% over domain-specific counterparts. Additionally, we highlight the potential of lightweight ML models for resource-efficient deployment on mobile devices, enabling real-time medical information processing in practical settings. These findings emphasize the importance of embedding selection for optimizing classification performance in the medical domain and establish a robust foundation for the development of advanced, domain-adapted German language models.
pdf
bib
abs
PreClinIE: An Annotated Corpus for Information Extraction in Preclinical Studies
Simona Doneva
|
Hanna Hubarava
|
Pia Härvelid
|
Wolfgang Zürrer
|
Julia Bugajska
|
Bernard Hild
|
David Brüschweiler
|
Gerold Schneider
|
Tilia Ellendorff
|
Benjamin Ineichen
Animal research, sometimes referred to as preclinical research, plays a vital role in bridging the gap between basic science and clinical applications. However, the rapid increase in publications and the complexity of reported findings make it increasingly difficult for researchers to extract and assess relevant information. While automation through natural language processing (NLP) holds great potential for addressing this challenge, progress is hindered by the absence of high-quality, comprehensive annotated resources specific to preclinical studies. To fill this gap, we introduce PreClinIE, a fully open manually annotated dataset. The corpus consists of abstracts and methods sections from 725 publications, annotated for study rigor indicators (e.g., random allocation) and other study characteristics (e.g., species). We describe the data collection and annotation process, outlining the challenges of working with preclinical literature. By providing this resource, we aim to accelerate the development of NLP tools that enhance literature mining in preclinical research.
pdf
bib
abs
Benchmarking zero-shot biomedical relation triplet extraction across language model architectures
Frederik Gade
|
Ole Lund
|
Marie Lisandra Mendoza
Many language models (LMs) in the literature claim excellent zero-shot and/or few-shot capabilities for named entity recognition (NER) and relation extraction (RE) tasks and assert their ability to generalize beyond their training datasets. However, these claims have yet to be tested across different model architectures. This paper presents a performance evaluation of zero-shot relation triplet extraction (RTE; NER followed by RE of the extracted entities) for both small and large LMs, utilizing 13,867 texts from 61 biomedical corpora and encompassing 151 unique entity types. This comprehensive evaluation offers valuable insights into the practical applicability and performance of LMs within the intricate domain of biomedical relation triplet extraction, highlighting their effectiveness in managing a diverse range of relations and entity types. Gemini 1.5 Pro, the largest LM included in the study, was the top-performing zero-shot model, achieving an average partial-match micro F1 of 0.492 for NER, followed closely by SciLitLLM 1.5 14B with a score of 0.475. Fine-tuned models generally outperformed others on the corpora they were trained on, even in a few-shot setting, but struggled to generalize across all datasets with similar entity types. No model achieved an F1 score above 0.5 for the RTE task on any dataset, and scores fluctuated based on the specific class of entity and the dataset involved. This observation highlights that there is still considerable room for improvement in the zero-shot utility of LMs for biomedical RTE applications.
pdf
bib
abs
RadQA-DPO: A Radiology Question Answering System with Encoder-Decoder Models Enhanced by Direct Preference Optimization
Md Sultan Al Nahian
|
Ramakanth Kavuluru
Extractive question answering over clinical text is a crucial need to help deal with the deluge of clinical text generated in hospitals. While encoder models (e.g., BERT) have been popular for this reading comprehension–style question answering task, encoder-decoder models (e.g., T5) have recently been on the rise. There is also the emergence of preference optimization techniques to align decoder-only LLMs with human preferences. In this paper, we combine encoder-decoder models with the direct preference optimization (DPO) method for the RadQA radiology question answering task. Our approach achieves a 12–15 F1 point improvement over previous state-of-the-art models. To the best of our knowledge, this effort is the first to show that the DPO method also works for reading comprehension via novel heuristics to generate preference data without human inputs.
pdf
bib
abs
Gender-Neutral Large Language Models for Medical Applications: Reducing Bias in PubMed Abstracts
Elizabeth Schaefer
|
Kirk Roberts
This paper presents a pipeline for mitigating gender bias in large language models (LLMs) used in medical literature by neutralizing gendered occupational pronouns. A set of 379,000 PubMed abstracts from 1965-1980 was processed to identify and modify pronouns tied to professions. We developed a BERT-based model, Modern Occupational Bias Elimination with Refined Training, or MOBERT, trained on these neutralized abstracts, and compared it with 1965BERT, trained on the original dataset. MOBERT achieved a 70% inclusive replacement rate, while 1965BERT reached only 4%. A further analysis of MOBERT revealed that pronoun replacement accuracy correlated with the frequency of occupational terms in the training data. We propose expanding the dataset and refining the pipeline to improve performance and ensure more equitable language modeling in medical applications.
pdf
bib
abs
Error Detection in Medical Note through Multi Agent Debate
Abdine Maiga
|
Anoop Shah
|
Emine Yilmaz
Large Language Models (LLMs) have approached human-level performance in text generation and summarization, yet their application in clinical settings remains constrained by potential inaccuracies that could lead to serious consequences. This work addresses the critical safety weaknesses in medical documentation systems by focusing on detecting subtle errors that require specialized medical expertise. We introduce a novel multi-agent debating framework that achieves 78.8% accuracy on medical error detection, significantly outperforming both single-agent approaches and previous multi-agent systems. Our framework leverages specialized LLM agents with asymmetric access to complementary medical knowledge sources (Mayo Clinic and WebMD), engaging them in structured debate to identify inaccuracies in clinical notes. A judge agent evaluates these arguments based solely on their medical reasoning quality, with agent-specific performance metrics incorporated as feedback for developing situation-specific trust models.
pdf
bib
abs
Accelerating Cross-Encoders in Biomedical Entity Linking
Javier Sanz-Cruzado
|
Jake Lever
Biomedical entity linking models disambiguate mentions in text by matching them with unique biomedical concepts. This problem is commonly addressed using a two-stage pipeline comprising an inexpensive candidate generator, which filters a subset of suitable entities for a mention, and a costly but precise reranker that provides the final matching between the mention and the concept. With the goal of applying two-stage entity linking at scale, we explore the construction of effective cross-encoder reranker models, capable of scoring multiple mention-entity pairs simultaneously. Through experiments on four entity linking datasets, we show that our cross-encoder models provide between 2.7 and 36.97 times faster training and between 3.42 and 26.47 times faster inference than a base cross-encoder model capable of scoring only one entity, while achieving similar accuracy (differences between -3.42% and 2.76% in Acc@1).
pdf
bib
abs
Advancing Biomedical Claim Verification by Using Large Language Models with Better Structured Prompting Strategies
Siting Liang
|
Daniel Sonntag
In this work, we propose a structured four-step prompting strategy that explicitly guides large language models (LLMs) through (1) claim comprehension, (2) evidence analysis, (3) intermediate conclusion, and (4) entailment decision-making to improve the accuracy of biomedical claim verification. This strategy leverages compositional and human-like reasoning to enhance logical consistency and factual grounding, reduce reliance on memorizing few-shot exemplars, and help LLMs generalize reasoning patterns across different biomedical claim verification tasks. Through extensive evaluation on biomedical NLI benchmarks, we analyze the individual contributions of each reasoning step. Our findings demonstrate that comprehension, evidence analysis, and intermediate conclusion each play distinct yet complementary roles. Systematic prompting and carefully designed step-wise instructions not only unlock the latent cognitive abilities of LLMs but also enhance interpretability by making it easier to trace errors and understand the model’s reasoning process. Our research aims to improve the reliability of AI-driven biomedical claim verification.
pdf
bib
abs
A Retrieval-Based Approach to Medical Procedure Matching in Romanian
Andrei Niculae
|
Adrian Cosma
|
Emilian Radoi
Accurately mapping medical procedure names from healthcare providers to standardized terminology used by insurance companies is a crucial yet complex task. Inconsistencies in naming conventions lead to misclassified procedures, causing administrative inefficiencies and insurance claim problems in private healthcare settings. Many companies still rely on manual mapping by human staff, even though there is a clear opportunity for automation. This paper proposes a retrieval-based architecture leveraging sentence embeddings for medical name matching in the Romanian healthcare system. This challenge is significantly more difficult in underrepresented languages such as Romanian, where existing pretrained language models lack domain-specific adaptation to medical text. We evaluate multiple embedding models, including Romanian, multilingual, and medical-domain-specific representations, to identify the most effective solution for this task. Our findings contribute to the broader field of medical NLP for low-resource languages such as Romanian.
pdf
bib
abs
Improving Barrett’s Oesophagus Surveillance Scheduling with Large Language Models: A Structured Extraction Approach
Xinyue Zhang
|
Agathe Zecevic
|
Sebastian Zeki
|
Angus Roberts
Gastroenterology (GI) cancer surveillance scheduling relies on extracting structured data from unstructured clinical texts, such as endoscopy and pathology reports. Traditional Natural Language Processing (NLP) models have been employed for this task, but recent advancements in Large Language Models (LLMs) present a new opportunity for automation without requiring extensive labeled datasets. In this study, we propose an LLM-based entity extraction and rule-based decision support framework for Barrett’s Oesophagus (BO) surveillance timing prediction. Our approach processes endoscopy and pathology reports to extract clinically relevant information and structures it into a standardised format, which is then used to determine appropriate surveillance intervals. We evaluate multiple state-of-the-art LLMs on real-world clinical datasets from two hospitals, assessing their performance in terms of accuracy and running-time cost. The results demonstrate that LLMs, particularly Phi-4 and (DeepSeek-distilled) Qwen-2.5, can effectively automate the extraction of BO surveillance-related information with high accuracy, while Phi-4 is also efficient during inference. We also compare the trade-offs between LLMs and fine-tuned non-LLMs. Our findings indicate that LLM-based extraction methods can support clinical decision-making by providing justifications from report extractions, reducing manual workload, and improving guideline adherence in BO surveillance scheduling.
pdf
bib
abs
Prompting Large Language Models for Italian Clinical Reports: A Benchmark Study
Livia Lilli
|
Carlotta Masciocchi
|
Antonio Marchetti
|
Giovanni Arcuri
|
Stefano Patarnello
Large Language Models (LLMs) have significantly impacted medical Natural Language Processing (NLP), enabling automated information extraction from unstructured clinical texts. However, selecting the most suitable approach requires careful evaluation of different model architectures, such as generative LLMs and BERT-based models, along with appropriate adaptation strategies, including prompting techniques or fine-tuning. Several studies have explored different LLM implementations, highlighting their effectiveness in the medical domain, including complex diagnostic patterns such as those found in rheumatology. However, their application to Italian remains limited, serving as a key example of the broader gap in non-English language research. In this study, we present a task-specific benchmark analysis comparing generative LLMs and BERT-based models on real-world Italian clinical reports. We evaluated zero-shot prompting, in-context learning (ICL), and fine-tuning across eight diagnostic categories in the rheumatology area. Results show that ICL improves performance over zero-shot prompting, particularly for the Mixtral and Gemma models. Overall, BERT fine-tuning yields the highest performance, while ICL outperforms BERT in specific diagnoses, such as renal and systemic, suggesting that prompting can be a viable alternative when labeled data is scarce.
pdf
bib
abs
QoLAS: A Reddit Corpus of Health-Related Quality of Life Aspects of Mental Disorders
Lynn Greschner
|
Amelie Wührl
|
Roman Klinger
Quality of Life (QoL) refers to a person’s subjective perception of various aspects of their life. For medical practitioners, it is one of the most important concepts for treatment decisions. Therefore, it is essential to understand in which aspects a medical condition affects a patient’s subjective perception of their life. With this paper, we focus on the under-resourced domain of mental health-related QoL, and contribute the first corpus to study and model this concept: We (1) annotate 240 Reddit posts with a set of 11 QoL aspects (such as ‘independence’, ‘mood’, or ‘relationships’) and their sentiment polarity. Based on this novel corpus, we (2) evaluate a pipeline to detect QoL mentions and classify them into aspects using open-domain aspect-based sentiment analysis. We find that users frequently discuss health-related QoL in their posts, focusing primarily on the aspects ‘relationships’ and ‘self-image’. Our method reliably predicts such mentions and their sentiment; however, detecting fine-grained individual aspects remains challenging. An analysis of a large corpus of automatically labeled data reveals that social media content contains novel aspects pertinent to patients that are not covered by existing QoL taxonomies.
pdf
bib
abs
LLMs as Medical Safety Judges: Evaluating Alignment with Human Annotation in Patient-Facing QA
Yella Diekmann
|
Chase Fensore
|
Rodrigo Carrillo-Larco
|
Eduard Castejon Rosales
|
Sakshi Shiromani
|
Rima Pai
|
Megha Shah
|
Joyce Ho
The increasing deployment of LLMs in patient-facing medical QA raises concerns about the reliability and safety of their responses. Traditional evaluation methods rely on expert human annotation, which is costly, time-consuming, and difficult to scale. This study explores the feasibility of using LLMs as automated judges for medical QA evaluation. We benchmark LLMs against human annotators across eight qualitative safety metrics and introduce adversarial question augmentation to assess LLMs’ robustness in evaluating medical responses. Our findings reveal that while LLMs achieve high accuracy in objective metrics such as scientific consensus and grammaticality, they struggle with more subjective categories like empathy and extent of harm. This work contributes to the ongoing discussion on automating safety assessments in medical AI and informs the development of more reliable evaluation methodologies.
pdf
bib
abs
Effective Multi-Task Learning for Biomedical Named Entity Recognition
João Ruano
|
Gonçalo Correia
|
Leonor Barreiros
|
Afonso Mendes
Biomedical Named Entity Recognition presents significant challenges due to the complexity of biomedical terminology and inconsistencies in annotation across datasets. This paper introduces SRU-NER (Slot-based Recurrent Unit NER), a novel approach designed to handle nested named entities while integrating multiple datasets through an effective multi-task learning strategy. SRU-NER mitigates annotation gaps by dynamically adjusting loss computation to avoid penalizing predictions of entity types absent in a given dataset. Through extensive experiments, including a cross-corpus evaluation and human assessment of the model’s predictions, SRU-NER achieves competitive performance in biomedical and general-domain NER tasks, while improving cross-domain generalization.
pdf
bib
abs
Can Large Language Models Classify and Generate Antimicrobial Resistance Genes?
Hyunwoo Yoo
|
Haebin Shin
|
Gail Rosen
This study explores the application of generative Large Language Models (LLMs) in DNA sequence analysis, highlighting their advantages over encoder-based models like DNABERT2 and Nucleotide Transformer. While encoder models excel in classification, they struggle to integrate external textual information. In contrast, generative LLMs can incorporate domain knowledge, such as BLASTn annotations, to improve classification accuracy even without fine-tuning. We evaluate this capability on antimicrobial resistance (AMR) gene classification, comparing generative LLMs with encoder-based baselines. Results show that LLMs significantly enhance classification when supplemented with textual information. Additionally, we demonstrate their potential in DNA sequence generation, further expanding their applicability. Our findings suggest that LLMs offer a novel paradigm for integrating biological sequences with external knowledge, bridging gaps in traditional classification methods.
pdf
bib
abs
CaseReportCollective: A Large-Scale LLM-Extracted Dataset for Structured Medical Case Reports
Xiao Yu Cindy Zhang
|
Melissa Fong
|
Wyeth Wasserman
|
Jian Zhu
Case reports provide critical insights into rare and atypical diseases, but extracting structured knowledge remains challenging due to unstructured text and domain-specific terminology. We introduce CaseReportCollective, an LLM-extracted dataset of 85,961 open-access case reports spanning 37 years across 14 medical domains, validated through programmatic and human evaluation. Our dataset reveals key publication and demographic trends, including a significant increase in open-access case reports over the past decade, shifts in focus from oncology to COVID-19, and sex disparities in reporting across different medical conditions. Over time, the gap between male and female case reports has narrowed, suggesting greater equity in case reporting. Using CaseReportCollective, we further explore embedding-based retrieval for similar medical topics through accumulated similarity scores across extracted structured information. We also conducted detailed error analyses of the retrieval ranking, finding that frequently reported topics dominate retrieval. Such retrieval is driven by lexical overlap rather than underlying clinical relevance, often failing to distinguish between semantically similar yet mechanistically distinct conditions. Future work should focus on clinically aware embeddings adjusted for long-tailed case distributions to improve retrieval accuracy.
pdf
bib
abs
Enhancing Antimicrobial Drug Resistance Classification by Integrating Sequence-Based and Text-Based Representations
Hyunwoo Yoo
|
Bahrad Sokhansanj
|
James Brown
Antibiotic resistance identification is essential for public health, medical treatment, and drug development. Traditional sequence-based models struggle with accurate resistance prediction due to the lack of biological context. To address this, we propose an NLP-based model that integrates genetic sequences with structured textual annotations, including gene family classifications and resistance mechanisms. Our approach leverages pretrained language models for both genetic sequences and biomedical text, aligning biological metadata with sequence-based embeddings. We construct a novel dataset based on the Antibiotic Resistance Ontology (ARO), consolidating gene sequences with resistance-related textual information. Experiments show that incorporating domain knowledge significantly improves classification accuracy over sequence-only models, reducing reliance on exhaustive laboratory testing. By integrating genetic sequence processing with biomedical text understanding, our approach provides a scalable and interpretable solution for antibiotic resistance prediction.
pdf
bib
abs
Questioning Our Questions: How Well Do Medical QA Benchmarks Evaluate Clinical Capabilities of Language Models?
Siun Kim
|
Hyung-Jin Yoon
Recent advances in large language models (LLMs) have led to impressive performance on medical question-answering (QA) benchmarks. However, the extent to which these benchmarks reflect real-world clinical capabilities remains uncertain. To address this gap, we systematically analyzed the correlation between LLM performance on major medical QA benchmarks (e.g., MedQA, MedMCQA, PubMedQA, and MMLU medicine subjects) and clinical performance in real-world settings. Our dataset included 702 clinical evaluations of 85 LLMs from 168 studies. Benchmark scores demonstrated a moderate correlation with clinical performance (Spearman’s rho = 0.59), albeit substantially lower than inter-benchmark correlations. Among them, MedQA was the most predictive but failed to capture essential competencies such as patient communication, longitudinal care, and clinical information extraction. Using Bayesian hierarchical modeling, we estimated representative clinical performance and identified GPT-4 and GPT-4o as consistently top-performing models, often matching or exceeding human physicians. Despite longstanding concerns about the clinical validity of medical QA benchmarks, this study offers the first quantitative analysis of their alignment with real-world clinical performance.
pdf
bib
abs
Beyond Citations: Integrating Finding-Based Relations for Improved Biomedical Article Representations
Yuan Liang
|
Massimo Poesio
|
Roonak Rezvani
High-quality scientific article embeddings are essential for tasks like document retrieval, citation recommendation, and classification. Traditional citation-based approaches assume citations reflect semantic similarity—an assumption that introduces bias and noise. Recent models like SciNCL and SPECTER2 have attempted to refine citation-based representations but still struggle with noisy citation edges and fail to fully leverage textual information. To address these limitations, we propose a hybrid approach that combines Finding-Citation Graphs (FCG) with contrastive learning. Our method improves triplet selection by filtering out less important citations and incorporating finding similarity relations, leading to better semantic relationship capture. Evaluated on the SciRepEval benchmark, our approach consistently outperforms citation-only baselines, showing the value of text-based semantic structures. While we do not surpass state-of-the-art models in most tasks, our results reveal the limitations of purely citation-based embeddings and suggest paths for improvement through enhanced semantic integration and domain-specific adaptations.
pdf
bib
abs
Converting Annotated Clinical Cases into Structured Case Report Forms
Pietro Ferrazzi
|
Alberto Lavelli
|
Bernardo Magnini
Case Report Forms (CRFs) are widely used in medical research as they ensure the accuracy, reliability, and validity of results in clinical studies. However, publicly available, well-annotated CRF datasets are scarce, limiting the development of CRF slot filling systems able to fill in a CRF from clinical notes. To mitigate the scarcity of CRF datasets, we propose to take advantage of available datasets annotated for information extraction tasks and to convert them into structured CRFs. We present a semi-automatic conversion methodology, which has been applied to the E3C dataset in two languages (English and Italian), resulting in a new, high-quality dataset for CRF slot filling. Through several experiments on the created dataset, we report that slot filling achieves 59.7% for Italian and 67.3% for English with a closed Large Language Model (zero-shot), and worse performance for three families of open-source models, showing that filling CRFs is challenging even for recent state-of-the-art LLMs.
pdf
bib
abs
MuCoS: Efficient Drug–Target Discovery via Multi-Context-Aware Sampling in Knowledge Graphs
Haji Gul
|
Abdul Naim
|
Ajaz Bhat
Accurate prediction of drug–target interactions is critical for accelerating drug discovery. In this work, we frame drug–target prediction as a link prediction task on heterogeneous biomedical knowledge graphs (KG) that integrate drugs, proteins, diseases, pathways, and other relevant entities. Conventional KG embedding methods such as TransE and ComplEx-SE are hindered by their reliance on computationally intensive negative sampling and their limited generalization to unseen drug–target pairs. To address these challenges, we propose Multi-Context-Aware Sampling (MuCoS), a novel framework that prioritizes high-density neighbours to capture salient structural patterns and integrates these with contextual embeddings derived from BERT. By unifying structural and textual modalities and selectively sampling highly informative patterns, MuCoS circumvents the need for negative sampling, significantly reducing computational overhead while enhancing predictive accuracy for novel drug–target associations and drug targets. Extensive experiments on the KEGG50k and PharmKG-8k datasets demonstrate that MuCoS outperforms baselines, achieving up to a 13% improvement in MRR for general relation prediction on KEGG50k, a 22% improvement on PharmKG-8k, and a 6% gain in dedicated drug–target relation prediction on KEGG50k.
pdf
bib
abs
Overcoming Data Scarcity in Named Entity Recognition: Synthetic Data Generation with Large Language Models
An Dao
|
Hiroki Teranishi
|
Yuji Matsumoto
|
Florian Boudin
|
Akiko Aizawa
Named Entity Recognition (NER) is crucial for extracting domain-specific entities from text, particularly in the biomedical and chemical fields. Developing high-quality NER models in specialized domains is challenging due to the limited availability of annotated data, with manual annotation being a key method of data construction. However, manual annotation is time-consuming and requires domain expertise, making it difficult in specialized domains. Traditional data augmentation (DA) techniques also rely on annotated data to some extent, further limiting their effectiveness. In this paper, we propose a novel approach to synthetic data generation for NER using large language models (LLMs) to generate sentences based solely on a set of example entities. This method simplifies the augmentation process and is effective even with a limited set of entities. We evaluate our approach using BERT-based models on the BC4CHEMD, BC5CDR, and TDMSci datasets, demonstrating that synthetic data significantly improves model performance and robustness, particularly in low-resource settings. This work provides a scalable solution for enhancing NER in specialized domains, overcoming the limitations of manual annotation and traditional augmentation methods.
pdf
bib
abs
PetEVAL: A veterinary free text electronic health records benchmark
Sean Farrell
|
Alan Radford
|
Noura Al Moubayed
|
Peter-John Noble
We introduce PetEVAL, the first benchmark dataset derived from real-world, free-text veterinary electronic health records (EHRs). PetEVAL comprises 17,600 professionally annotated EHRs from first-opinion veterinary practices across the UK, partitioned into training (11,000), evaluation (1,600), and test (5,000) sets with distinct clinic distributions to assess model generalisability. Each record is annotated with International Classification of Diseases 11 (ICD-11) syndromic chapter labels (20,408 labels), disease Named Entity Recognition (NER) tags (429 labels), and anonymisation NER tags (8,244 labels). PetEVAL enables the evaluation of Natural Language Processing (NLP) tools across applications, including syndrome surveillance and disease outbreak detection. We implement a multistage anonymisation protocol, replacing identifiable information with clinically relevant pseudonyms while establishing the first definition of identifiers in veterinary free text. PetEVAL introduces three core tasks: syndromic classification, disease entity recognition, and anonymisation. We provide baseline results using BERT-base, PetBERT, and LLaMA 3.1 8B generative models. Our experiments demonstrate the unique challenges of veterinary text, showcasing the importance of domain-specific approaches. By fostering advancements in veterinary informatics and epidemiology, we envision PetEVAL catalysing innovations in veterinary care, animal health, and comparative biomedical research through access to real-world, annotated veterinary clinical data.
pdf
bib
abs
Virtual CRISPR: Can LLMs Predict CRISPR Screen Results?
Steven Song
|
Abdalla Abdrabou
|
Asmita Dabholkar
|
Kastan Day
|
Pavan Dharmoju
|
Jason Perera
|
Volodymyr Kindratenko
|
Aly Khan
CRISPR-Cas systems enable systematic investigation of gene function, but experimental CRISPR screens are resource-intensive. Here, we investigate the potential of Large Language Models (LLMs) to predict the outcomes of CRISPR screens in silico, thereby prioritizing experiments and accelerating biological discovery. We introduce a benchmark dataset derived from BioGRID-ORCS and manually curated sources, and evaluate the performance of several LLMs across various prompting strategies, including chain-of-thought and few-shot learning. Furthermore, we develop a novel, efficient prediction framework using LLM-derived embeddings, achieving significantly improved performance and scalability compared to direct prompting. Our results demonstrate the feasibility of using LLMs to guide CRISPR screen experiments.
pdf
bib
abs
Overview of the BioLaySumm 2025 Shared Task on Lay Summarization of Biomedical Research Articles and Radiology Reports
Chenghao Xiao
|
Kun Zhao
|
Xiao Wang
|
Siwei Wu
|
Sixing Yan
|
Tomas Goldsack
|
Sophia Ananiadou
|
Noura Al Moubayed
|
Liang Zhan
|
William K. Cheung
|
Chenghua Lin
This paper presents the setup and results of the third edition of the BioLaySumm shared task on Lay Summarization of Biomedical Research Articles and Radiology Reports, hosted at the BioNLP Workshop at ACL 2025. In this task edition, we aim to build on the first two editions’ successes by further increasing research interest in this important task and encouraging participants to explore novel approaches that will help advance the state-of-the-art. Specifically, we introduce the new task of Radiology Report Generation with Layman’s Terms, which is parallel to the task of lay summarization of biomedical articles in the first two editions. Overall, our results show that a broad range of innovative approaches were adopted by task participants, including inspiring explorations of the latest RL techniques used in the training of general-domain large reasoning models.
pdf
bib
abs
Overview of the ClinIQLink 2025 Shared Task on Medical Question-Answering
Brandon Colelough
|
Davis Bartels
|
Dina Demner-Fushman
In this paper, we present an overview of ClinIQLink, a shared task collocated with the 24th BioNLP workshop at ACL 2025, designed to stress-test large language models (LLMs) on medically oriented question answering aimed at the level of a General Practitioner. The challenge supplies 4,978 expert-verified, medical source-grounded question–answer pairs that cover seven formats: true/false, multiple choice, unordered list, short answer, short-inverse, multi-hop, and multi-hop-inverse. Participating systems, bundled in Docker or Apptainer images, are executed on the CodaBench platform or the University of Maryland’s Zaratan cluster. An automated harness (Task 1) scores closed-ended items by exact match and open-ended items with a three-tier embedding metric. A subsequent physician panel (Task 2) audits the top model responses.
pdf
bib
abs
SMAFIRA Shared Task at the BioNLP’2025 Workshop: Assessing the Similarity of the Research Goal
Mariana Neves
|
Iva Sovadinova
|
Susanne Fieberg
|
Celine Heinl
|
Diana Rubel
|
Gilbert Schönfelder
|
Bettina Bert
We organized the SMAFIRA shared task in the scope of the BioNLP’2025 Workshop. Given two articles, our goal was to collect annotations about the similarity of their research goals. The test sets consisted of a list of reference articles and their corresponding top 20 similar articles from PubMed. The task consisted of annotating the similar articles regarding the similarity of their research goal with respect to that of the corresponding reference article. The assessment of the similarity was based on three labels: “similar”, “uncertain”, or “not similar”. We released two batches of test sets: (a) a first batch of 25 reference articles for five diseases; and (b) a second batch of 80 reference articles for 16 diseases. We collected manual annotations from two teams (RCX and Bf3R) and automatic predictions from two large language models (GPT-4o-mini and Llama3.3). The preliminary evaluation showed rather low agreement between the annotators; however, some pairs could potentially be part of a future dataset.
pdf
bib
abs
Overview of the ArchEHR-QA 2025 Shared Task on Grounded Question Answering from Electronic Health Records
Sarvesh Soni
|
Soumya Gayen
|
Dina Demner-Fushman
This paper presents an overview of the ArchEHR-QA 2025 shared task, which was organized with the 24th BioNLP Workshop at ACL 2025. The goal of this shared task is to develop automated responses to patients’ questions by generating answers that are grounded in key clinical evidence from patients’ electronic health records (EHRs). A total of 29 teams participated in the task, collectively submitting 75 systems, with 24 teams providing their system descriptions. The submitted systems encompassed diverse architectures (including approaches that select the most relevant evidence prior to answer generation), leveraging both proprietary and open-weight large language models, as well as employing various tuning strategies such as fine-tuning and few-shot learning. In this paper, we describe the task setup, the dataset used, the evaluation criteria, and the baseline systems. Furthermore, we summarize the methodologies adopted by participating teams and present a comprehensive evaluation and analysis of the submitted systems.
pdf
bib
BioNLP 2025 Shared Tasks
Sarvesh Soni
|
Dina Demner-Fushman
pdf
bib
abs
ArgHiTZ at ArchEHR-QA 2025: A Two-Step Divide and Conquer Approach to Patient Question Answering for Top Factuality
Adrian Cuadron Cortes
|
Aimar Sagasti
|
Maitane Urruela
|
Iker De La Iglesia
|
Ane García Domingo-aldama
|
Aitziber Atutxa Salazar
|
Josu Goikoetxea
|
Ander Barrena
This work presents three different approaches to address the ArchEHR-QA 2025 Shared Task on automated patient question answering. We introduce an end-to-end prompt-based baseline and two two-step methods that divide the task, without utilizing any external knowledge. Both two-step approaches first extract essential sentences from the clinical text, either by prompting or by similarity ranking, and then generate the final answer from these notes. Results indicate that the re-ranker-based two-step system performs best, highlighting the importance of selecting the right approach for each subtask. Our best run achieved an overall score of 0.44, ranking 8th out of 30 on the leaderboard and securing the top position in overall factuality.
pdf
bib
abs
UNIBUC-SD at ArchEHR-QA 2025: Prompting Our Way to Clinical QA with Multi-Model Ensembling
Dragos Ghinea
|
Ștefania Rîncu
In response to the ArchEHR-QA 2025 shared task, we present an efficient approach to patient question answering using small, pre-trained models that are widely available to the research community. Our method employs multi-prompt ensembling with models such as Gemma and Mistral, generating binary relevance judgments for clinical evidence extracted from electronic health records (EHRs). We use two distinct prompts (A and B) to assess the relevance of paragraphs to a patient’s question and aggregate the model outputs via a majority vote ensemble. The relevant passages are then summarized using a third prompt (C) with Gemma. By leveraging off-the-shelf models and consumer-grade hardware (1x RTX 5090), we demonstrate that it is possible to improve performance without relying on resource-intensive fine-tuning or training. Additionally, we explore the impact of Chain-of-Thought (CoT) prompting and compare the performance of specialized versus general-purpose models, showing that significant improvements can be achieved through effective use of existing models.
pdf
bib
abs
Loyola at ArchEHR-QA 2025: Exploring Unsupervised Attribution of Generated Text: Attention and Clustering-Based Methods
Rohan Sethi
|
Timothy Miller
|
Majid Afshar
|
Dmitriy Dligach
The increasing volume of patient messages via electronic health record (EHR) portals has contributed significantly to clinician workload. Automating responses to these messages can help alleviate this burden, but it is essential to ensure that the generated responses are grounded in accurate clinical evidence. As part of the ArchEHR-QA 2025 BioNLP ACL shared task, we explore unsupervised methods for generating patient question responses that are both contextually accurate and evidence-backed. We investigate three novel approaches: zero-shot prompting, clustering-based evidence selection, and attention-based evidence attribution, along with a hybrid model that combines clustering and attention. Our methods do not require model fine-tuning and leverage the inherent structure of the input data to identify the most relevant supporting evidence from clinical notes. Our best-performing approach, which integrates clustering and attention, demonstrates a substantial improvement in factuality over baseline zero-shot methods, highlighting the potential of unsupervised strategies for enhancing the clinical utility of large language models in EHR contexts.
pdf
bib
abs
CUNI-a at ArchEHR-QA 2025: Do we need Giant LLMs for Clinical QA?
Vojtech Lanz
|
Pavel Pecina
In this paper, we present our submission to the ArchEHR-QA 2025 shared task, which focuses on answering patient questions based on excerpts from electronic health record (EHR) discharge summaries. Our approach identifies essential sentences relevant to a patient’s question using a combination of few-shot inference with the Med42-8B model, cosine similarity over clinical term embeddings, and the MedCPT cross-encoder relevance model. Then, concise answers are generated on the basis of these selected sentences. Despite not relying on large language models (LLMs) with tens of billions of parameters, our method achieves competitive results, demonstrating the potential of resource-efficient solutions for clinical NLP applications.
pdf
bib
abs
WisPerMed at ArchEHR-QA 2025: A Modular, Relevance-First Approach for Grounded Question Answering on Electronic Health Records
Jan-Henning Büns
|
Hendrik Damm
|
Tabea Pakull
|
Felix Nensa
|
Elisabeth Livingstone
Automatically answering patient questions based on electronic health records (EHRs) requires systems that both identify relevant evidence and generate accurate, grounded responses. We present a three-part pipeline developed by WisPerMed for the ArchEHR-QA 2025 shared task. First, a fine-tuned BioClinicalBERT model classifies note sentences by their relevance using synonym-based and paraphrased data augmentation. Second, a constrained generation step uses DistilBART-MedSummary to produce faithful answers strictly limited to top-ranked evidence. Third, we align each answer sentence to its supporting evidence via BiomedBERT embeddings and ROUGE-based similarity scoring to ensure citation transparency. Our system achieved a 35.0% overall score on the hidden test set, outperforming the organizer’s baseline by 4.3 percentage points. Gains in BERTScore (+44%) and SARI (+119%) highlight substantial improvements in semantic accuracy and relevance. This modular approach demonstrates that enforcing evidence-awareness and citation grounding enhances both answer quality and trustworthiness in clinical QA systems.
pdf
bib
abs
heiDS at ArchEHR-QA 2025: From Fixed-k to Query-dependent-k for Retrieval Augmented Generation
Ashish Chouhan
|
Michael Gertz
This paper presents the approach of our team called heiDS for the ArchEHR-QA 2025 shared task. A pipeline using a retrieval augmented generation (RAG) framework is designed to generate answers that are attributed to clinical evidence from the electronic health records (EHRs) of patients in response to patient-specific questions. We explored various components of a RAG framework, focusing on ranked list truncation (RLT) retrieval strategies and attribution approaches. Instead of using a fixed top-k RLT retrieval strategy, we employ a query-dependent-k retrieval strategy, including the existing surprise and autocut methods and two new methods proposed in this work, autocut* and elbow. The experimental results show the benefits of our strategy in producing factual and relevant answers when compared to a fixed-k.
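The query-dependent cutoff idea can be illustrated with a simple elbow heuristic over the ranked retrieval scores: truncate the list where the score drops the most. The snippet below is a generic sketch of such a heuristic, not necessarily the exact autocut*/elbow formulation used by the team.

```python
# Generic illustration of a query-dependent ranked list truncation heuristic.
def elbow_cutoff(scores: list[float]) -> int:
    """Return k, the number of top-ranked documents to keep.
    `scores` must be sorted in descending order."""
    if len(scores) < 2:
        return len(scores)
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    return max(range(len(drops)), key=drops.__getitem__) + 1  # cut after the largest drop

# Example: a sharp drop after the third score truncates the list at k = 3.
print(elbow_cutoff([0.92, 0.88, 0.85, 0.41, 0.39, 0.10]))  # -> 3
```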
pdf
bib
abs
UniBuc-SB at ArchEHR-QA 2025: A Resource-Constrained Pipeline for Relevance Classification and Grounded Answer Synthesis
Sebastian Balmus
|
Dura Bogdan
|
Ana Sabina Uban
We describe the UniBuc-SB submission to the ArchEHR-QA shared task, which involved generating grounded answers to patient questions based on electronic health records. Our system exceeded the performance of the provided baseline, achieving higher performance in generating contextually relevant responses. Notably, we developed our approach under constrained computational resources, utilizing only a single NVIDIA RTX 4090 GPU. We refrained from incorporating any external datasets, relying solely on the limited training data supplied by the organizers. To address the challenges posed by the low-resource setting, we leveraged off-the-shelf pre-trained language models and fine-tuned them minimally, aiming to maximize performance while minimizing overfitting.
pdf
bib
abs
KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering
Adam Kovacs
|
Paul Schmitt
|
Gabor Recski
We present a lightweight, domain‐agnostic verbatim pipeline for evidence‐grounded question answering. Our pipeline operates in two steps: first, a sentence-level extractor flags relevant note sentences using either zero-shot LLM prompts or supervised ModernBERT classifiers. Next, an LLM drafts a question-specific template, which is filled verbatim with sentences from the extraction step. This prevents hallucinations and ensures traceability. In the ArchEHR‐QA 2025 shared task, our system scored 42.01%, ranking top‐10 in core metrics and outperforming the organiser’s 70B‐parameter Llama‐3.3 baseline. We publicly release our code and inference scripts under an MIT license.
pdf
bib
abs
LAILab at ArchEHR-QA 2025: Test-time scaling for evidence selection in grounded question answering from electronic health records
Tuan Dung Le
|
Thanh Duong
|
Shohreh Haddadan
|
Behzad Jazayeri
|
Brandon Manley
|
Thanh Thieu
This paper presents our approach to the ArchEHR shared task on generating answers to real-world patient questions grounded in evidence from electronic health records (EHRs). We investigate the zero-shot capabilities of general-purpose, domain-agnostic large language models (LLMs) in two key aspects: identifying essential supporting evidence and producing concise, coherent answers. To this end, we propose a two-stage pipeline: (1) evidence identification via test-time scaling (TTS) and (2) generation of the final answer conditioned on the evidence selected in the previous stage. Our approach leverages high-temperature sampling to generate multiple outputs during the evidence selection phase. This TTS-based approach effectively explores more potential evidence, which results in a significant improvement in the factuality score of the answers.
pdf
bib
abs
UTSA-NLP at ArchEHR-QA 2025: Improving EHR Question Answering via Self-Consistency Prompting
Sara Shields-Menard
|
Zach Reimers
|
Joshua Gardner
|
David Perry
|
Anthony Rios
We describe our system for the ArchEHR-QA Shared Task on answering clinical questions using electronic health records (EHRs). Our approach uses large language models in two steps: first, to find sentences in the EHR relevant to a clinician’s question, and second, to generate a short, citation-supported response based on those sentences. We use few-shot prompting, self-consistency, and thresholding to improve the sentence classification step to decide which sentences are essential. We compare several models and find that a smaller 8B model performs better than a larger 70B model for identifying relevant information. Our results show that accurate sentence selection is critical for generating high-quality responses and that self-consistency with thresholding helps make these decisions more reliable.
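A minimal sketch of the self-consistency-with-thresholding step described above: sample several relevance judgments per sentence and keep the sentence only if the fraction of positive votes clears a threshold. The `ask_llm` stub and the 0.6 threshold are placeholders for illustration, not the team’s actual prompt or setting.

```python
# Illustrative sketch of self-consistency voting with a decision threshold.
import random

def ask_llm(question: str, sentence: str) -> bool:
    # Placeholder: in practice this would prompt an LLM at high temperature
    # with few-shot examples and parse a yes/no relevance judgment.
    return random.random() > 0.5

def is_essential(question: str, sentence: str, n_samples: int = 7,
                 threshold: float = 0.6) -> bool:
    votes = [ask_llm(question, sentence) for _ in range(n_samples)]
    return sum(votes) / n_samples >= threshold  # keep only consistently "yes" sentences
```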
pdf
bib
abs
UTSamuel at ArchEHR-QA 2025: A Clinical Question Answering System for Responding to Patient Portal Messages Using Generative AI
Samuel Reason
|
Liwei Wang
|
Hongfang Liu
|
Ming Huang
Responding to patient portal messages places a substantial burden on clinicians. To mitigate this, automatically generating answers to patient questions by considering their medical records is a critical solution. In this study, we proposed a clinical question answering system for the BioNLP 2025 Shared Task on Grounded Electronic Health Record Question Answering. The system processed each patient message case by selecting relevant sentences as evidences from the associated clinical notes and generating a concise, medically accurate answer to the patient’s question. A generative AI model from OpenAI (GPT-4o) was leveraged to assist with sentence selection and answer generation. Each response is grounded in source text, limited to 75 words, and includes sentence-level citations. The system was evaluated on 100 test cases using alignment, citation, and summarization metrics. Our results indicate the significant potential of the clinical question answering system based on generative AI models to streamline communication between patients and healthcare providers by automatically generating responses to patient messages.
pdf
bib
abs
LAMAR at ArchEHR-QA 2025: Clinically Aligned LLM-Generated Few-Shot Learning for EHR-Grounded Patient Question Answering
Seksan Yoadsanit
|
Nopporn Lekuthai
|
Watcharitpol Sermsrisuwan
|
Titipat Achakulvisut
This paper presents an approach to answering patient-specific medical questions using electronic health record (EHR) grounding with ArchEHR-QA 2025 datasets. We address medical question answering as an alignment problem, focusing on generating responses factually consistent with patient-specific clinical notes through in-context learning techniques. We show that LLM-generated responses, used as few-shot examples with GPT-4.1 and Gemini-2.5-Pro, significantly outperform baseline approaches (overall score = 49.1), achieving strict precision, recall, and F1-micro scores of 60.6, 53.6, and 56.9, respectively, on the ArchEHR-QA 2025 test leaderboard. It achieves textual similarity between answers and essential evidence using BLEU, ROUGE, SARI, BERTScore, AlignScore, and MEDCON scores of 6.0, 32.1, 65.8, 36.4, 64.3, and 43.6, respectively. Our findings highlight the effectiveness of combining EHR grounding with few-shot examples for personalized medical question answering, establishing a promising approach for developing accurate and personalized medical question answering systems. We release our code at https://github.com/biodatlab/archehr-qa-lamar.
pdf
bib
abs
Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering
Sai Prasanna Teja Reddy Bogireddy
|
Abrar Majeedi
|
Viswanath Gajjala
|
Zhuoyan Xu
|
Siddhant Rai
|
Vaishnav Potlapalli
Automated question answering (QA) over electronic health records (EHRs) can bridge critical information gaps for clinicians and patients, yet it demands both precise evidence retrieval and faithful answer generation under limited supervision. In this work, we present Neural, the runner-up in the BioNLP 2025 ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method decouples the task into (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. For each stage, we automatically explore the prompt space with DSPy’s MIPROv2 optimizer, jointly tuning instructions and few-shot demonstrations on the development set. A self-consistency voting scheme further improves evidence recall without sacrificing precision. On the hidden test set, our method attains an overall score of 51.5, placing second overall while outperforming standard zero-shot and few-shot prompting by over 20 and 10 points, respectively. These results indicate that data-driven prompt optimization is a cost-effective alternative to model fine-tuning for high-stakes clinical QA, advancing the reliability of AI assistants in healthcare.
pdf
bib
abs
UIC at ArchEHR-QA 2025: Tri-Step Pipeline for Reliable Grounded Medical Question Answering
Mohammad Arvan
|
Anuj Gautam
|
Mohan Zalake
|
Karl M. Kochendorfer
Automated response generation from electronic health records (EHRs) holds potential to reduce clinician workload, but it introduces important challenges related to factual accuracy and reliable grounding in clinical evidence. We present a structured three-step pipeline that uses large language models (LLMs) for evidence classification, guided response generation, and iterative quality control. To enable rigorous evaluation, our framework combines traditional reference-based metrics with a claim-level “LLM-as-a-Judge” methodology. On the ArchEHR-QA benchmark, our system achieves 82.0 percent claim-level evidence faithfulness and 51.6 percent citation-level factuality, demonstrating strong performance in generating clinically grounded responses. These findings highlight the utility of structured LLM pipelines in healthcare applications, while also underscoring the importance of transparent evaluation and continued refinement. All code, prompt templates, and evaluation tools are publicly available.
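The claim-level "LLM-as-a-Judge" scoring mentioned above can be sketched minimally as follows; the `judge` callable stands in for an LLM prompt, and both it and the aggregation rule are assumptions rather than the authors' evaluation tool.

```python
from typing import Callable, List

def claim_level_faithfulness(
    claims: List[str],
    cited_evidence: List[str],
    judge: Callable[[str, str], bool],  # hypothetical LLM judge: "is this claim supported by the evidence?"
) -> float:
    """Score a generated response as the fraction of its atomic claims that the
    judge model deems supported by the cited EHR sentences."""
    if not claims:
        return 0.0
    evidence = "\n".join(cited_evidence)
    supported = sum(1 for claim in claims if judge(claim, evidence))
    return supported / len(claims)
```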
pdf
bib
abs
DMIS Lab at ArchEHR-QA 2025: Evidence-Grounded Answer Generation for EHR-based QA via a Multi-Agent Framework
Hyeon Hwang
|
Hyeongsoon Hwang
|
Jongmyung Jung
|
Jaehoon Yun
|
Minju Song
|
Yein Park
|
Dain Kim
|
Taewhoo Lee
|
Jiwoong Sohn
|
Chanwoong Yoon
|
Sihyeon Park
|
Jiwoo Lee
|
Heechul Yang
|
Jaewoo Kang
The increasing utilization of patient portals has amplified clinicians’ workloads, primarily due to the necessity of addressing detailed patient inquiries related to their health concerns. The ArchEHR-QA 2025 shared task aims to alleviate this burden by automatically generating accurate, evidence-grounded responses to patients’ questions based on their Electronic Health Records (EHRs). This paper presents a six-stage multi-agent framework specifically developed to identify essential clinical sentences for answering patient questions, leveraging large language models (LLMs). Our approach begins with OpenAI’s o3 model generating focused medical context to guide downstream reasoning. In the subsequent stages, GPT-4.1-based agents assess the relevance of individual sentences, recruit domain experts, and consolidate their judgments to identify essential information for constructing coherent, evidence-grounded responses. Our framework achieved an Overall Factuality score of 62.0 and an Overall Relevance Score of 52.9 on the development set, and corresponding scores of 58.6 and 48.8, respectively, on the test set.
pdf
bib
abs
CogStack-KCL-UCL at ArchEHR-QA 2025: Investigating Hybrid LLM Approaches for Grounded Clinical Question Answering
Shubham Agarwal
|
Thomas Searle
|
Kawsar Noor
|
Richard Dobson
We present our system for the ArchEHR shared task, which focuses on answering clinical and patient-facing questions grounded in real-world EHR data. Our core contribution is a two-stage prompting pipeline that separates evidence selection from answer generation while employing in-context learning strategies. Our experimentation leveraged the open-weight Gemma-v3 family of models, with our best submission using the Gemma-12B model securing 5th place overall on the unseen test set. Through systematic experimentation, we demonstrate the effectiveness of task decomposition in improving both factual accuracy and answer relevance in grounded clinical question answering.
pdf
bib
abs
SzegedAI at ArchEHR-QA 2025: Combining LLMs with traditional methods for grounded question answering
Soma Nagy
|
Bálint Nyerges
|
Zsombor Kispéter
|
Gábor Tóth
|
András Szlúka
|
Gábor Kőrösi
|
Zsolt Szántó
|
Richárd Farkas
In this paper, we present the SzegedAI team’s submissions to the ArchEHR-QA 2025 shared task. Our approaches include multiple prompting techniques for large language models (LLMs), sentence similarity methods, and traditional feature engineering. We aim to explore both modern and traditional solutions to the task. To combine the strengths of these diverse methods, we employed different ensembling strategies.
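One simple way to ensemble heterogeneous sentence selectors is a majority vote, sketched below under the assumption of a plain vote-count rule; the system names, vote threshold, and helper function are illustrative and not the team's actual ensembling strategies.

```python
from collections import Counter
from typing import Dict, List

def majority_vote_ensemble(predictions: Dict[str, List[int]], n_sentences: int, min_votes: int = 2) -> List[int]:
    """Keep a sentence if at least `min_votes` of the component systems
    (e.g., LLM prompting, sentence similarity, feature-based classifier) selected it."""
    counts = Counter(i for selected in predictions.values() for i in set(selected))
    return [i for i in range(n_sentences) if counts[i] >= min_votes]

# Usage: hypothetical selections from three systems over a six-sentence note.
votes = {"llm_prompt": [0, 2, 3], "similarity": [2, 3, 5], "features": [2, 4]}
print(majority_vote_ensemble(votes, n_sentences=6))  # -> [2, 3]
```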
pdf
bib
abs
LIMICS at ArchEHR-QA 2025: Prompting LLMs Beats Fine-Tuned Embeddings
Adam Remaki
|
Armand Violle
|
Vikram Natraj
|
Étienne Guével
|
Akram Redjdal
In this paper, we investigated two approaches to clinical question-answering based on patient-formulated questions, supported by their narratives and brief medical records. The first approach leverages zero- and few-shot prompt engineering techniques with GPT-based Large Language Models (LLMs), incorporating strategies such as prompt chaining and chain-of-thought reasoning to guide the models in generating answers. The second approach adopts a two-step structure: first, a text-classification stage uses embedding-based models (e.g., BERT variants) to identify sentences within the medical record that are most relevant to the given question; then, we prompt an LLM to paraphrase them into an answer so that it is generated exclusively from these selected sentences. Our empirical results demonstrate that the first approach outperforms the classification-guided pipeline, achieving the highest score on the development set and the test set using prompt chaining. Code: github.com/armandviolle/BioNLP-2025
pdf
bib
abs
razreshili at ArchEHR-QA 2025: Contrastive Fine-Tuning for Retrieval-Augmented Biomedical QA
Arina Zemchyk
We present a retrieval-augmented system for the ArchEHR-QA 2025 shared task, which focuses on generating concise, medically accurate answers to clinical questions based on a patient’s electronic health record (EHR). A key challenge is following a strict citation format that references relevant sentence IDs. To improve retrieval, we fine-tuned an all-MiniLM-L6-v2 embedding model using contrastive learning on over 2,300 question–sentence triplets, with DoRA for efficient adaptation. Sentences were selected using cosine similarity thresholds and passed into a quantized Mistral-7B-Instruct model along with a structured prompt. Our system achieved similar relevance to the baseline but lower overall performance (19.3 vs. 30.7), due to issues with citation formatting and generation quality. We discuss limitations such as threshold tuning, prompt-following ability, and model size, and suggest future directions for improving structured biomedical QA.
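A minimal sketch of the cosine-similarity-threshold selection step, using the base all-MiniLM-L6-v2 checkpoint via sentence-transformers; the contrastive/DoRA fine-tuning described above is omitted, and the threshold value is an assumption.

```python
from sentence_transformers import SentenceTransformer, util

# Base checkpoint; the contrastive/DoRA fine-tuning described above is not shown here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def select_evidence(question: str, note_sentences: list[str], threshold: float = 0.45) -> list[int]:
    """Return the indices of note sentences whose cosine similarity to the question
    clears a fixed threshold, mirroring the retrieval step described above."""
    q_emb = model.encode(question, convert_to_tensor=True, normalize_embeddings=True)
    s_emb = model.encode(note_sentences, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(q_emb, s_emb)[0]
    return [i for i, score in enumerate(sims.tolist()) if score >= threshold]
```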
pdf
bib
abs
DKITNLP at ArchEHR-QA 2025: A Retrieval Augmented LLM Pipeline for Evidence-Based Patient Question Answering
Provia Kadusabe
|
Abhishek Kaushik
|
Fiona Lawless
This paper describes our submission for the BioNLP ACL 2025 Shared Task on grounded Question Answering (QA) from Electronic Health Records (EHRs). The task aims to automatically generate answers to patients’ health-related questions that are grounded in the evidence from their clinical notes. We propose a two-stage retrieval pipeline to identify relevant sentences to guide response generation by a Large Language Model (LLM). Specifically, our approach uses a BioBERT-based bi-encoder for initial retrieval, followed by a re-ranking step using a fine-tuned cross-encoder to enhance retrieval precision. The final set of selected sentences serves as input to a Mistral 7B model, which generates answers through few-shot prompting. Our approach achieves an overall score of 31.6 on the test set, outperforming a substantially larger baseline model, LLaMA 3.3 70B (30.7), which demonstrates the effectiveness of retrieval-augmented generation for grounded QA.
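The retrieve-then-rerank pipeline can be sketched with sentence-transformers as below. The specific bi-encoder and cross-encoder checkpoints are public stand-ins (the paper fine-tunes its own cross-encoder), and the k values are illustrative.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Public stand-ins: the paper uses a BioBERT-based bi-encoder and a fine-tuned cross-encoder.
bi_encoder = SentenceTransformer("pritamdeka/S-BioBert-snli-multinli-stsb")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_then_rerank(question: str, sentences: list[str], k_initial: int = 20, k_final: int = 5) -> list[str]:
    """Stage 1: dense retrieval of candidate sentences with a bi-encoder.
    Stage 2: re-rank the candidates with a cross-encoder and keep the top few."""
    q = bi_encoder.encode(question, convert_to_tensor=True)
    s = bi_encoder.encode(sentences, convert_to_tensor=True)
    hits = util.semantic_search(q, s, top_k=min(k_initial, len(sentences)))[0]
    candidates = [sentences[h["corpus_id"]] for h in hits]
    scores = cross_encoder.predict([(question, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:k_final]]
```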
pdf
bib
abs
AEHRC at BioLaySumm 2025: Leveraging T5 for Lay Summarisation of Radiology Reports
Wenjun Zhang
|
Shekhar Chandra
|
Bevan Koopman
|
Jason Dowling
|
Aaron Nicolson
Biomedical texts, such as research articles and clinical reports, are often written in highly technical language, making them difficult for patients and the general public to understand. The BioLaySumm 2025 Shared Task addresses this challenge by promoting the development of models that generate lay summaries of biomedical content. This paper focuses on Subtask 2.1: Radiology Report Generation with Layman’s Terms. In this work, we evaluate two large language model (LLM) architectures, T5-large (a 700M-parameter encoder–decoder model) and LLaMA-3.2-3B (a 3B-parameter decoder-only model). Both models are trained under fully supervised conditions using the task’s multi-source dataset. Our results show that T5-large consistently outperforms LLaMA-3.2-3B across nine out of ten metrics, including relevance, readability, and clinical accuracy, despite having only a quarter of the parameters. Our T5-based model achieved the top rank in both the open-source and closed-source tracks of Subtask 2.1.
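A minimal sketch of lay-report generation with a T5 encoder–decoder via Hugging Face transformers; the checkpoint is the public t5-large base (the paper fine-tunes it on the task's multi-source data, which is not shown), and the task prefix and example report are assumptions.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Public base checkpoint; the submission fine-tunes it on the task data (not shown here).
ckpt = "t5-large"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = T5ForConditionalGeneration.from_pretrained(ckpt)

report = "Findings: Mild cardiomegaly. No focal consolidation, effusion, or pneumothorax."
inputs = tokenizer("summarize in plain language: " + report, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```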
pdf
bib
abs
MetninOzU at BioLaySumm2025: Text Summarization with Reverse Data Augmentation and Injecting Salient Sentences
Egecan Evgin
|
Ilknur Karadeniz
|
Olcay Taner Yıldız
In this paper, we present our approach to the BioLaySumm 2025 Shared Task on lay summarization of biomedical research articles, which was conducted as part of the BioNLP Workshop 2025. The aim of the task is to create lay summaries of scientific articles to improve accessibility for a non-expert audience. To this end, we applied preprocessing techniques to clean and standardize the input texts, and fine-tuned Qwen2.5- and Qwen3-based language models for the summarization task. For abstract-based fine-tuning, we investigated whether inserting salient sentences from the main article into the summary enriches the input. We also curated a dataset of child-friendly articles with corresponding gold-standard summaries and used large language models to rewrite them into more complex scientific variants to augment our training data.
pdf
bib
abs
Shared Task at Biolaysumm2025 : Extract then summarize approach Augmented with UMLS based Definition Retrieval for Lay Summary generation.
Aaradhya Gupta
|
Parameswari Krishnamurthy
The paper presents a modular, two‐track lay‐summary generation system for biomedical research articles, evaluated on the PLOS and eLife subsets of the BioLaySumm2025 shared task. In Task 1, it extracts salient sentences via an LLM–based chunking and summarization pipeline, then applies iterative rewriting to produce an accessible summary. In Task 2, it augments that summary with UMLS‐sourced definitions identified by a BioBERT NER model, yielding improved readability and factual consistency, at the cost of slight reductions in n‐gram overlap metrics like ROUGE and BLEU.
pdf
bib
abs
RainCityNLP at BioLaySumm2025: Extract then Summarize at Home
Jen Wilson
|
Michael Pollack
|
Rachel Edwards
|
Avery Bellamy
|
Helen Salgi
As part of the BioLaySumm shared task at ACL 2025, we developed a summarization tool designed to translate complex biomedical texts into layperson-friendly summaries. Our goal was to enhance accessibility and comprehension for patients and others without specialized medical knowledge. The system employed an extractive-then-abstractive summarization pipeline. For the abstractive component, we experimented with two models: Pegasus-XSum and a Falcons.ai model pre-trained on medical data. Final outputs were evaluated using the official BioLaySumm 2025 metrics. To promote practical accessibility, we completed all experimentation on consumer-grade hardware, demonstrating the feasibility of our approach in low-resource settings.
pdf
bib
abs
TLPIQ at BioLaySumm: Hide and Seq, a FLAN-T5 Model for Biomedical Summarization
Melody Bechler
|
Carly Crowther
|
Emily Luedke
|
Natasha Schimka
|
Ibrahim Sharaf
BioLaySumm 2025 is a shared task that aims to automatically generate lay summaries of scientific papers for a wider audience of readers without domain-specific knowledge, making scientific discoveries in the domain of biology and medicine more accessible to the general public. Our submission to the task is a FLAN-T5 base model fine-tuned on the abstract and conclusion of articles and expert-written lay summaries from the shared task’s provided datasets. We find that our system performs competitively in terms of relevance, exceeds the baseline on factuality, but falls short on readability.
pdf
bib
abs
LaySummX at BioLaySumm: Retrieval-Augmented Fine-Tuning for Biomedical Lay Summarization Using Abstracts and Retrieved Full-Text Context
Fan Lin
|
Dezhi Yu
Generating lay summaries of biomedical research remains a time-intensive task, despite their importance in bridging the gap between scientific findings and non-expert audiences. This study introduces a retrieval-augmented fine-tuning framework for biomedical lay summarization, integrating abstract-driven semantic retrieval with LoRA-tuned LLaMA 3.1 models. Abstracts are used as queries to retrieve relevant text segments from full-text articles, which are then incorporated into prompts for supervised fine-tuning. Evaluations on the PLOS and eLife datasets show that this hybrid approach significantly improves relevance and factuality metrics compared to both base models and those tuned individually, while maintaining competitive readability. Prompt design experiments highlight a trade-off between readability and factual accuracy. Our fine-tuned model demonstrates strong performance in relevance and factuality among open-source systems and rivals closed-source models such as GPT, providing an efficient and effective solution for domain-specific lay summarization.
pdf
bib
abs
5cNLP at BioLaySumm2025: Prompts, Retrieval, and Multimodal Fusion
Juan Antonio Lossio-Ventura
|
Callum Chan
|
Arshitha Basavaraj
|
Hugo Alatrista-Salas
|
Francisco Pereira
|
Diana Inkpen
In this work, we present our approach to addressing all subtasks of the BioLaySumm 2025 shared task by leveraging prompting and retrieval strategies, as well as multimodal input fusion. Our method integrates: (1) zero-shot and few-shot prompting with large language models (LLMs); (2) semantic similarity-based dynamic few-shot prompting; (3) retrieval-augmented generation (RAG) incorporating biomedical knowledge from the Unified Medical Language System (UMLS); and (4) a multimodal fusion pipeline that combines images and captions using image-text-to-text generation for enriched lay summarization. Our framework enables lightweight adaptation of pretrained LLMs for generating lay summaries from scientific articles and radiology reports. Using modern LLMs, including Llama-3.3-70B-Instruct and GPT-4.1, our 5cNLP team achieved third place in Subtask 1.2 and second place in Subtask 2.1, among all submissions.
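The semantic similarity-based dynamic few-shot prompting in (2) can be sketched as follows; the encoder choice and the structure of the example pool are assumptions, not the team's implementation.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # encoder choice is an assumption

def dynamic_few_shot(query_text: str, pool: list[dict], k: int = 3) -> list[dict]:
    """Pick the k training examples whose source texts are most similar to the input;
    these are inserted into the prompt as demonstrations (dynamic few-shot prompting)."""
    q = encoder.encode(query_text, convert_to_tensor=True)
    corpus = encoder.encode([ex["source"] for ex in pool], convert_to_tensor=True)
    hits = util.semantic_search(q, corpus, top_k=min(k, len(pool)))[0]
    return [pool[h["corpus_id"]] for h in hits]
```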
pdf
bib
abs
MIRAGES at BioLaySumm2025: The Impact of Search Terms and Data Curation for Biomedical Lay Summarization
Benjamin Pong
|
Ju-Hui Chen
|
Jonathan Jiang
|
Abimael Jimenez
|
Melody Vahadi
Biomedical articles are often inaccessible to non-experts due to their technical complexity. To improve the readability and factuality of lay summaries, we built on an extract-then-summarize framework by experimenting with novel extractive summarization strategies and employing Low-Rank Adaptation (LoRA) fine-tuning of Meta-Llama-3-8B-Instruct on data selected by these strategies. We also explored counterfactual data augmentation and post-processing definition insertion to further enhance factual grounding and accessibility. Our best-performing system treats the article’s title and keywords (i.e. search terms) as a single semantic centroid and ranks sentences by their semantic similarity to this centroid. This constrained selection of data serves as input for fine-tuning, achieving marked improvements in the readability and factuality of downstream abstractive summaries while maintaining relevance. Our approach highlights the importance of quality data curation for biomedical lay summarization, resulting in the 4th-best overall performance and the 2nd-best readability performance in the BioLaySumm 2025 Shared Task at BioNLP 2025.
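The centroid-based extractive selection described above can be sketched as below; the encoder choice and the number of sentences kept are assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # encoder choice is an assumption

def rank_by_centroid(title: str, keywords: list[str], sentences: list[str], top_n: int = 10) -> list[str]:
    """Treat the title and keywords as a single semantic centroid and select the
    article sentences most similar to it, preserving document order."""
    centroid = encoder.encode(" ".join([title] + keywords), normalize_embeddings=True)
    sent_emb = encoder.encode(sentences, normalize_embeddings=True)
    sims = sent_emb @ centroid  # cosine similarity, since embeddings are normalized
    keep = sorted(np.argsort(-sims)[:top_n])
    return [sentences[i] for i in keep]
```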
pdf
bib
abs
SUWMIT at BioLaySumm2025: Instruction-based Summarization with Contrastive Decoding
Priyam Basu
|
Jose Cols
|
Daniel Jarvis
|
Yongsin Park
|
Daniel Rodabaugh
In the following paper, we present our team’s approach to subtask 1.1 of the BioLaySumm 2025 shared task, which entails the automated generation of lay summaries from biomedical articles. To this end, we experiment with a variety of methods for text preprocessing, extractive summarization, model fine-tuning, and abstractive summarization. Our final results are generated on a fine-tuned Llama 3.1 Instruct (8B) model, notably achieving top scores on two out of four relevance metrics, as well as the highest overall ranking among this year’s participating teams on the plain lay summarization subtask.
pdf
bib
abs
BDA-UC3M @ BioLaySumm: Efficient Lay Summarization with Small-Scale SoTA LLMs
Ilyass Ramzi
|
Isabel Bedmar
This paper presents an efficient system for the BioLaySumm 2025 Shared Task on biomedical lay summarization. The approach leverages compact, state-of-the-art language models (4–7 billion parameters), including Gemma3 4B, Qwen3 4B, and GPT-4.1-mini, optimized for relevance, readability, and factuality. Through dynamic 4-bit quantization, parameter-efficient fine-tuning, advanced extractive preprocessing, and direct preference optimization, the system achieves performance competitive with much larger baselines. Comprehensive experiments on the eLife and PLOS datasets demonstrate that small language models can deliver high-quality, accessible biomedical summaries using modest computational resources. The findings suggest that resource-efficient models can help democratize access to scientific information, supporting broader scientific communication goals.
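A minimal sketch of the 4-bit quantization plus parameter-efficient fine-tuning setup, using bitsandbytes and PEFT; the model ID, LoRA hyperparameters, and target modules are assumptions, and the extractive preprocessing and direct preference optimization stages are not shown.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "Qwen/Qwen3-4B"  # one of the small models mentioned above; the exact ID is an assumption

# Dynamic 4-bit quantization of the base model.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")

# Parameter-efficient fine-tuning: attach small LoRA adapters to the quantized base model.
model = prepare_model_for_kbit_training(model)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```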
pdf
bib
abs
KHU_LDI at BioLaySumm2025: Fine-tuning and Refinement for Lay Radiology Report Generation
Nur Alya Dania Binti Moriazi
|
Mujeen Sung
Though access to one’s own radiology reports has improved over the years, the use of complex medical terms makes understanding these reports difficult. To tackle this issue, we explored two approaches: supervised fine-tuning of open-source large language models using QLoRA, and refinement, which improves a given generated output using feedback produced by a feedback model. Although the fine-tuned model outperformed refinement on the test data, refinement performed well on the validation set, showing good potential for the generation of lay radiology reports. Our submission achieved 2nd place in the open track of Subtask 2.1 of the BioLaySumm 2025 shared task.
pdf
bib
abs
CUTN_Bio at BioLaySumm: Multi-Task Prompt Tuning with External Knowledge and Readability adaptation for Layman Summarization
Bhuvaneswari Sivagnanam
|
Rivo Krishnu C H
|
Princi Chauhan
|
Saranya Rajiakodi
In this study, we present a prompt-based layman summarization framework for biomedical articles and radiology reports, developed as part of the BioLaySumm 2025 shared task at the BioNLP Workshop, ACL 2025. For Subtask 1.1 (Plain Lay Summarization), we utilized the abstract as input and employed Meta-LLaMA-3-8B-Instruct with a Tree-of-Thought prompting strategy, obtaining 13th rank. In Subtask 1.2 (Lay Summarization with External Knowledge), we adopted an extractive-plus-prompting approach, combining LEAD-K sentence extraction with Meta-LLaMA-3-8B-Instruct. Medical concepts were identified using MedCAT, and their definitions were taken from Wikipedia to enrich the generated summaries. Our system secured the 2nd position in this subtask. For Subtask 2.1 (Radiology Report Translation), we implemented a Retrieval-Augmented Generation (RAG) approach using the Zephyr model to convert professional radiology reports into layman terms, achieving 3rd place in the shared task.
pdf
bib
abs
Team XSZ at BioLaySumm2025: Section-Wise Summarization, Retrieval-Augmented LLM, and Reinforcement Learning Fine-Tuning for Lay Summaries
Pengcheng Xu
|
Sicheng Shen
|
Jieli Zhou
|
Hongyi Xin
We propose a unified, multi-stage lay summarization pipeline for BioLaySumm 2025 (Subtask 1.1) that (1) selects and summarizes key article sections via BioBART, (2) retrieves K-shot demonstrations using BGE embeddings for in-context Llama 3 8B prompting, (3) applies LoRA adapters to Llama 3 8B for supervised fine-tuning, (4) merges section summaries with a second BioBART pass, and (5) refines outputs through reinforcement learning (PPO & GRPO) using a composite reward of factuality (AlignScore, SummaC), relevance (ROUGE-L, BERTScore), and readability (LENS, FKGL, DCRS, CLI). On the PLOS and eLife validation sets, our complete system reduces DCRS from 9.23 to 8.56 and CLI from 12.98 to 12.65, ranking 3rd in readability, and improves AlignScore from 0.722 to 0.862 over the Llama 3 fine-tuning baseline, ranking 5th in factuality, demonstrating balanced gains across readability, relevance, and factuality.
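A composite reward of the kind described in step (5) can be sketched as below. This is an illustrative reconstruction: it covers only a relevance term (ROUGE-L via the rouge_score package) and a readability term (Flesch-Kincaid grade via textstat), the factuality components (AlignScore, SummaC) are omitted, and the weights are assumptions.

```python
import textstat
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def composite_reward(summary: str, reference: str) -> float:
    """Illustrative composite reward: relevance (ROUGE-L against the reference lay
    summary) plus readability (lower Flesch-Kincaid grade is better). Factuality
    terms such as AlignScore or SummaC would be added as further weighted components."""
    relevance = _rouge.score(reference, summary)["rougeL"].fmeasure  # 0..1
    grade = textstat.flesch_kincaid_grade(summary)                    # lower means easier to read
    readability = max(0.0, 1.0 - grade / 20.0)                        # squash to roughly 0..1
    return 0.6 * relevance + 0.4 * readability                        # weights are illustrative
```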
pdf
bib
abs
VeReaFine: Iterative Verification Reasoning Refinement RAG for Hallucination-Resistant on Open-Ended Clinical QA
Pakawat Phasook
|
Rapepong Pitijaroonpong
|
Jiramet Kinchagawat
|
Amrest Chinkamol
|
Tossaporn Saengja
|
Kiartnarin Udomlapsakul
|
Jitkapat Sawatphol
|
Piyalitt Ittichaiwong
We present VeReaFine, a novel “Verifier-RAG” pipeline designed to eliminate hallucinations in open-ended clinical question answering. VeReaFine interleaves three tightly coupled stages—retrieval, verification, and generation—across up to three iterations. First, a two-stage dense retriever (BM-Retriever-410M → BM-Reranker-2B) fetches and ranks top-k biomedical passages; an 8B-parameter MedReason verifier then filters these for direct relevance and identifies missing evidence. When the verifier deems the context insufficient, it formulates a focused “feedback query” to retrieve additional passages (bounded to prevent infinite loops). Once a minimal ground-truth context is assembled, a 7B-parameter generator (Qwen2.5-7B-Instruct) drafts an answer purely from that vetted context, and the verifier performs a final check—prompting the generator to refine any remaining unsupported claims. By iteratively fetching only missing facts and ensuring every assertion is evidence-backed, VeReaFine achieves monotonic factuality improvements with minimal overhead. On the BioNLP 2025 ClinIQLink “LLM Lie-Detector” shared task, our 7B generator augmented with VeReaFine matches or surpasses a 32B medical model on open-ended reasoning metrics, reducing multi-hop inverse step-identification errors by 26%. These findings demonstrate that moderate-size LLMs, when guided by targeted verification loops, can deliver expert-level reliability in clinical QA.
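The retrieve-verify-generate loop described above can be summarized structurally as below. All callables are placeholders for the retriever/reranker, the verifier, and the generator; the control flow is a simplified reading of the description, and the final answer-refinement check is omitted for brevity.

```python
from typing import Callable, List, Tuple

def verify_then_generate(
    question: str,
    retrieve: Callable[[str, int], List[str]],                  # dense retriever + reranker (placeholder)
    verify: Callable[[str, List[str]], Tuple[List[str], str]],  # returns (vetted passages, feedback query or "")
    generate: Callable[[str, List[str]], str],                  # drafts an answer only from vetted context
    max_rounds: int = 3,
    top_k: int = 8,
) -> str:
    """Structural sketch of an iterative verify-then-generate loop: the verifier filters
    retrieved passages and, if evidence is missing, issues a focused feedback query;
    the loop is bounded to prevent endless retrieval."""
    context: List[str] = []
    query = question
    for _ in range(max_rounds):
        passages = retrieve(query, top_k)
        context, feedback_query = verify(question, context + passages)
        if not feedback_query:  # verifier judges the assembled evidence sufficient
            break
        query = feedback_query
    return generate(question, context)
```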