LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction

Aishik Nagar, Viktor Schlegel, Thanh-Tung Nguyen, Hao Li, Yuping Wu, Kuluhan Binici, Stefan Winkler


Abstract
Large Language Models (LLMs) are increasingly adopted for applications in healthcare, reaching the performance of domain experts on tasks such as question answering and document summarisation. Despite their success on these tasks, it is unclear how well LLMs perform on tasks that are traditionally pursued in the biomedical domain, such as structured information extraction. To bridge this gap, in this paper, we systematically benchmark LLM performance in Medical Classification and Named Entity Recognition (NER) tasks. We aim to disentangle the contribution of different factors to the performance, particularly the impact of LLMs’ task knowledge and reasoning capabilities, their (parametric) domain knowledge, and the addition of external knowledge. To this end, we evaluate various open LLMs—including BioMistral and Llama-2 models—on a diverse set of biomedical datasets, using standard prompting, Chain-of-Thought (CoT) and Self-Consistency-based reasoning, as well as Retrieval-Augmented Generation (RAG) with PubMed and Wikipedia corpora. Counter-intuitively, our results reveal that standard prompting consistently outperforms more complex techniques across both tasks, laying bare the limitations in the current application of CoT, self-consistency, and RAG in the biomedical domain. Our findings suggest that advanced prompting methods developed for knowledge- or reasoning-intensive tasks, such as CoT or RAG, are not easily portable to biomedical tasks where precise structured outputs are required. This highlights the need for more effective integration of external knowledge and reasoning mechanisms in LLMs to enhance their performance in real-world biomedical applications.
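To make the comparison concrete, the two prompting regimes contrasted in the abstract can be sketched as below. This is a hypothetical illustration, not the paper's exact templates: the prompt wording, entity types, and the `build_*` helper names are assumptions; zero-shot CoT is shown in the common form of appending a reasoning trigger to an otherwise identical instruction.

```python
def build_standard_prompt(text: str, entity_types: list[str]) -> str:
    """Standard (direct) prompting: ask for the structured output immediately."""
    types = ", ".join(entity_types)
    return (
        f"Extract all entities of type [{types}] from the text below.\n"
        f"Text: {text}\n"
        'Answer as a JSON list of {"entity": ..., "type": ...} objects:'
    )


def build_cot_prompt(text: str, entity_types: list[str]) -> str:
    """Zero-shot Chain-of-Thought: same instruction plus a reasoning trigger."""
    return build_standard_prompt(text, entity_types) + "\nLet's think step by step."


# Example: a biomedical NER query in both styles.
sample = "Aspirin was administered to reduce fever."
standard = build_standard_prompt(sample, ["Drug", "Symptom"])
cot = build_cot_prompt(sample, ["Drug", "Symptom"])
```

The paper's finding is that, for extraction tasks with precise structured outputs, the direct variant tends to outperform the CoT variant.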
Anthology ID:
2025.insights-1.11
Volume:
The Sixth Workshop on Insights from Negative Results in NLP
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Aleksandr Drozd, João Sedoc, Shabnam Tafreshi, Arjun Akula, Raphael Shu
Venues:
insights | WS
Publisher:
Association for Computational Linguistics
Pages:
106–120
URL:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.insights-1.11/
Cite (ACL):
Aishik Nagar, Viktor Schlegel, Thanh-Tung Nguyen, Hao Li, Yuping Wu, Kuluhan Binici, and Stefan Winkler. 2025. LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction. In The Sixth Workshop on Insights from Negative Results in NLP, pages 106–120, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction (Nagar et al., insights 2025)
PDF:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.insights-1.11.pdf