Can LLMs Reason Like Doctors? Exploring the Limits of Large Language Models in Complex Medical Reasoning

Flavio Merenda, Jose Manuel Gomez-Perez, German Rigau


Abstract
Large language models (LLMs) have shown remarkable progress in reasoning across multiple domains. However, it remains unclear whether their abilities reflect genuine reasoning or sophisticated pattern matching, a distinction critical in medical decision-making, where reliable multi-step problem-solving is required. Accordingly, we conduct one of the largest evaluations to date, assessing 77 LLMs with diverse fine-tuning approaches, ranging from 1 billion parameters to frontier models. Guided by medical problem-solving theory, we select three medical question answering (QA) benchmarks targeting key reasoning skills: reasoning processes, susceptibility to cognitive biases, and metacognitive abilities. Additionally, we manually annotate a subset of questions to assess the abduction, deduction, and induction capabilities of LLMs, offering detailed insight into the reasoning mechanisms followed by physicians, an aspect that has received relatively limited attention in this domain. Most models, particularly smaller ones, struggle even with specialized fine-tuning or advanced prompting. Larger models perform better but still show clear limitations in complex medical reasoning. Our findings highlight the need to improve specific reasoning strategies to better reflect medical decision-making. The datasets and code used in this study are publicly available at: https://github.com/expertailab/Can-LLMs-Reason-Like-Doctors
Anthology ID:
2026.findings-eacl.127
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2432–2452
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.127/
Cite (ACL):
Flavio Merenda, Jose Manuel Gomez-Perez, and German Rigau. 2026. Can LLMs Reason Like Doctors? Exploring the Limits of Large Language Models in Complex Medical Reasoning. In Findings of the Association for Computational Linguistics: EACL 2026, pages 2432–2452, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Can LLMs Reason Like Doctors? Exploring the Limits of Large Language Models in Complex Medical Reasoning (Merenda et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.127.pdf
Checklist:
2026.findings-eacl.127.checklist.pdf