Can LLMs Reason Like Doctors? Exploring the Limits of Large Language Models in Complex Medical Reasoning
Flavio Merenda | Jose Manuel Gomez-Perez | German Rigau
Findings of the Association for Computational Linguistics: EACL 2026
Large language models (LLMs) have shown remarkable progress in reasoning across multiple domains. However, it remains unclear whether their abilities reflect genuine reasoning or sophisticated pattern matching, a distinction that is critical in medical decision-making, where reliable multi-step problem-solving is required. Accordingly, we conduct one of the largest evaluations to date, assessing 77 LLMs with diverse fine-tuning approaches, ranging from 1-billion-parameter models to frontier models. Guided by medical problem-solving theory, we select three medical question answering (QA) benchmarks targeting key reasoning skills: reasoning processes, susceptibility to cognitive biases, and metacognitive abilities. Additionally, we manually annotate a subset of questions to assess the abductive, deductive, and inductive capabilities of LLMs, offering detailed insight into the reasoning mechanisms physicians follow, an aspect that has received relatively little attention in this domain. Most models, particularly smaller ones, struggle even with specialized fine-tuning or advanced prompting. Larger models perform better but still show clear limitations in complex medical reasoning. Our findings highlight the need to improve specific reasoning strategies so that they better reflect medical decision-making. The datasets and code used in this study are publicly available at: https://github.com/expertailab/Can-LLMs-Reason-Like-Doctors