DeVisE: Towards the Behavioral Testing of Medical Large Language Models

Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto


Abstract
Large language models (LLMs) are increasingly applied in clinical decision support, yet current evaluations rarely reveal whether their outputs reflect genuine medical reasoning or superficial correlations. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework that probes fine-grained clinical understanding through controlled counterfactuals. Using intensive care unit (ICU) discharge notes from MIMIC-IV, we construct both raw (real-world) and template-based (synthetic) variants with single-variable perturbations in demographic (age, gender, ethnicity) and vital sign attributes. We evaluate eight LLMs, spanning general-purpose and medical variants, in a zero-shot setting. Model behavior is analyzed through (1) input-level sensitivity, capturing how counterfactuals alter perplexity, and (2) downstream reasoning, measuring their effect on predicted ICU length-of-stay and mortality. Overall, our results show that standard task metrics obscure clinically relevant differences in model behavior, with models differing substantially in how consistently and proportionally they adjust predictions to counterfactual perturbations.
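To make the idea of single-variable counterfactuals concrete, here is a minimal Python sketch. It is not the authors' implementation: the template wording, the attribute values, and the helper names (`single_variable_counterfactuals`, `perplexity`) are assumptions made up for illustration, and the abstract does not specify how the paper computes perplexity. The sketch fills a template, varies exactly one attribute at a time, and shows the standard perplexity formula, exp of the negative mean token log-probability, that an input-level sensitivity analysis could compare across variants.

```python
import math
from typing import Dict, Iterator, List, Tuple

# Hypothetical template; the paper's real templates are not shown in the abstract.
TEMPLATE = (
    "Patient is a {age}-year-old {ethnicity} {gender} "
    "with heart rate {heart_rate} bpm."
)


def single_variable_counterfactuals(
    base: Dict[str, str], alternatives: Dict[str, List[str]]
) -> Iterator[Tuple[str, str, str]]:
    """Yield (attribute, new_value, text), changing exactly one attribute at a time."""
    for attr, values in alternatives.items():
        for value in values:
            if value == base[attr]:
                continue  # skip the factual value, it is not a counterfactual
            variant = {**base, attr: value}  # all other attributes stay fixed
            yield attr, value, TEMPLATE.format(**variant)


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity from per-token natural-log probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Illustrative values only; none of these come from MIMIC-IV or the paper.
base = {"age": "65", "gender": "woman", "ethnicity": "Hispanic", "heart_rate": "80"}
alternatives = {
    "age": ["25", "65", "85"],
    "gender": ["man", "woman"],
    "heart_rate": ["80", "140"],
}

factual_text = TEMPLATE.format(**base)
variants = list(single_variable_counterfactuals(base, alternatives))
for attr, value, text in variants:
    print(f"{attr} -> {value}: {text}")
```

In practice the token log-probabilities passed to `perplexity` would come from the model under test, scored once on the factual note and once on each variant, so that any change in perplexity can be attributed to the single perturbed attribute.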
Anthology ID:
2026.findings-eacl.338
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6427–6441
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.338/
Cite (ACL):
Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, and Iacer Calixto. 2026. DeVisE: Towards the Behavioral Testing of Medical Large Language Models. In Findings of the Association for Computational Linguistics: EACL 2026, pages 6427–6441, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
DeVisE: Towards the Behavioral Testing of Medical Large Language Models (Tagliabue et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.338.pdf
Checklist:
 2026.findings-eacl.338.checklist.pdf