Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

Qianqi Yan, Xuehai He, Xiang Yue, Xin Eric Wang


Abstract
Large Multimodal Models (LMMs) have demonstrated impressive performance on existing medical Visual Question Answering (Med-VQA) benchmarks. However, high reported accuracy does not necessarily reflect true diagnostic reliability in clinical settings. This study reveals that state-of-the-art models perform worse than random guessing on medical diagnosis questions when subjected to our simple Probing Evaluation for Medical Diagnosis (ProbMed). ProbMed challenges models along two axes: probing evaluation and procedural diagnosis. Specifically, probing evaluation pairs ground-truth questions with adversarial counterparts featuring negated and hallucinated attributes, while procedural diagnosis requires reasoning across multiple dimensions for each image, including modality recognition, organ identification, clinical findings, abnormalities, and positional grounding. Our evaluation shows that even top-performing models such as GPT-4o, GPT-4V, and Gemini Pro perform worse than random guessing on specialized diagnostic questions, indicating significant limitations in handling fine-grained medical inquiries. Furthermore, an ablation study on open-source models (e.g., LLaVA, LLaVA-Med, and Med-Flamingo) identifies poor visual understanding as a primary bottleneck, a limitation that can be partially mitigated by incorporating visual descriptions generated by GPT-4o, yielding an average performance improvement of 9.44%. These findings underscore the urgent need for more robust evaluation methods and domain-specific expertise to ensure the reliability of LMMs in high-stakes medical applications.
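As an illustration of the paired probing protocol described in the abstract, here is a minimal Python sketch (not the authors' released code). It assumes a scoring rule that credits a model only when it answers both the ground-truth question and its adversarial counterpart correctly; ProbePair, score_probe_pairs, and ask_model are hypothetical names introduced for this sketch.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ProbePair:
    image_id: str
    ground_truth_q: str   # attribute actually present in the image (expected answer: yes)
    adversarial_q: str    # negated/hallucinated attribute (expected answer: no)

def score_probe_pairs(pairs: list[ProbePair],
                      ask_model: Callable[[str, str], str]) -> float:
    """Credit a pair only if the model answers BOTH questions correctly.

    ask_model(image_id, question) -> "yes" | "no" (hypothetical interface).
    """
    correct = 0
    for p in pairs:
        gt_ok = ask_model(p.image_id, p.ground_truth_q).strip().lower() == "yes"
        adv_ok = ask_model(p.image_id, p.adversarial_q).strip().lower() == "no"
        if gt_ok and adv_ok:
            correct += 1
    return correct / len(pairs) if pairs else 0.0

# A model that always answers "yes" scores 0.0 under this pairing scheme,
# even though it would reach ~50% accuracy on unpaired yes/no questions,
# which is why paired probing can expose "worse than random" behavior.
always_yes = lambda image_id, question: "yes"
demo = [ProbePair("img1", "Is this a chest X-ray?", "Is this an abdominal CT scan?")]
print(score_probe_pairs(demo, always_yes))  # 0.0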
Anthology ID:
2025.findings-acl.981
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues:
Findings | WS
Publisher:
Association for Computational Linguistics
Pages:
19188–19205
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.981/
Cite (ACL):
Qianqi Yan, Xuehai He, Xiang Yue, and Xin Eric Wang. 2025. Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19188–19205, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA (Yan et al., Findings 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.981.pdf