Alireza Mohamadian


2026

Vision-language models (VLMs) provide visual information alongside their predictions, but it remains unclear whether consistency in such information implies consistent decisions. We study this question in a controlled medical-imaging setting using brain MRI with pathology-confirmed labels and expert lesion annotations. For each human subject and modality, we construct configurations that retain the lesion content while varying the surrounding context and scale, and we measure decision flips together with the consistency of the slices each model reports as influential. Across four diverse VLMs (including proprietary, open-source, and domain-specific models), flip rates are as high as 75% across lesion-containing presentations, often despite high overlap in reported evidence. When lesion-related content is removed, proprietary models rarely produce a categorical diagnosis, with abstention rates ranging from 63% to 99%. These results reveal a mismatch between reported evidence and decisions, motivating evaluation beyond accuracy. Our evaluation dataset is publicly available on Hugging Face.
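
The abstract does not give formulas for the flip rate or for the overlap in reported slices. As a minimal sketch of one plausible reading, the code below treats the flip rate as the fraction of configuration pairs (for a single subject and modality) whose decisions differ, and the evidence overlap as the mean pairwise Jaccard index of the reported slice sets. The function names, the pairwise definitions, and the example labels are all assumptions made for illustration, not the paper's stated method.

```python
from itertools import combinations


def flip_rate(decisions):
    """Fraction of configuration pairs whose decisions differ (assumed definition)."""
    pairs = list(combinations(decisions, 2))
    if not pairs:
        return 0.0  # fewer than two configurations: no flip is possible
    return sum(a != b for a, b in pairs) / len(pairs)


def jaccard(a, b):
    """Jaccard index of two collections of slice indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty reports are treated as identical
    return len(a & b) / len(a | b)


def mean_slice_overlap(slice_sets):
    """Mean pairwise Jaccard overlap of reported influential slices (assumed definition)."""
    pairs = list(combinations(slice_sets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical outputs for one subject and modality under three configurations.
decisions = ["class_a", "class_a", "class_b"]
reported_slices = [{10, 11, 12}, {10, 11, 12}, {10, 11, 13}]

print(flip_rate(decisions))
print(mean_slice_overlap(reported_slices))
```

Under these assumed definitions, a high flip rate combined with a high mean overlap would correspond to the mismatch described above: the model points to largely the same slices while changing its decision.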