Bohan Lei
2026
Act as you think: Reinforcing Consistent Reasoning in Medical Visual Question Answering
Songtao Jiang | Yuan Wang | Ruizhe Chen | Yan Zhang | Ruilin Luo | Bohan Lei | Yeying Jin | Sibo Song | ZhiBo Yang | Jimeng Sun | Jian Wu | Zuozhu Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While reinforcement learning from verifiable rewards (RLVR) has proven highly effective for enhancing reasoning, its application to medical visual question answering (Med-VQA) is hampered by models producing reasoning that is inconsistent with either the visual evidence or the final answer. Our analysis reveals a critical flaw in RLVR training: it paradoxically encourages models to disregard visual evidence and generate answers that contradict their own reasoning. This degradation is most pronounced in specialized medical modalities (e.g., Fundus, Ultrasound) where base VLMs lack robust understanding, a failure we attribute to a flawed reward mechanism exacerbated by the scarcity of diverse training data. To tackle this, we introduce Med-Zero-17K, a large-scale dataset spanning over 30 modalities and 24 clinically relevant tasks, and the Multi-Consistency Reward (MCR) framework, which explicitly rewards both perceptual grounding and logical coherence. Extensive experiments validate our approach: integrating MCR into the RLVR framework delivers robust performance gains. This success stems from our crucial finding that rewarding internal consistency is significantly more effective than attempting to judge reasoning correctness. Furthermore, MCR proves highly versatile: it generalizes across diverse VLM backbones, is compatible with RL algorithms such as GRPO and DPO, and extends to 3D VQA tasks and R1-style training paradigms. Code and dataset will be released.