Fen Miao


2026

Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal clinical applications such as medical visual question answering (VQA) and report generation. However, Med-LVLMs remain prone to hallucinations caused by modality misalignment, in which a model prioritizes textual knowledge over visual evidence and generates outputs that conflict with the medical image. To mitigate this issue, recent studies have explored preference optimization to improve image–text alignment, with promising results. Despite these advances, existing preference-based methods still face two limitations in medical settings: (1) overfitting to superficial cues, and (2) pseudo-convergence of the preference signal. In this paper, we propose Dynamic Evidence-Guided Preference Optimization (DEPO), a framework that enables evidence-aware and adaptive preference learning for Med-LVLMs. DEPO introduces Multi-Modal Evidence Perturbation (MEP) to suppress non-causal textual and visual shortcuts, and Dispreferred Evidence Resampling (DER) to continuously update dispreferred responses as hallucination patterns evolve. Experiments on multiple medical VQA and report generation benchmarks demonstrate consistent improvements over existing methods, with strong robustness across datasets and architectures. All code and data will be released after review.