DICA: Dual-Indicator Guided Contrastive Alignment in Multimodal Large Language Models

Hao Yang; Jin Wang; Xuejie Zhang

Abstract

Human visual reasoning typically follows a coarse-to-fine attention process, starting from global scene understanding and gradually focusing on question-relevant regions. However, multimodal large language models may deviate from this pattern due to attention drift and the underutilization of visual evidence, which can lead to hallucinations. To mitigate these issues, this study proposes Dual-Indicator Guided Contrastive Alignment (DICA), a method that tracks two information-theoretic indicators during inference: Visual Attention Entropy (VAE), which reflects the concentration of visual attention, and Output Image Correlation (OIC), which measures the dependence of generated outputs on the visual input. An abnormal increase in VAE and a decrease in OIC correspond to different failure modes, each of which triggers targeted contrastive alignment to restore visual grounding. Experimental results across multiple benchmarks demonstrate that DICA consistently outperforms existing approaches and substantially reduces hallucinations, highlighting the effectiveness of indicator-driven intervention in improving multimodal inference reliability. The code is publicly available at https://github.com/BGWH123/DICA/.
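
To make the entropy indicator concrete, the following is a minimal sketch of how the concentration of attention over visual tokens could be scored. The abstract does not give the exact definition of VAE, so this assumes plain Shannon entropy over normalized attention weights; the function name `visual_attention_entropy` and the example weights are hypothetical and are not taken from the DICA codebase. OIC is not sketched because the abstract does not specify how it is computed.

```python
import math


def visual_attention_entropy(attn_weights):
    """Shannon entropy (in nats) of attention mass over visual tokens.

    Assumption: the weights are non-negative scores, one per visual token.
    They are renormalized here so that they sum to 1.
    """
    if any(w < 0 for w in attn_weights):
        raise ValueError("attention weights must be non-negative")
    total = sum(attn_weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    entropy = 0.0
    for w in attn_weights:
        p = w / total
        if p > 0:  # treat 0 * log(0) as 0
            entropy -= p * math.log(p)
    return entropy


# Attention spread evenly over four visual tokens: maximal entropy, log(4).
uniform = visual_attention_entropy([0.25, 0.25, 0.25, 0.25])

# Attention concentrated on one visual token: entropy close to zero.
peaked = visual_attention_entropy([0.97, 0.01, 0.01, 0.01])
```

Under this assumption, a higher value means attention is diffused across the image and a lower value means it is focused on a few regions, which matches the abstract's description of an abnormal increase in VAE as a warning sign.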

Anthology ID: 2026.findings-acl.1933
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 38797–38818
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1933/
Cite (ACL): Hao Yang, Jin Wang, and Xuejie Zhang. 2026. DICA: Dual-Indicator Guided Contrastive Alignment in Multimodal Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 38797–38818, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): DICA: Dual-Indicator Guided Contrastive Alignment in Multimodal Large Language Models (Yang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1933.pdf
Checklist: 2026.findings-acl.1933.checklist.pdf
