Looking Beyond Text: Reducing Language Bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance
Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Baobao Chang, Minjia Zhang
Abstract
Large vision-language models (LVLMs) have achieved impressive results in vision-language tasks. However, LVLMs suffer from language bias, over-relying on text inputs while neglecting visual information. Therefore, we propose LACING, designed to address such bias with a MuLtimodal DuAl-attention meChanIsm (MDA) aNd Soft-Image Guidance (SIG). Specifically, MDA adopts a parallel dual-attention mechanism that constructs separate attention for visual and text inputs to enhance the integration of visual inputs across the model. SIG uses a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs during inference. Experiments across different model architectures and scales demonstrate that LACING effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without additional resources.
- Anthology ID:
- 2025.emnlp-main.995
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 19677–19701
- URL:
- https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-main.995/
- DOI:
- 10.18653/v1/2025.emnlp-main.995
- Cite (ACL):
- Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Baobao Chang, and Minjia Zhang. 2025. Looking Beyond Text: Reducing Language Bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19677–19701, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Looking Beyond Text: Reducing Language Bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance (Zhao et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-main.995.pdf
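The abstract describes MDA as a parallel dual-attention mechanism with separate attention over visual and text inputs. The following is a minimal NumPy sketch of that idea, not the paper's implementation: queries attend to the visual tokens and the text tokens through two separately normalized attention maps whose outputs are merged; the function names, the boolean-mask interface, and the equal-weight average are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention(q, k, v, is_visual):
    """Hypothetical sketch of parallel dual attention.

    q: (n_q, d) queries; k, v: (n_kv, d) keys/values over the mixed
    visual+text sequence; is_visual: (n_kv,) boolean mask marking
    visual tokens. Each modality gets its own softmax-normalized
    attention map, so text tokens cannot absorb probability mass
    from visual tokens (and vice versa). The equal-weight merge at
    the end is an assumption, not the paper's formulation.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (n_q, n_kv) shared scores
    out = np.zeros_like(q)
    for mask in (is_visual, ~is_visual):    # visual stream, text stream
        if mask.any():
            attn = softmax(scores[:, mask], axis=-1)  # per-stream softmax
            out += attn @ v[mask]
    return out / 2.0                        # average the two streams
```

A single-softmax baseline would let text tokens dominate the attention distribution when their scores are large; normalizing each modality separately is one simple way to guarantee the visual tokens always receive a full unit of attention mass, which matches the abstract's stated goal of enhancing the integration of visual inputs.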