When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models

Cheng Wang, Gelei Deng, Xianglin Yang, Han Qiu, Tianwei Zhang


Abstract
Large Audio-Language Models (LALMs) extend language models with the ability to perceive audio, demonstrating impressive capabilities in processing combined audio and text signals. However, their reliability when faced with conflicting inputs across modalities remains largely unexplored. This study examines how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, often disregarding audio evidence. This tendency leads to substantial performance degradation on audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the factors that influence text bias, explore mitigation strategies through supervised fine-tuning, and analyze model confidence patterns, which reveal persistent overconfidence even when the inputs contradict each other. These findings underscore the need for improved modality balancing during training and more sophisticated fusion mechanisms to enhance robustness when handling conflicting multi-modal inputs. The project is available at https://github.com/WangCheng0116/MCR-BENCH.
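For readers who want to probe this behavior in their own models, the following is a minimal sketch (not taken from the paper's codebase) of how one might measure text preference on conflicting audio-text pairs. The `lalm_answer` wrapper and the data fields are hypothetical placeholders; substitute the inference call of whichever LALM you are testing.

```python
# Minimal sketch: measuring text bias on conflicting audio-text pairs.
# Assumes a hypothetical `lalm_answer(audio_path, prompt)` wrapper around
# the LALM under test; the item fields are illustrative placeholders.

from dataclasses import dataclass

@dataclass
class ConflictItem:
    audio_path: str   # audio evidence (e.g., a clip of a dog barking)
    text_claim: str   # contradictory textual claim injected into the prompt
    audio_label: str  # answer supported by the audio
    text_label: str   # answer supported by the (incorrect) text

def lalm_answer(audio_path: str, prompt: str) -> str:
    """Placeholder for a real LALM inference call."""
    raise NotImplementedError

def text_preference_rate(items: list[ConflictItem]) -> float:
    """Fraction of conflicting items on which the model sides with the text."""
    follows_text = 0
    judged = 0
    for item in items:
        prompt = (
            f"{item.text_claim}\n"
            "Based on the audio, what sound is present? Answer in one word."
        )
        answer = lalm_answer(item.audio_path, prompt).lower()
        if item.text_label.lower() in answer:
            follows_text += 1
            judged += 1
        elif item.audio_label.lower() in answer:
            judged += 1
        # Answers matching neither label are excluded from the rate.
    return follows_text / judged if judged else 0.0
```

A high text-preference rate under this kind of probe would indicate the text bias the paper describes; the paper's own benchmark (MCR-BENCH, linked above) defines the conflicting pairs and metrics in full.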
Anthology ID:
2025.emnlp-main.246
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rosé, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4878–4888
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.246/
Cite (ACL):
Cheng Wang, Gelei Deng, Xianglin Yang, Han Qiu, and Tianwei Zhang. 2025. When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4878–4888, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models (Wang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.246.pdf
Checklist:
 2025.emnlp-main.246.checklist.pdf