When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models
Cheng Wang, Gelei Deng, Xianglin Yang, Han Qiu, Tianwei Zhang
Abstract
Large Audio-Language Models (LALMs) are augmented with the ability to perceive audio, demonstrating impressive capabilities in processing combined audio and text signals. However, their reliability when faced with conflicting inputs across modalities remains largely unexplored. This study examines how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, often disregarding audio evidence. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the influencing factors of text bias, explore mitigation strategies through supervised fine-tuning, and analyze model confidence patterns that reveal persistent overconfidence even with contradictory inputs. These findings underscore the need for improved modality balancing during training and more sophisticated fusion mechanisms to enhance robustness when handling conflicting multi-modal inputs. The project is available at https://github.com/WangCheng0116/MCR-BENCH.
- Anthology ID:
- 2025.emnlp-main.246
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 4878–4888
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.246/
- Cite (ACL):
- Cheng Wang, Gelei Deng, Xianglin Yang, Han Qiu, and Tianwei Zhang. 2025. When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4878–4888, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models (Wang et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.246.pdf