Xianglin Yang


2025

When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models
Cheng Wang | Gelei Deng | Xianglin Yang | Han Qiu | Tianwei Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Audio-Language Models (LALMs) extend language models with the ability to perceive audio, demonstrating impressive capabilities in processing combined audio and text signals. However, their reliability when faced with conflicting inputs across modalities remains largely unexplored. This study examines how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, often disregarding audio evidence. This tendency leads to substantial performance degradation on audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the factors that influence text bias, explore mitigation strategies through supervised fine-tuning, and analyze model confidence patterns, which reveal persistent overconfidence even with contradictory inputs. These findings underscore the need for improved modality balancing during training and more sophisticated fusion mechanisms to enhance robustness when handling conflicting multi-modal inputs. The project is available at https://github.com/WangCheng0116/MCR-BENCH.