Probing Audio-Visual Reasoning in Multimodal Language Models through the Lens of Audio

Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, Xiangyu Yue


Abstract
Recent multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5/2.5 Pro, and Reka Core, have advanced audio-visual reasoning, achieving strong performance on tasks such as cross-modal understanding and generation. However, our DeafTest uncovers unexpected failures: most state-of-the-art MLLMs struggle with very simple audio tasks, such as judging which sound is louder or counting sounds. This raises a fundamental question: does a deficiency in low-level audio perception constrain higher-level audio-visual reasoning? To address this, we introduce AV-Odyssey Bench, a comprehensive benchmark of 4,555 meticulously designed problems that integrate text, audio, and visual modalities. Each task requires models to reason across modalities, leveraging synchronized audio-visual cues to infer solutions. By structuring questions as multiple-choice, we ensure objective, reproducible evaluation without relying on subjective human or LLM-based judgments. Through comprehensive benchmarking of closed-source and open-source models, we show that (i) current MLLMs lack robust audio-visual integration ability and (ii) performance on DeafTest strongly correlates with AV-Odyssey accuracy (Pearson’s r = 0.945). These findings challenge assumptions about models’ multimodal proficiency and highlight fundamental audio perception as a bottleneck for reasoning. We believe that our results provide concrete guidance for future dataset design, alignment strategies, and architectures.
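The abstract relies on two quantities: multiple-choice accuracy and Pearson's r between per-model scores on two benchmarks. The following is a minimal sketch of how such quantities could be computed, not the authors' evaluation code. The function names and the per-model scores in the demo are hypothetical placeholders, not results from the paper.

```python
import math


def multiple_choice_accuracy(predictions, answers):
    """Fraction of questions where the predicted option matches the answer key."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-model accuracies (one entry per model); placeholders only.
    low_level_audio_scores = [0.35, 0.42, 0.51, 0.60]
    audio_visual_scores = [0.26, 0.29, 0.33, 0.38]
    print(f"Pearson r = {pearson_r(low_level_audio_scores, audio_visual_scores):.3f}")
```

Because accuracy is an exact match against a fixed answer key, the score needs no human or LLM judge, which is the reproducibility property the abstract attributes to the multiple-choice format.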
Anthology ID:
2026.acl-long.1697
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
36603–36645
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1697/
Cite (ACL):
Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, and Xiangyu Yue. 2026. Probing Audio-Visual Reasoning in Multimodal Language Models through the Lens of Audio. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 36603–36645, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Probing Audio-Visual Reasoning in Multimodal Language Models through the Lens of Audio (Gong et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1697.pdf
Checklist:
 2026.acl-long.1697.checklist.pdf