Visual Interference in Speech Evaluation: Cultural Asymmetry and Cross-Modal Bias in MLLMs
Kyusik Kim, Hyunwoo Yoo, Jaehoon Choi, Gail Rosen, Bongwon Suh
Abstract
The transition to end-to-end Multimodal Large Language Models (MLLMs) has positioned these architectures as active social evaluators in high-stakes domains. However, it remains unclear whether these models maintain objective auditory perception or succumb to the "Hearing with Eyes" phenomenon, where visual racial cues distort linguistic proficiency evaluations. We investigate this cross-modal bias by constructing a controlled counterfactual dataset utilizing a Visual Matched-Guise Paradigm. By pairing identical native audio with diverse visual personas across English and Korean contexts, we reveal a distinct Cultural Asymmetry in model behavior. In Anglophone settings, most closed models exhibit Reverse Linguistic Stereotyping, hallucinating non-native accents for Asian speakers despite standard native audio. Conversely, in Korean settings, the same models assign baseline-relative competence premiums across all visual personas, with the largest gains for out-group (White/Black) speakers, consistent with Expectancy Violation Theory. Our findings demonstrate that MLLMs do not merely process sensory inputs but actively reproduce context-dependent sociolinguistic ideologies.
- Anthology ID:
- 2026.findings-acl.1362
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 27332–27358
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1362/
- Cite (ACL):
- Kyusik Kim, Hyunwoo Yoo, Jaehoon Choi, Gail Rosen, and Bongwon Suh. 2026. Visual Interference in Speech Evaluation: Cultural Asymmetry and Cross-Modal Bias in MLLMs. In Findings of the Association for Computational Linguistics: ACL 2026, pages 27332–27358, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Visual Interference in Speech Evaluation: Cultural Asymmetry and Cross-Modal Bias in MLLMs (Kim et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1362.pdf
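
The counterfactual design summarized in the abstract (identical audio paired with different visual personas, with ratings compared against a baseline) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the `audio_only` baseline condition, and the score layout are all assumptions made for illustration, and the abstract does not specify how the baseline is defined.

```python
from itertools import product
from statistics import mean


def build_trials(audio_ids, personas, baseline="audio_only"):
    """Pair every audio clip with every visual persona plus a baseline condition.

    The "audio_only" baseline is a hypothetical choice, not taken from the paper.
    """
    conditions = [baseline, *personas]
    return [{"audio": a, "condition": c} for a, c in product(audio_ids, conditions)]


def baseline_relative_shift(scores, baseline="audio_only"):
    """Mean rating shift of each persona relative to the baseline for the same clip.

    scores maps (audio_id, condition) -> numeric rating returned by a model.
    Clips without a baseline rating are skipped.
    """
    shifts = {}
    for (audio, cond), value in scores.items():
        if cond == baseline or (audio, baseline) not in scores:
            continue
        shifts.setdefault(cond, []).append(value - scores[(audio, baseline)])
    return {cond: mean(diffs) for cond, diffs in shifts.items()}
```

Because the audio is held fixed within each clip, any nonzero shift in this sketch is attributable to the visual condition alone, which is the point of a matched-guise comparison.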