Visual Interference in Speech Evaluation: Cultural Asymmetry and Cross-Modal Bias in MLLMs

Kyusik Kim, Hyunwoo Yoo, Jaehoon Choi, Gail Rosen, Bongwon Suh


Abstract
The transition to end-to-end Multimodal Large Language Models (MLLMs) has positioned these architectures as active social evaluators in high-stakes domains. However, it remains unclear whether these models maintain objective auditory perception or succumb to the "Hearing with Eyes" phenomenon, where visual racial cues distort linguistic proficiency evaluations. We investigate this cross-modal bias by constructing a controlled counterfactual dataset utilizing a Visual Matched-Guise Paradigm. By pairing identical native audio with diverse visual personas across English and Korean contexts, we reveal a distinct Cultural Asymmetry in model behavior. In Anglophone settings, most closed models exhibit Reverse Linguistic Stereotyping, hallucinating non-native accents for Asian speakers despite standard native audio. Conversely, in Korean settings, the same models assign baseline-relative competence premiums across all visual personas, with the largest gains for out-group (White/Black) speakers, consistent with Expectancy Violation Theory. Our findings demonstrate that MLLMs do not merely process sensory inputs but actively reproduce context-dependent sociolinguistic ideologies.
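The counterfactual design described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the persona labels, the `audio_only` baseline condition, the function names, and the rating format are all assumptions made for illustration. The idea is that each audio clip is paired with every visual persona, so that any difference in the model's rating relative to the baseline can only come from the visual cue.

```python
from itertools import product
from statistics import mean

# Hypothetical persona labels; "audio_only" is an assumed no-image baseline,
# not necessarily the baseline condition used in the paper.
PERSONAS = ["audio_only", "Asian", "Black", "White"]


def build_guises(audio_ids, personas=PERSONAS):
    """Pair every audio clip with every persona so only the visual cue varies."""
    return [{"audio": a, "persona": p} for a, p in product(audio_ids, personas)]


def baseline_relative_shift(ratings, baseline="audio_only"):
    """Return mean rating per persona minus the mean rating of the baseline.

    ratings: list of (persona, score) tuples, e.g. model-assigned proficiency
    scores collected for the guises above (the scoring scale is an assumption).
    """
    by_persona = {}
    for persona, score in ratings:
        by_persona.setdefault(persona, []).append(score)
    base = mean(by_persona[baseline])
    return {p: mean(s) - base for p, s in by_persona.items() if p != baseline}
```

Under these assumptions, a negative shift for a persona would correspond to a penalty of the kind the abstract calls Reverse Linguistic Stereotyping, and a positive shift to a competence premium. Any scores passed to `baseline_relative_shift` here would be placeholders, not results from the paper.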
Anthology ID:
2026.findings-acl.1362
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
27332–27358
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1362/
Cite (ACL):
Kyusik Kim, Hyunwoo Yoo, Jaehoon Choi, Gail Rosen, and Bongwon Suh. 2026. Visual Interference in Speech Evaluation: Cultural Asymmetry and Cross-Modal Bias in MLLMs. In Findings of the Association for Computational Linguistics: ACL 2026, pages 27332–27358, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Visual Interference in Speech Evaluation: Cultural Asymmetry and Cross-Modal Bias in MLLMs (Kim et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1362.pdf
Checklist:
 2026.findings-acl.1362.checklist.pdf