Listening Like Humans: Semantics-Guided Noise-Robust Multimodal Speech Recognition

Yan Fang, Jun Chen, Yian Yao, Shuxin Zhong, Min Sun, Kaishun Wu


Abstract
Overlapping noise, disfluencies, and environmental distortions often cause severe acoustic degradation, which dissolves linguistic structure and makes ASR outputs unreliable. Inspired by human speech comprehension, we reframe ASR as semantics-guided speech reconstruction. This perspective introduces three core challenges: (C1) collapse of linguistic structure under acoustic degradation, (C2) semantic ambiguity under noise, and (C3) misalignment across modalities. To address them, we propose Speech-MLM, a novel multimodal ASR framework that integrates speech, spectrogram-derived visual cues, and textual variants to enhance robustness. It consists of (i) a Cognitive Structure Extractor that recovers prosodic structure from visualized acoustic features, (ii) a Semantic Weaver that learns semantic equivalence across varied textual forms, and (iii) a Retrieval-Guided Fusion Learner that unifies the modalities within a shared semantic space. Experiments on multiple real-world noisy datasets show that Speech-MLM achieves an average 38.85% reduction in WER over advanced baselines, while also attaining 98.71% BERTScore and 96.7% USE, demonstrating substantial gains in semantic robustness and generalization across domains.
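For readers unfamiliar with the headline metric, the following is a minimal sketch of how word error rate (WER) and a WER reduction are commonly computed. It is not the authors' evaluation code. The function names `word_error_rate` and `relative_wer_reduction` are hypothetical, whitespace tokenization is a simplification, and reading the reported reduction as relative to the baseline WER is an assumption, since the abstract does not say whether the figure is relative or absolute.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: words are obtained by whitespace splitting, with no
    text normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction of the baseline WER removed by the system (assumed definition)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative strings only; not taken from the paper's datasets.
    print(word_error_rate("turn the lights off", "turn the light off"))
    print(relative_wer_reduction(0.20, 0.10))
```

Under this reading, a system that halves the baseline WER would report a 50% reduction regardless of how high the baseline error was, which is why relative figures are often reported alongside absolute WER.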
Anthology ID:
2026.acl-long.730
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
16079–16093
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.730/
Cite (ACL):
Yan Fang, Jun Chen, Yian Yao, Shuxin Zhong, Min Sun, and Kaishun Wu. 2026. Listening Like Humans: Semantics-Guided Noise-Robust Multimodal Speech Recognition. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16079–16093, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Listening Like Humans: Semantics-Guided Noise-Robust Multimodal Speech Recognition (Fang et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.730.pdf
Checklist:
2026.acl-long.730.checklist.pdf