Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses
Sungnyun Kim, Kangwook Jang, Sungwoo Cho, Joon Son Chung, Hoi-Rin Kim, Se-Young Yun
Abstract
This paper introduces a new paradigm for generative error correction (GER) in audio-visual speech recognition (AVSR) that reasons over modality-specific evidence directly in the language space. Our framework, **DualHyp**, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. To maximize the effectiveness of DualHyp, we further introduce **RelPrompt**, a noise-aware guidance mechanism that provides modality-grounded prompts to the LLM. RelPrompt conveys the temporal reliability of each modality stream, guiding the model to dynamically switch its focus between ASR and VSR hypotheses for accurate correction. Under various corruption scenarios, our framework attains up to a 57.7% error rate gain on the LRS2 benchmark over a standard ASR baseline, whereas single-stream GER approaches achieve only a 10% gain. To facilitate research within our DualHyp framework, we release the code and the dataset comprising ASR and VSR hypotheses at https://github.com/sungnyun/dualhyp.
- Anthology ID:
- 2026.findings-acl.26
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 544–564
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.26/
- Cite (ACL):
- Sungnyun Kim, Kangwook Jang, Sungwoo Cho, Joon Son Chung, Hoi-Rin Kim, and Se-Young Yun. 2026. Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses. In Findings of the Association for Computational Linguistics: ACL 2026, pages 544–564, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses (Kim et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.26.pdf
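
The abstract describes an LLM that receives independent N-best hypotheses from an ASR model and a VSR model, optionally with a signal about how reliable each modality is over time. A minimal sketch of how such an input might be assembled is shown below. The function name `build_dual_prompt`, the prompt wording, and the reliability tags (`clean`, `noisy`) are all hypothetical and chosen only for illustration; they are not the template or the RelPrompt format used by the authors, which the abstract does not specify.

```python
from typing import List, Optional, Sequence


def build_dual_prompt(
    asr_nbest: Sequence[str],
    vsr_nbest: Sequence[str],
    audio_reliability: Optional[Sequence[str]] = None,
    video_reliability: Optional[Sequence[str]] = None,
) -> str:
    """Compose one text prompt from two independent N-best lists.

    The layout and wording are illustrative assumptions, not the
    prompt template from the paper.
    """
    # List the ASR hypotheses, numbered from 1 in ranked order.
    lines: List[str] = ["ASR hypotheses:"]
    lines += [f"{i}. {hyp}" for i, hyp in enumerate(asr_nbest, start=1)]

    # List the VSR hypotheses separately so the two sources stay distinct.
    lines.append("VSR hypotheses:")
    lines += [f"{i}. {hyp}" for i, hyp in enumerate(vsr_nbest, start=1)]

    # Optional per-segment reliability tags (hypothetical format),
    # one tag per time segment of the utterance.
    if audio_reliability is not None:
        lines.append("Audio reliability over time: " + " ".join(audio_reliability))
    if video_reliability is not None:
        lines.append("Video reliability over time: " + " ".join(video_reliability))

    lines.append("Corrected transcription:")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_dual_prompt(
        asr_nbest=["the whether is nice", "the weather is nice"],
        vsr_nbest=["the weather is mice"],
        audio_reliability=["clean", "noisy", "noisy"],
        video_reliability=["clean", "clean", "clean"],
    )
    print(prompt)
```

Keeping the two hypothesis lists under separate headers, rather than merging them into one ranked list, is one simple way to let a downstream model weigh each modality on its own; whether this matches the paper's design is an assumption.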