Sungnyun Kim
2026
Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses
Sungnyun Kim | Kangwook Jang | Sungwoo Cho | Joon Son Chung | Hoi-Rin Kim | Se-Young Yun
Findings of the Association for Computational Linguistics: ACL 2026
This paper introduces a new paradigm for generative error correction (GER) in audio-visual speech recognition (AVSR) that reasons over modality-specific evidence directly in the language space. Our framework, **DualHyp**, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. To maximize the effectiveness of DualHyp, we further introduce **RelPrompt**, a noise-aware guidance mechanism that provides modality-grounded prompts to the LLM. RelPrompt conveys the temporal reliability of each modality stream, guiding the model to dynamically switch its focus between ASR and VSR hypotheses for accurate correction. Under various corruption scenarios, our framework attains an error rate gain of up to 57.7% on the LRS2 benchmark over a standard ASR baseline, in contrast to single-stream GER approaches, which achieve only a 10% gain. To facilitate research within our DualHyp framework, we release the code and the dataset comprising ASR and VSR hypotheses at https://github.com/sungnyun/dualhyp.
2024
DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models
Sungnyun Kim | Haofu Liao | Srikar Appalaraju | Peng Tang | Zhuowen Tu | Ravi Kumar Satzoda | R. Manmatha | Vijay Mahadevan | Stefano Soatto
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Visual document understanding (VDU) is a challenging task that involves understanding documents across various modalities (text and image) and layouts (forms, tables, etc.). This study aims to enhance the generalizability of small VDU models by distilling knowledge from LLMs. We find that directly prompting LLMs often fails to generate informative and useful data. In response, we present a new framework, DocKD, that enriches the data generation process by integrating external document knowledge. Specifically, we provide an LLM with various document elements, such as key-value pairs, layouts, and descriptions, to elicit open-ended answers. Our experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach that does not leverage external document knowledge. Moreover, student VDU models trained solely on DocKD-generated data are not only comparable to those trained with human-annotated data on in-domain tasks but also significantly outperform them on out-of-domain tasks.