Rui Hu
2026
VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models
Rui Hu | Delai Qiu | Yining Wang | Shengping Liu | Jitao Sang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Omni-modal large language models (OLLMs) offer a promising end-to-end solution for slide-enhanced speech recognition due to their inherent multimodal capabilities. However, we found a fundamental issue faced by OLLMs: Visual Interference, where models show a bias towards visible text over auditory signals, causing them to hallucinate slide content that was never spoken. To address this, we propose Visually-Anchored Policy Optimization (VAPO), which aims to reshape models’ inference process to follow the human-like “Look-then-Listen” inference chain. Specifically, we design a temporally decoupled policy: the model first extracts visual priors in a <think> block to serve as semantic anchors, then generates the transcription in an <answer> block. The policy is optimized via multi-objective reinforcement learning. Furthermore, we introduce SlideASR-Bench, a comprehensive benchmark designed to address the scarcity of entity-rich data, comprising a large-scale synthetic corpus for training and a challenging real-world test set for evaluation. We conduct extensive evaluations demonstrating that VAPO effectively eliminates visual interference and achieves state-of-the-art performance on SlideASR-Bench and public datasets, significantly reducing entity recognition errors in specialized domains.