VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models

Rui Hu; Delai Qiu; Yining Wang; Shengping Liu; Jitao Sang (桑基韬)

Abstract

Omni-modal large language models (OLLMs) offer a promising end-to-end solution for slide-enhanced speech recognition due to their inherent multimodal capabilities. However, we identify a fundamental issue with OLLMs, Visual Interference: models favor visible text over auditory signals, causing them to hallucinate slide content that was never spoken. To address this, we propose Visually-Anchored Policy Optimization (VAPO), which reshapes the model’s inference process to follow a human-like “Look-then-Listen” inference chain. Specifically, we design a temporally decoupled policy: the model first extracts visual priors in a <think> block to serve as semantic anchors, then generates the transcription in an <answer> block. The policy is optimized via multi-objective reinforcement learning. Furthermore, we introduce SlideASR-Bench, a comprehensive benchmark designed to address the scarcity of entity-rich data, comprising a large-scale synthetic corpus for training and a challenging real-world test set for evaluation. Extensive evaluations demonstrate that VAPO effectively eliminates visual interference and achieves state-of-the-art performance on SlideASR-Bench and public datasets, significantly reducing entity recognition errors in specialized domains.
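The two-block output format described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the closing tags `</think>` and `</answer>`, the helper names `split_output`, `word_error_rate`, and `toy_reward`, and the weighting between a format term and an accuracy term are all assumptions made for illustration. The abstract states only that the policy is optimized with multi-objective reinforcement learning and does not specify the reward.

```python
import re

# Assumed output shape: "<think>...</think><answer>...</answer>".
# The abstract names the two blocks; the closing tags are an assumption.
_BLOCKS = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def split_output(text):
    """Return (visual_prior, transcript), or None if the format is violated."""
    match = _BLOCKS.fullmatch(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


def toy_reward(output, reference, format_weight=0.2):
    """Hypothetical two-term reward: format compliance plus transcript accuracy."""
    parsed = split_output(output)
    if parsed is None:
        return 0.0  # malformed output earns nothing in this toy scheme
    _, transcript = parsed
    accuracy = max(0.0, 1.0 - word_error_rate(reference, transcript))
    return format_weight + (1.0 - format_weight) * accuracy
```

For example, `split_output("<think>VAPO</think><answer>we propose VAPO</answer>")` yields the visual prior and the transcript as separate strings, so that only the transcript is scored against the spoken reference.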

Anthology ID: 2026.acl-long.425
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 9417–9432
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.425/
Cite (ACL): Rui Hu, Delai Qiu, Yining Wang, Shengping Liu, and Jitao Sang. 2026. VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9417–9432, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models (Hu et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.425.pdf
Checklist: 2026.acl-long.425.checklist.pdf
