Less is More: Controlled Visual Evidence Routing and Redundancy Compression for Key Information Extraction

Yang Li, Yajiao Wang, Wenhao Hu, Mengting Zhang, Zhixiong Zhang


Abstract
Key Information Extraction (KIE) in visually-rich documents is inherently token-centric, yet prevailing multimodal encoders often fuse dense visual patches with text tokens indiscriminately, which can introduce low-density visual noise, intensify modality competition, and cause robustness collapse under distribution shifts. We propose OTCR, a lightweight and architecture-agnostic framework that turns vision from a competitor into a selective supporter for extraction. OTCR learns sparse, interpretable cross-modal coupling via optimal transport to route local visual evidence to the most relevant text tokens, applies token-level gating to control injection strength, and further suppresses spurious correlations through a variational information bottleneck. Experiments on FUNSD, CORD, and SROIE show consistent gains when OTCR is plugged into LayoutLMv3 and GeoLayoutLM, and ablations verify the complementary contributions of coupling, gating, and bottlenecking. Under distribution shifts from Do-GOOD and EC-FUNSD, OTCR markedly mitigates performance degradation, indicating that controlled visual evidence can effectively compensate when text/layout shortcuts become unreliable.
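The three components named in the abstract (optimal-transport coupling, token-level gating, and a variational information bottleneck) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It makes several assumptions the abstract does not state: text tokens and visual patches are already projected to a shared dimension, the transport cost is cosine distance, both marginals are uniform, the plan is computed with entropic Sinkhorn iterations (which give a soft rather than strictly sparse coupling), and the bottleneck uses a diagonal Gaussian posterior against a standard normal prior. All function and parameter names (`sinkhorn_plan`, `route_and_gate`, `vib_kl`, `w_gate`, `eps`) are hypothetical.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals (rows: text tokens, cols: patches)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # assumed uniform mass over text tokens
    b = np.full(m, 1.0 / m)          # assumed uniform mass over visual patches
    K = np.exp(-cost / eps)          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    # Column marginals are matched exactly after the final v update;
    # row marginals are matched only approximately.
    return u[:, None] * K * v[None, :]

def route_and_gate(text, patches, w_gate, eps=0.1):
    """Route patch features to text tokens via the OT plan, then gate the injection."""
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    cost = 1.0 - t @ p.T                                  # cosine distance (assumption)
    plan = sinkhorn_plan(cost, eps)
    weights = plan / plan.sum(axis=1, keepdims=True)      # per-token convex weights
    evidence = weights @ patches                          # routed visual evidence
    logits = np.concatenate([text, evidence], axis=1) @ w_gate
    gate = 1.0 / (1.0 + np.exp(-logits))                  # one scalar gate per token
    fused = text + gate[:, None] * evidence
    return fused, plan, gate

def vib_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) per token, the usual VIB penalty."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
```

In a trained model the gate weights and the bottleneck parameters would be learned jointly with the extraction loss; here they are plain arrays so that the data flow is easy to follow.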
Anthology ID:
2026.magmar-main.10
Volume:
Proceedings of the 2nd Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2026)
Month:
July
Year:
2026
Address:
San Diego, USA
Editors:
Kenton Murray, Reno Kriz
Venues:
MAGMaR | WS
Publisher:
Association for Computational Linguistics
Pages:
42–53
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.magmar-main.10/
Cite (ACL):
Yang Li, Yajiao Wang, Wenhao Hu, Mengting Zhang, and Zhixiong Zhang. 2026. Less is More: Controlled Visual Evidence Routing and Redundancy Compression for Key Information Extraction. In Proceedings of the 2nd Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2026), pages 42–53, San Diego, USA. Association for Computational Linguistics.
Cite (Informal):
Less is More: Controlled Visual Evidence Routing and Redundancy Compression for Key Information Extraction (Li et al., MAGMaR 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.magmar-main.10.pdf