Flow-Based Page Unique Semantic Mapping Architecture for Document Visual Question Answering

Haosen Wang, Jing Xiao, Chaochao Du, Xiaowang Zhang, Zhiyong Feng


Abstract
Document Visual Question Answering (DocVQA) aims to generate answers by jointly understanding the textual, layout, and visual elements within document images. Although end-to-end vision-based generative methods have reduced dependency on OCR, they still struggle to localize evidence precisely when page semantics are complex and highly similar. Existing research also lacks an in-depth theoretical analysis of the question-driven semantic representation space, and so fails to address the fundamental problem of distinguishing semantically similar pages. To fill this theoretical gap, we propose and prove that, given a specific question, each page possesses a unique semantic representation, and that there exists a bijective mapping between the page and its unique semantics. Based on this theoretical foundation, we introduce the Flow-Based Page Unique Semantic Mapping Architecture (FUMA), which recasts evidence localization from similarity-based retrieval as precise selection over unique semantics. FUMA employs fine-grained cross-modal attention to extract discriminative cues and uses flow-based reversible transformations with likelihood regularization to learn bijective mappings, ensuring that each page obtains a unique semantic representation. Moreover, a multi-expert collaboration mechanism models complementary fine-grained multimodal information within each page to achieve robust answer generation. Experimental results demonstrate that FUMA significantly outperforms existing methods in both evidence localization and answer generation.
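The abstract's "flow-based reversible transformations with likelihood regularization" can be illustrated with a generic affine coupling layer of the kind used in normalizing flows. The sketch below is not FUMA's actual layer: the function names (`make_coupling`, `forward`, `inverse`, `nll`), the tiny random linear maps, and the 6-dimensional "page feature" are all hypothetical, chosen only to show why such a transformation is bijective and how a likelihood term can be computed from it.

```python
import math
import random

def make_coupling(dim, seed=0):
    """Build hypothetical parameters for one affine coupling layer."""
    rng = random.Random(seed)
    half = dim // 2
    rest = dim - half
    # Small random linear maps standing in for learned networks (assumption).
    w_s = [[rng.gauss(0, 0.1) for _ in range(rest)] for _ in range(half)]
    w_t = [[rng.gauss(0, 0.1) for _ in range(rest)] for _ in range(half)]
    return half, w_s, w_t

def _scale_shift(x1, w_s, w_t):
    """Compute a bounded log-scale and a shift from the untouched half."""
    rest = len(w_s[0])
    log_s = [math.tanh(sum(x1[i] * w_s[i][j] for i in range(len(x1))))
             for j in range(rest)]
    t = [sum(x1[i] * w_t[i][j] for i in range(len(x1))) for j in range(rest)]
    return log_s, t

def forward(x, params):
    """Map x to z; the first half passes through, the second is rescaled and shifted."""
    half, w_s, w_t = params
    x1, x2 = x[:half], x[half:]
    log_s, t = _scale_shift(x1, w_s, w_t)
    z2 = [a * math.exp(s) + b for a, s, b in zip(x2, log_s, t)]
    # The log-determinant of the Jacobian is the sum of the log-scales.
    return x1 + z2, sum(log_s)

def inverse(z, params):
    """Exactly undo forward(), which is what makes the mapping bijective."""
    half, w_s, w_t = params
    z1, z2 = z[:half], z[half:]
    log_s, t = _scale_shift(z1, w_s, w_t)
    x2 = [(a - b) * math.exp(-s) for a, s, b in zip(z2, log_s, t)]
    return z1 + x2

def nll(z, log_det):
    """Negative log-likelihood under a standard normal prior on z."""
    d = len(z)
    log_pz = -0.5 * sum(v * v for v in z) - 0.5 * d * math.log(2 * math.pi)
    return -(log_pz + log_det)

params = make_coupling(dim=6)
page = [0.5, -1.2, 0.3, 2.0, -0.7, 1.1]  # hypothetical page feature vector
z, log_det = forward(page, params)
recovered = inverse(z, params)
loss = nll(z, log_det)
```

Because `inverse` recovers the input exactly (up to floating-point error), two different inputs can never share the same output, which is the property the abstract relies on when it speaks of a unique semantic representation per page. A likelihood term such as `nll` is one common way to regularize a flow during training; how FUMA combines it with its other objectives is not stated in the abstract.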
Anthology ID:
2026.acl-long.1679
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
36260–36280
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1679/
Cite (ACL):
Haosen Wang, Jing Xiao, Chaochao Du, Xiaowang Zhang, and Zhiyong Feng. 2026. Flow-Based Page Unique Semantic Mapping Architecture for Document Visual Question Answering. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 36260–36280, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Flow-Based Page Unique Semantic Mapping Architecture for Document Visual Question Answering (Wang et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1679.pdf
Checklist:
2026.acl-long.1679.checklist.pdf