Sudipta Paul
2026
Switching Heads and Softening Tokens: Turnkey Solutions to Visually Grounded Document QA
Ximing Wen | Wenbo Li | Sudipta Paul | Yashas Malur Saidutta | Kalpa Gunaratna | Srinivas Chappidi
Findings of the Association for Computational Linguistics: ACL 2026
Visually Grounded Document Question Answering often lacks robust, end-to-end solutions capable of handling complex, multi-answer queries without reliance on ad-hoc processing. In this work, we propose two turnkey LLM architectures to address this gap. We first introduce a single-head architecture where coordinates are represented as special tokens within the unified vocabulary. While structurally robust, this approach suffers from the limitations of discrete supervision; to address this, we propose a novel “softening token” method that enables a differentiable Mean-Squared-Error loss over token probabilities. Although this significantly improves visual grounding, the spatial precision remains bound by discretization. Consequently, we propose a second solution: a dual-head architecture that alternates between text generation and regression-based bounding box prediction. This method offers high spatial precision via a regression head, further stabilized by our introduction of an Intersection-over-Union loss. Finally, by combining the single-head model’s structural robustness with the high precision of the dual-head model, we propose an ensemble method that yields significant performance gains beyond each individual component.
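The abstract does not spell out either loss, so the following is only a minimal sketch of how the two ideas might look, not the authors' implementation. It assumes that each box coordinate is predicted as a distribution over equal-width bins in [0, 1], one bin per coordinate token, and that the "softened" prediction is the probability-weighted mean of the bin centers. The function names (`soft_coordinate`, `soft_token_mse`, `iou_loss`), the bin layout, and the `1 - IoU` form of the loss are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_coordinate(logits):
    """Expected coordinate in [0, 1] under the coordinate-token distribution.

    Assumption: token k stands for the center of the k-th of len(logits)
    equal-width bins. Being a weighted mean of probabilities, the result
    varies smoothly with the logits, unlike an argmax over tokens.
    """
    probs = softmax(logits)
    num_bins = len(logits)
    centers = [(k + 0.5) / num_bins for k in range(num_bins)]
    return sum(p * c for p, c in zip(probs, centers))


def soft_token_mse(box_logits, target_box):
    """Mean-squared error between softened coordinates and a target box.

    box_logits: four logit lists, one per coordinate (x1, y1, x2, y2).
    target_box: four floats in [0, 1].
    """
    preds = [soft_coordinate(logits) for logits in box_logits]
    return sum((p - t) ** 2 for p, t in zip(preds, target_box)) / len(target_box)


def iou_loss(pred, target):
    """1 - IoU for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    inter_h = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter
    # Degenerate boxes with zero union are treated as a complete miss.
    return 1.0 - inter / union if union > 0 else 1.0
```

In this sketch, `soft_token_mse` would play the role of the single-head model's softened-token loss, and `iou_loss` would be applied to the dual-head model's regressed boxes. In practice both would be written in an autodiff framework so that gradients reach the logits and the regression head; plain Python is used here only to keep the arithmetic visible.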