Efficient Visual Grounding in VQA via Question-Guided Sparse Attention

Prasanth


Abstract
Visual Question Answering (VQA) models process all image patches uniformly despite questions typically requiring only a small subset of visual information. This inefficiency leads to unnecessary computation and can result in attention dilution across irrelevant image regions. We propose Question-Guided Sparse Attention (QGSA), a plug-and-play mechanism that dynamically selects relevant image patches conditioned on question semantics. Our approach introduces three components: (1) a differentiable patch selector based on Gumbel-Softmax reparameterisation that enables end-to-end training with hard patch selection at inference; (2) a self-supervised grounding loss that encourages spatial selectivity without bounding-box annotations, combining contrastive patch selection with patch–word alignment via a frozen CLIP encoder; and (3) an adaptive sparsity mechanism that adjusts the number of selected patches according to estimated question complexity. Experiments on SmolVLM-256M-Instruct and SmolVLM-500M-Instruct across three VQA benchmarks (VQA-RAD, A-OKVQA, RefCOCO) demonstrate that QGSA reduces cross-attention FLOPs by 91–99% across input resolutions, achieving up to 76× theoretical speedup at 576px resolution, while maintaining exact accuracy parity with the dense baseline (Δ = 0.0 pp on all datasets). Wall-clock parity with the dense baseline is reached at 336px; realised end-to-end speedup requires larger models where cross-attention dominates total compute. QGSA consistently selects an average of k≈17 patches out of 576 (256M model), up to k≈18 (500M model), yielding up to a 34× reduction in the visual token sequence. These small-scale results validate the feasibility of question-conditioned sparse attention and provide a foundation for scaling to larger VLMs.
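The selection idea in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the function names (gumbel_softmax, hard_top_k, token_reduction), the example scores, and the temperature are all hypothetical, and the training-time gradient path (which would need an autodiff framework) is not shown. The sketch only shows the Gumbel-Softmax relaxation over per-patch scores, a hard top-k pick of the kind used at inference, and the arithmetic behind the reported token reduction (576 patches down to about 17).

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Soft selection weights: softmax((logit + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp away from 0 so that log(u) stays finite.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_top_k(logits, k):
    """Indices of the k highest-scoring patches, in ascending index order."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])


def token_reduction(num_patches, k):
    """Factor by which the visual token sequence shrinks when k patches are kept."""
    return num_patches / k


if __name__ == "__main__":
    # Hypothetical question-conditioned relevance scores for six patches.
    scores = [0.1, 2.0, -1.0, 1.5, 0.3, 3.0]
    weights = gumbel_softmax(scores, tau=0.5, rng=random.Random(0))
    print("soft weights sum to", round(sum(weights), 6))
    print("hard selection:", hard_top_k(scores, k=2))
    print("reduction for 576 -> 17 patches:", round(token_reduction(576, 17), 1))
```

With the figures quoted in the abstract, keeping 17 of 576 patches shortens the visual token sequence by a factor of roughly 34, which matches the reported reduction.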
Anthology ID:
2026.alvr-main.24
Volume:
Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Qianqi Yan, Syrielle Montariol, Yue Fan, Jing Gu, Jiayi Pan, Manling Li, Parisa Kordjamshidi, Alane Suhr, Xin Eric Wang
Venues:
ALVR | WS
Publisher:
Association for Computational Linguistics
Pages:
260–271
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.24/
Cite (ACL):
Prasanth. 2026. Efficient Visual Grounding in VQA via Question-Guided Sparse Attention. In Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR), pages 260–271, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Efficient Visual Grounding in VQA via Question-Guided Sparse Attention (Prasanth, ALVR 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.24.pdf