Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering

Wei Han, Hantao Huang, Tao Han


Abstract
Image text carries essential information for understanding a scene and performing reasoning. The text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) texts or a fixed vocabulary. However, the positional information of text is underused, and no evidence is provided for the generated answer. To address this challenge, this paper proposes a localization-aware answer prediction network (LaAP-Net). Our LaAP-Net not only generates the answer to the question but also predicts a bounding box as evidence for the generated answer. Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task. Our proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.
Anthology ID:
2020.coling-main.278
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Donia Scott, Nuria Bel, Chengqing Zong
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
3118–3131
URL:
https://aclanthology.org/2020.coling-main.278
DOI:
10.18653/v1/2020.coling-main.278
Cite (ACL):
Wei Han, Hantao Huang, and Tao Han. 2020. Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3118–3131, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering (Han et al., COLING 2020)
PDF:
https://preview.aclanthology.org/add_acl24_videos/2020.coling-main.278.pdf
Data:
ST-VQA, TextVQA, Visual Question Answering