Abstract
Visual Question Answering (VQA) often involves diverse reasoning scenarios across Vision and Language (V&L). Most prior VQA studies, however, have focused only on assessing a model’s overall accuracy without evaluating it on different reasoning cases. Furthermore, some recent works observe that conventional Chain-of-Thought (CoT) prompting fails to generate effective reasoning for VQA, especially for complex scenarios requiring multi-hop reasoning. In this paper, we propose II-MMR, a novel idea to identify and improve multi-modal multi-hop reasoning in VQA. Specifically, II-MMR takes a VQA question with an image and finds a reasoning path to reach its answer using two novel language prompts: (i) an answer prediction-guided CoT prompt, or (ii) a knowledge triplet-guided prompt. II-MMR then analyzes this path to identify different reasoning cases in current VQA benchmarks by estimating how many hops and which types (i.e., visual or beyond-visual) of reasoning are required to answer the question. On popular benchmarks including GQA and A-OKVQA, II-MMR observes that most VQA questions are easy to answer, demanding only “single-hop” reasoning, whereas only a few questions require “multi-hop” reasoning. Moreover, while recent V&L models struggle with such complex multi-hop reasoning questions even when using the traditional CoT method, II-MMR shows its effectiveness across all reasoning cases in both zero-shot and fine-tuning settings.
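To make the two prompting strategies in the abstract concrete, here is a minimal, hypothetical sketch. It is not the authors’ released code: the prompt wording, function names, and the hop-counting heuristic are all assumptions for illustration only; the paper’s actual prompts and hop-identification procedure may differ.

```python
# Illustrative sketch (not the authors' code) of the two prompting
# strategies named in the abstract. All wording and names are hypothetical.

def answer_guided_cot_prompt(question: str, predicted_answer: str) -> str:
    """(i) Answer prediction-guided CoT: condition the rationale on a
    first-pass answer prediction, then ask for the reasoning path."""
    return (
        f"Question: {question}\n"
        f"Predicted answer: {predicted_answer}\n"
        "Explain, step by step, the reasoning path from the image and "
        "question to this answer."
    )

def knowledge_triplet_prompt(question: str) -> str:
    """(ii) Knowledge triplet-guided: elicit the reasoning path as
    (subject, relation, object) triplets."""
    return (
        f"Question: {question}\n"
        "List the knowledge triplets (subject, relation, object) needed "
        "to answer this question, one per line."
    )

def estimate_hops(triplets: list[tuple[str, str, str]]) -> int:
    """Toy heuristic: count one reasoning hop per triplet in the path."""
    return len(triplets)

if __name__ == "__main__":
    q = "What utensil would you use to eat the food on the plate?"
    print(answer_guided_cot_prompt(q, "fork"))
    print(knowledge_triplet_prompt(q))
    # Two triplets -> a "multi-hop" question under this toy heuristic.
    print(estimate_hops([("food", "is", "pasta"),
                         ("pasta", "eaten with", "fork")]))
```

Under this toy heuristic, a question resolved by a single triplet would be labeled “single-hop,” matching the paper’s observation that most GQA and A-OKVQA questions fall into that easy case.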
- Anthology ID: 2024.findings-acl.636
- Volume: Findings of the Association for Computational Linguistics: ACL 2024
- Month: August
- Year: 2024
- Address: Bangkok, Thailand
- Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 10698–10709
- URL: https://preview.aclanthology.org/add_missing_videos/2024.findings-acl.636/
- DOI: 10.18653/v1/2024.findings-acl.636
- Cite (ACL): Jihyung Kil, Farideh Tavazoee, Dongyeop Kang, and Joo-Kyung Kim. 2024. II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10698–10709, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal): II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering (Kil et al., Findings 2024)
- PDF: https://preview.aclanthology.org/add_missing_videos/2024.findings-acl.636.pdf