Abstract
We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging because reasoning data that spans multiple steps of visual and language processing is scarce. To overcome this challenge, we first introduce a least-to-most visual reasoning paradigm, which interleaves steps of decomposing a question into sub-questions and invoking external tools to resolve the sub-questions. Based on the paradigm, we further propose a novel data synthesis approach that automatically creates questions and multi-step reasoning paths for an image in a bottom-up manner. Our approach divides the complex synthesis task into a few simple sub-tasks and relies (almost entirely) on open-source models to accomplish them. The entire synthesis process is therefore reproducible and cost-efficient, and the quality of the synthesized data is guaranteed. With this approach, we construct 50k visual reasoning examples. We then develop a visual reasoner through supervised fine-tuning, which can enhance the reasoning abilities of a wide range of existing VLMs in a plug-and-play fashion. Extensive experiments indicate that the visual reasoner consistently and significantly improves four VLMs on four VQA benchmarks.
- Anthology ID:
- 2024.emnlp-main.284
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 4941–4957
- URL:
- https://aclanthology.org/2024.emnlp-main.284
- DOI:
- 10.18653/v1/2024.emnlp-main.284
- Cite (ACL):
- Chuanqi Cheng, Jian Guan, Wei Wu, and Rui Yan. 2024. From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4941–4957, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis (Cheng et al., EMNLP 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.emnlp-main.284.pdf
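
The least-to-most paradigm described in the abstract interleaves two kinds of steps: proposing the next sub-question, and calling an external tool to resolve it. The following is only a minimal sketch of that control loop, not the authors' implementation. The names `least_to_most`, `Step`, `toy_propose`, the tool names `grounding` and `ocr`, and the stopping rule are all assumptions made for illustration; in the paper the proposer role would be played by the fine-tuned visual reasoner.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool signature: (image handle, sub-question) -> textual result.
Tool = Callable[[str, str], str]


@dataclass
class Step:
    sub_question: str
    tool_name: str
    result: str


# Hypothetical proposer: given the question and the steps so far, return the
# next (sub_question, tool_name), or None when no further step is needed.
Proposer = Callable[[str, List[Step]], Optional[Tuple[str, str]]]


def least_to_most(
    image: str,
    question: str,
    propose: Proposer,
    tools: Dict[str, Tool],
    max_steps: int = 5,
) -> List[Step]:
    """Alternate between decomposing the question and invoking a tool."""
    steps: List[Step] = []
    for _ in range(max_steps):
        proposal = propose(question, steps)
        if proposal is None:  # the proposer judges the evidence sufficient
            break
        sub_question, tool_name = proposal
        result = tools[tool_name](image, sub_question)
        steps.append(Step(sub_question, tool_name, result))
    return steps


# Toy stand-ins so the sketch runs without any model (assumptions, not real tools).
def toy_propose(question: str, steps: List[Step]) -> Optional[Tuple[str, str]]:
    plan = [
        ("Where is the sign?", "grounding"),
        ("What does the sign say?", "ocr"),
    ]
    return plan[len(steps)] if len(steps) < len(plan) else None


toy_tools: Dict[str, Tool] = {
    "grounding": lambda image, q: "region(10, 20, 110, 80)",
    "ocr": lambda image, q: "STOP",
}

trace = least_to_most("street.jpg", "What does the sign say?", toy_propose, toy_tools)
for step in trace:
    print(f"{step.tool_name}: {step.sub_question} -> {step.result}")
```

The resulting list of steps is the kind of reasoning path that a downstream VLM could take as extra context when producing the final answer, which is one way to read the plug-and-play claim.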