Compositional Reasoning via Joint Image and Language Decomposition
Dwip Dalal, Madhav Kanda, Zhenhailong Wang, Heng Ji, Unnat Jain
Abstract
Multimodal reasoning tasks such as visual question answering (VQA) require models to process both language and visual inputs. However, existing approaches typically decompose only language queries, treating images as monolithic inputs. We introduce REDI, a framework that jointly decomposes both images and questions into visual sub-domains (segmentation, material, depth, and color) with corresponding sub-questions. REDI uses an MLLM orchestrator to select the sub-domains required for each query, generate domain-specific sub-questions with grounded object references (via shared object labels), and fuse worker outputs via consistency-aware aggregation (verify–refine–override) to produce the final answer. This hierarchical multi-agent design mitigates error propagation and improves compositional reasoning across both open- and closed-source MLLMs. On SEEDBench, MMBench, and CLEVR, REDI achieves absolute accuracy improvements of 8.9%, 8.2%, and 16.0% over chain-of-thought and visual programming baselines. Project webpage: https://madhav-kanda.github.io/redi
- Anthology ID: 2026.findings-eacl.304
- Volume: Findings of the Association for Computational Linguistics: EACL 2026
- Month: March
- Year: 2026
- Address: Rabat, Morocco
- Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 5753–5775
- URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.304/
- Cite (ACL): Dwip Dalal, Madhav Kanda, Zhenhailong Wang, Heng Ji, and Unnat Jain. 2026. Compositional Reasoning via Joint Image and Language Decomposition. In Findings of the Association for Computational Linguistics: EACL 2026, pages 5753–5775, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal): Compositional Reasoning via Joint Image and Language Decomposition (Dalal et al., Findings 2026)
- PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.304.pdf