Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA

Guanhua Chen, Yutong Yao, Shenghe Sun, Ci-jun Gao, Shudong Liu, Lidia S. Chao, Feng Wan, Derek F. Wong


Abstract
Recent advances in vision-language models (VLMs) have achieved impressive results on standard image-text tasks, yet their potential for visual procedure question answering (VP-QA) remains largely unexplored. In VP-QA, users upload images of intermediate states of a complex procedure and ask for the next action, which poses unique challenges. To systematically evaluate VLMs on this practical task, we propose ProcedureVQA, a multimodal benchmark designed for visual procedural reasoning. Our analysis identifies two critical limitations in current VLMs: inadequate cross-modal retrieval of structured procedures given visual states, and misalignment between the granularity of image sequences and that of textual step decomposition. To address these issues, we present Chain-of-Procedure (CoP), a hierarchical reasoning framework that first retrieves relevant instructions using visual cues, then refines the steps through semantic decomposition, and finally generates the next step. Experiments across six VLMs demonstrate CoP’s effectiveness, with up to 13% absolute improvement over standard baselines.
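The three stages named in the abstract (retrieve, refine, generate the next step) can be illustrated with a toy, text-only sketch. This is not the authors' implementation: a plain-text caption stands in for the uploaded image, token overlap stands in for cross-modal retrieval, and splitting on connectives stands in for semantic decomposition. All names here (`Procedure`, `retrieve`, `refine`, `next_step`, `LIBRARY`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Procedure:
    title: str
    steps: list


def tokens(text: str) -> set:
    """Lowercase bag of words with commas and periods stripped."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def retrieve(state_caption: str, library: list) -> Procedure:
    """Stage 1 (toy): pick the procedure sharing the most words with the caption."""
    query = tokens(state_caption)
    return max(
        library,
        key=lambda p: len(query & tokens(p.title + " " + " ".join(p.steps))),
    )


def refine(steps: list) -> list:
    """Stage 2 (toy): split compound steps on ' and ' or ', then '."""
    fine = []
    for step in steps:
        for part in step.replace(", then ", " and ").split(" and "):
            part = part.strip().rstrip(".")
            if part:
                fine.append(part)
    return fine


def next_step(state_caption: str, fine_steps: list) -> Optional[str]:
    """Stage 3 (toy): find the sub-step best matching the state, return its successor."""
    if not fine_steps:
        return None
    query = tokens(state_caption)
    best = max(
        range(len(fine_steps)),
        key=lambda i: len(query & tokens(fine_steps[i])),
    )
    return fine_steps[best + 1] if best + 1 < len(fine_steps) else None


# Hypothetical procedure library, invented for this example.
LIBRARY = [
    Procedure(
        "Brew pour-over coffee",
        [
            "Boil water and grind the beans",
            "Place the filter in the dripper and add the grounds",
            "Pour water over the grounds",
        ],
    ),
    Procedure(
        "Change a bicycle tire",
        [
            "Remove the wheel and deflate the tube",
            "Pry off the tire, then replace the tube",
            "Inflate the tube and mount the wheel",
        ],
    ),
]
```

In this sketch the refinement stage matters because the caption describes a state that falls in the middle of a coarse textual step; splitting that step lets the final stage return the remaining half as the next action, which mirrors the granularity mismatch the abstract describes.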
Anthology ID:
2026.findings-acl.850
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
17207–17224
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.850/
Cite (ACL):
Guanhua Chen, Yutong Yao, Shenghe Sun, Ci-jun Gao, Shudong Liu, Lidia S. Chao, Feng Wan, and Derek F. Wong. 2026. Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA. In Findings of the Association for Computational Linguistics: ACL 2026, pages 17207–17224, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA (Chen et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.850.pdf
Checklist:
 2026.findings-acl.850.checklist.pdf