GOBench: Stage-Wise Diagnostics and the Visual Paradox in Multimodal Graph Optimization

Yinghao Chen, Wantong Xie, Shuli Zeng, Sijia Zhang, Xiaotian Pan, Feng Wu, Xiangyang Li


Abstract
Large language models (LLMs) and vision-language models (VLMs) are increasingly used as optimization assistants to produce solutions, generate solver-executable programs, or both. However, current evaluations are misaligned with deployment in three ways: they (P1) fail to represent multimodal problem specifications, (P2) score outcomes only and cannot localize where failures occur along the modeling pipeline, and (P3) rarely report inference cost, obscuring reliability–cost trade-offs. We introduce the Graph Optimization Benchmark (GOBench), an aligned multimodal benchmark with solver-derived oracles and a four-layer diagnostic protocol that evaluates intermediate artifacts as well as end results, together with the Visual Inference Penalty (VIP), a measure of multimodal overhead. Across frontier and open-weight models under paired text-only vs. text-plus-vision (T+V) settings, we find that vision reliably increases inference cost, while its impact on reliability is regime-dependent: frontier models often benefit from visual grounding, whereas several mid-tier/open models exhibit a Visual Paradox in which vision reduces downstream executability and verification coverage. End-to-end success is frequently bottlenecked by intermediate-stage dropout; supervised fine-tuning on intermediate targets can mitigate this attrition in open models. GOBench thus provides a reproducible harness for diagnosing failure modes and quantifying reliability–cost trade-offs.
Anthology ID:
2026.findings-acl.306
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6144–6167
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.306/
Cite (ACL):
Yinghao Chen, Wantong Xie, Shuli Zeng, Sijia Zhang, Xiaotian Pan, Feng Wu, and Xiangyang Li. 2026. GOBench: Stage-Wise Diagnostics and the Visual Paradox in Multimodal Graph Optimization. In Findings of the Association for Computational Linguistics: ACL 2026, pages 6144–6167, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
GOBench: Stage-Wise Diagnostics and the Visual Paradox in Multimodal Graph Optimization (Chen et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.306.pdf
Checklist:
 2026.findings-acl.306.checklist.pdf