Closing the Spatial Execution Gap in Digital Whiteboards via Verifiable Reinforcement Learning

Chang Liu, Benjamin Wagley, Zibo Wang, Mehmet E. Belviranli, Bo Wu


Abstract
While multi-modal large language models such as GPT-5 demonstrate exceptional general understanding, they suffer from a fundamental Spatial Execution Gap, failing to translate visual semantics into precise, schema-valid coordinate operations in interactive environments. In this work, we show that model scale alone cannot close this gap; instead, verifiable structured reasoning provides the key to spatial precision. We present a comprehensive pipeline that leverages Group Relative Policy Optimization to enforce a strict Identify-Reason-Verify protocol, effectively shifting the computational burden from parameters to test-time reasoning. By utilizing a multi-agent system to distill optimal reasoning schemas and training on execution-verifiable rewards, our specialized 3B agent achieves 100% format coherence and 81.12% operation accuracy on digital whiteboard tasks. Crucially, our approach outperforms a state-of-the-art frontier model, GPT-5, by 16.75% in operation accuracy. The results suggest that for complex user interface manipulation, small, RL-aligned models with dedicated reasoning protocols are superior to generalist frontier models, offering a promising direction for building reliable web agents.
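The abstract names two mechanisms, execution-verifiable rewards and Group Relative Policy Optimization, without room to illustrate them. The following is a minimal sketch of how they fit together, not the authors' implementation. The action schema (`op`, `x`, `y`), the pixel tolerance, and the reward values 0.0 / 0.2 / 1.0 are all hypothetical choices made for illustration. The only part taken from the general GRPO recipe is the normalization of each sampled output's reward against the mean and standard deviation of its own group.

```python
import json
from statistics import mean, pstdev


def verifiable_reward(output: str, target: dict, tol: float = 5.0) -> float:
    """Hypothetical rule-based reward for one sampled whiteboard action.

    0.0 -> output is not schema-valid JSON
    0.2 -> schema-valid, but wrong operation or coordinates off by more than tol
    1.0 -> schema-valid, correct operation, coordinates within tol
    """
    try:
        action = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(action, dict) or not {"op", "x", "y"} <= action.keys():
        return 0.0
    if not all(isinstance(action[k], (int, float)) for k in ("x", "y")):
        return 0.0
    correct = (
        action["op"] == target["op"]
        and abs(action["x"] - target["x"]) <= tol
        and abs(action["y"] - target["y"]) <= tol
    )
    return 1.0 if correct else 0.2


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, three sampled outputs (illustrative strings, not real model output).
target = {"op": "move", "x": 100, "y": 200}
samples = [
    '{"op": "move", "x": 102, "y": 199}',  # correct within tolerance
    "move the sticky note to the left",     # not schema-valid
    '{"op": "move", "x": 300, "y": 200}',  # valid schema, wrong position
]
rewards = [verifiable_reward(s, target) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the advantage is computed relative to the group, no learned value model is needed: an output is reinforced only to the extent that it scores better under the checker than its siblings for the same prompt.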
Anthology ID:
2026.acl-long.630
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
13834–13849
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.630/
Cite (ACL):
Chang Liu, Benjamin Wagley, Zibo Wang, Mehmet E. Belviranli, and Bo Wu. 2026. Closing the Spatial Execution Gap in Digital Whiteboards via Verifiable Reinforcement Learning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13834–13849, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Closing the Spatial Execution Gap in Digital Whiteboards via Verifiable Reinforcement Learning (Liu et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.630.pdf
Checklist:
 2026.acl-long.630.checklist.pdf