FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting
Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Xiuying Chen
Abstract
Despite the rapid progress of large vision-language models (LVLMs), fine-grained, state-conditioned GUI interaction remains challenging. Current evaluations offer limited coverage, imprecise target-state definitions, and an overreliance on final-task success, which obscures where and why agents fail. To address this gap, we introduce FineState-Bench, a benchmark that evaluates whether an agent can correctly ground an instruction to the intended UI control and reach the exact target state. FineState-Bench comprises 2,209 instances across desktop, web, and mobile platforms, spanning four interaction families and 23 UI component types, and each instance explicitly specifies an exact target state for fine-grained state setting. We further propose FineState-Metrics, a four-stage diagnostic pipeline with stage-wise success rates: Localization Success Rate (SR@Loc), Interaction Success Rate (SR@Int), Exact State Success Rate at Locate (ES-SR@Loc), and Exact State Success Rate at Interact (ES-SR@Int). We also propose a plug-and-play Visual Diagnostic Assistant (VDA) that generates a Description and a bounding-box Localization Hint, making it possible to diagnose whether failures stem from visual grounding through controlled comparisons with and without VDA. On FineState-Bench, exact goal-state success remains low: ES-SR@Int peaks at 32.8% on Web and averages 22.8% across platforms. With VDA localization hints, Gemini-2.5-Flash gains 14.9 ES-SR@Int points, suggesting substantial headroom from improved visual grounding; yet overall accuracy remains insufficient for reliable fine-grained, state-conditioned interaction.
- Anthology ID:
- 2026.findings-acl.2136
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 43073–43088
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2136/
- Cite (ACL):
- Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, and Xiuying Chen. 2026. FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting. In Findings of the Association for Computational Linguistics: ACL 2026, pages 43073–43088, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting (Ji et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2136.pdf
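The abstract names four stage-wise success rates (SR@Loc, SR@Int, ES-SR@Loc, ES-SR@Int) and reports a w/ vs. w/o VDA gain in ES-SR@Int points, but it does not spell out how each rate is computed. The following is a minimal sketch under an explicit assumption: every evaluated instance yields one pass/fail outcome per stage, and each rate is the percentage of passing instances. The `StageOutcome` record, its field names, and the `stage_success_rates` and `vda_gain` helpers are hypothetical illustrations, not the benchmark's actual schema or code. The precise definitions, including any conditioning between stages, should be taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical per-instance record (assumption, not the benchmark's schema):
# one boolean outcome for each of the four diagnostic stages.
@dataclass
class StageOutcome:
    located: bool           # counted toward SR@Loc
    interacted: bool        # counted toward SR@Int
    exact_after_loc: bool   # counted toward ES-SR@Loc
    exact_after_int: bool   # counted toward ES-SR@Int

# Map each metric name from the abstract to the (assumed) record field.
METRIC_FIELDS = {
    "SR@Loc": "located",
    "SR@Int": "interacted",
    "ES-SR@Loc": "exact_after_loc",
    "ES-SR@Int": "exact_after_int",
}

def stage_success_rates(outcomes: List[StageOutcome]) -> Dict[str, float]:
    """Return each stage-wise success rate as a percentage of all instances."""
    if not outcomes:
        raise ValueError("need at least one evaluated instance")
    n = len(outcomes)
    return {
        metric: 100.0 * sum(getattr(o, field) for o in outcomes) / n
        for metric, field in METRIC_FIELDS.items()
    }

def vda_gain(with_vda: List[StageOutcome],
             without_vda: List[StageOutcome],
             metric: str = "ES-SR@Int") -> float:
    """Point difference on one metric between a w/ VDA run and a w/o VDA run."""
    return stage_success_rates(with_vda)[metric] - stage_success_rates(without_vda)[metric]
```

Because both runs are scored on the same instances with only the VDA hint toggled, the difference returned by `vda_gain` isolates the contribution of the localization hint under this assumed scoring; the toy inputs one would pass to these helpers are made up and do not reproduce any reported result.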