INREACT: An Inspire-Then-Reinforce Training Framework For Multimodal GUI Agent

Yuanlei Wang; Liuzhou Zhang; Haohao Luo; Ying Shen

doi:10.18653/v1/2025.findings-emnlp.486

Abstract

Graphical User Interface (GUI) interaction, which aims to develop an intelligent GUI agent that executes user instructions to perform tasks such as installing applications by controlling digital devices, has gained significant attention due to its practical value. Although current advanced multimodal large language models (LLMs) provide GUI agents with robust perception and reasoning capabilities, they often struggle with the precise localization of small elements. To tackle this problem, we propose InReAct, a multimodal GUI agent framework that unifies observing, thinking, and acting for precise and interpretable decision-making. It is trained via a two-stage process: curriculum learning to progressively build perception, grounding, and reasoning abilities, followed by reinforcement learning to refine pixel-level grounding with an outcome-based reward. We introduce a rule-based reward function that jointly optimizes action-type selection and pixel-level localization accuracy. Experimental results on multiple datasets demonstrate the superiority of InReAct in both grounding and navigation tasks.
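The abstract describes a rule-based reward that jointly scores action-type selection and pixel-level localization but does not give its exact form. The following is a minimal sketch of what such a reward could look like, not the authors' implementation. It assumes that localization is judged by whether the predicted point falls inside a ground-truth bounding box and that the two terms are combined with equal weights; the names `Action`, `point_in_box`, `rule_based_reward`, `type_weight`, and `loc_weight` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (left, top, right, bottom) in pixels; an assumed representation of the target element.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Action:
    """A predicted GUI action: a type such as "click" and an optional pixel point."""
    action_type: str
    point: Optional[Tuple[float, float]] = None  # (x, y) in pixels


def point_in_box(point: Tuple[float, float], box: Box) -> bool:
    """Return True if the point lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def rule_based_reward(
    pred: Action,
    gold_type: str,
    gold_box: Optional[Box] = None,
    type_weight: float = 0.5,  # assumed weighting, not taken from the paper
    loc_weight: float = 0.5,   # assumed weighting, not taken from the paper
) -> float:
    """Outcome-based reward combining action-type match and localization hit."""
    # A wrong action type earns nothing, regardless of where the point lands.
    if pred.action_type != gold_type:
        return 0.0
    # Actions without a location target are fully rewarded on a type match.
    if gold_box is None:
        return type_weight + loc_weight
    reward = type_weight
    # The localization term is granted only if the point hits the target element.
    if pred.point is not None and point_in_box(pred.point, gold_box):
        reward += loc_weight
    return reward
```

Under these assumptions, a correct click inside the target box scores the full reward, a correct action type with a missed location scores only the type term, and a wrong action type scores zero. A binary hit test keeps the reward verifiable by rule; a distance-based term would be an alternative design if denser feedback were wanted.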

Anthology ID: 2025.findings-emnlp.486
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 9148–9160
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.486/
DOI: 10.18653/v1/2025.findings-emnlp.486
Cite (ACL): Yuanlei Wang, Liuzhou Zhang, Haohao Luo, and Ying Shen. 2025. INREACT: An Inspire-Then-Reinforce Training Framework For Multimodal GUI Agent. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 9148–9160, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): INREACT: An Inspire-Then-Reinforce Training Framework For Multimodal GUI Agent (Wang et al., Findings 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.486.pdf
Checklist: 2025.findings-emnlp.486.checklist.pdf
