Yuanlei Wang


2025

InReAct: An Inspire-Then-Reinforce Training Framework for Multimodal GUI Agent
Yuanlei Wang | Liuzhou Zhang | Haohao Luo | Ying Shen
Findings of the Association for Computational Linguistics: EMNLP 2025

Graphical User Interface (GUI) interaction, which aims to develop an intelligent GUI agent that executes user instructions to perform tasks such as installing applications by controlling digital devices, has gained significant attention due to its practical value. Although current advanced multimodal large language models (MLLMs) provide GUI agents with robust perception and reasoning capabilities, they often struggle with the precise localization of small elements. To tackle this problem, we propose InReAct, a multimodal GUI agent framework that unifies observing, thinking, and acting for precise and interpretable decision-making. It is trained via a two-stage process: curriculum learning to progressively build perception, grounding, and reasoning abilities, followed by reinforcement learning to refine pixel-level grounding with an outcome-based reward. We introduce a rule-based reward function that jointly optimizes action-type selection and pixel-level localization accuracy. Experimental results on multiple datasets demonstrate the superiority of InReAct in both grounding and navigation tasks.
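
To make the reward description concrete, the following Python sketch shows one plausible form of a rule-based, outcome-style reward that jointly scores action-type selection and pixel-level localization. The abstract does not give the actual reward definition, so the structure, weights, and the names Action, rule_based_reward, and gold_bbox are illustrative assumptions, not the paper's method.

# Hypothetical sketch of a rule-based reward in the spirit of the abstract:
# it credits the agent for choosing the correct action type and, for
# coordinate-based actions, for clicking inside the target element's bounding
# box. All weights and thresholds are assumed for illustration.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Action:
    action_type: str              # e.g. "click", "type", "scroll"
    x: Optional[float] = None     # predicted pixel coordinates, if any
    y: Optional[float] = None

def rule_based_reward(pred: Action,
                      gold: Action,
                      gold_bbox: Optional[Tuple[float, float, float, float]] = None) -> float:
    """Return a scalar reward combining action-type and localization accuracy.

    gold_bbox: (x_min, y_min, x_max, y_max) of the target GUI element, if any.
    """
    # Action-type term: a wrong action type earns no reward at all.
    if pred.action_type != gold.action_type:
        return 0.0
    reward = 0.5  # assumed partial credit for the correct action type

    # Localization term: only applies to actions that carry coordinates.
    if gold_bbox is not None and pred.x is not None and pred.y is not None:
        x_min, y_min, x_max, y_max = gold_bbox
        if x_min <= pred.x <= x_max and y_min <= pred.y <= y_max:
            reward += 0.5  # predicted click lands inside the target element
    elif gold_bbox is None:
        reward += 0.5  # non-grounded actions (e.g. scroll) need no localization

    return reward

# Example: a correct click inside the target box yields the full reward of 1.0.
# rule_based_reward(Action("click", 120, 45), Action("click"), gold_bbox=(100, 30, 160, 60))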