Jihao Gu
2026
Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training
Jihao Gu | Qihang Ai | Yingyao Wang | Pi Bu | Jingxuan Xing | Yue Cao | Zekun Zhu | Wei Jiang | Ziming Wang | Yingxiu Zhao | Ming-Liang Zhang | Jun Song | Yuning Jiang | Bo Zheng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Vision-language model-based mobile agents have gained the ability to understand complex instructions and mobile screenshots, benefiting from reinforcement learning paradigms like Group Relative Policy Optimization (GRPO). However, existing approaches centered on offline training or local action-level rewards often trap agents in local optima, hindering effective exploration and error correction with the environment. Crucially, we find that directly applying task-level rewards often leads to convergence difficulties due to the sparse nature of GUI interactions. To address these challenges, we present Mobile-R1, a systematic training recipe that bridges atomic action execution and strategic task completion. We propose a hierarchical curriculum consisting of three stages: (1) format alignment for reasoning structure, (2) on-policy exploration with verifiable action feedback to ground basic execution, and (3) multi-turn task-level training in a realistic environment to unlock exploration and self-correction. This hierarchical strategy effectively bootstraps the agent, significantly enhancing its capability for exploration and self-correction (the “Eureka” moments). Furthermore, addressing the critical scarcity of diverse GUI data in non-English ecosystems, we contribute a comprehensive Chinese mobile dataset covering 28 applications with 24,521 high-quality manual annotations, and establish a rigorous benchmark with 500 trajectories. We will open source all resources, including the dataset, benchmark, model weights, and code: https://mobile-r1.github.io/Mobile-R1/.
InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning
Qihang Ai | Pi Bu | Yue Cao | Yingyao Wang | Jihao Gu | Jingxuan Xing | Zekun Zhu | Wei Jiang | Zhicheng Zheng | Jun Song | Yuning Jiang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in Vision-Language Models (VLMs) have enabled mobile agents to perceive and interact with real-world mobile environments based on human instructions. However, the current fully autonomous paradigm poses potential safety risks when model understanding or reasoning capabilities are insufficient. To address this challenge, we first introduce InquireBench, a comprehensive benchmark specifically designed to evaluate mobile agents’ capabilities in safe interaction and proactive inquiry with users, encompassing 5 categories and 22 sub-categories, where most existing VLM-based agents demonstrate near-zero performance. In this paper, we aim to develop an interactive system that actively seeks human confirmation at critical decision points. To achieve this, we propose InquireMobile, a novel model inspired by reinforcement learning, featuring a two-stage training strategy and an interactive pre-action reasoning mechanism. Finally, our model achieves a 46.8% improvement in inquiry success rate and the best overall success rate among existing baselines on InquireBench. The project page is available at https://bit-aqh.github.io/InquireMobile/homepage/.
GUI0: Self-Evolving Foundational GUI Agents in Super App Ecosystems
Xinyi Wang | Wei Dai | Kyle Qiao | Ke Wang | Peng Chen | Gang Cao | Kangqin | Zhongpu Wang | Xiaode Zhang | Yanming Liu | Jihao Gu | Jingtao Xu | Gong Zhi
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automated interaction with graphical user interfaces (GUIs) is central to General Artificial Intelligence yet remains challenging within Super App ecosystems, characterized by non-standard rendering and absent accessibility metadata. While GUI agents often rely on explicit accessibility trees or static imitation, they are less explored for dynamic environments marked by sparse feedback and implicit visual cues. We present GUI0, a framework synergizing autonomous data synthesis with dual-agent co-evolution. GUI0 establishes a domain-aware foundation model via synthesized corpora and employs curriculum-driven reinforcement learning, where a curriculum agent generates boundary tasks to optimize an actor agent. Empirical results demonstrate three key advantages: (1) state-of-the-art performance on the SuperAPP benchmark, outperforming Gemini-2.5-Pro and Claude-4-Sonnet; (2) universal efficacy across diverse base models, consistently yielding substantial improvements on both Qwen2.5-VL and GUI-Owl variants; and (3) robust zero-shot generalization to standard GUIs (e.g., +62.7% on ScreenSpot Pro).
2025
DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models
Jianyu Liu | Hangyu Guo | Ranjie Duan | Xingyuan Bu | Yancheng He | Shilong Li | Hui Huang | Jiaheng Liu | Yucheng Wang | Chenchen Jing | Xingwei Qu | Xiao Zhang | Pei Wang | Yanan Wu | Jihao Gu | Yangguang Li | Jianke Zhu
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Multimodal Large Language Models (MLLMs) pose unique safety challenges due to their integration of visual and textual data, thereby introducing new dimensions of potential attacks and complex risk combinations. In this paper, we begin with a detailed analysis aimed at disentangling risks through step-by-step reasoning within multimodal inputs. We find that systematic multimodal risk disentanglement substantially enhances the risk awareness of MLLMs. By leveraging the strong discriminative abilities of multimodal risk disentanglement, we further introduce DREAM (Disentangling Risks to Enhance Safety Alignment in MLLMs), a novel approach that enhances safety alignment in MLLMs through supervised fine-tuning and iterative Reinforcement Learning from AI Feedback (RLAIF). Experimental results show that DREAM significantly boosts safety during both inference and training phases without compromising performance on normal tasks (i.e., avoiding oversafety), achieving a 16.17% improvement in the SIUO safe&effective score compared to GPT-4V.
Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation
Jihao Gu | Yingyao Wang | Meng Cao | Pi Bu | Jun Song | Bo Zheng | Yancheng He | Shilong Li
Findings of the Association for Computational Linguistics: EMNLP 2025
Direct Preference Optimization (DPO) has been demonstrated to be highly effective in mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences. Despite the recent progress, existing methods suffer from two drawbacks: 1) a lack of scalable token-level rewards; and 2) neglect of visual-anchored tokens. To this end, we propose a novel Token Preference Optimization model with self-calibrated rewards (dubbed TPO), which adaptively attends to visually correlated tokens without fine-grained annotations. Specifically, we introduce a token-level visual-anchored reward as the difference of the logistic distributions of generated tokens conditioned on the raw image and the corrupted one. In addition, to highlight the informative visual-anchored tokens, a visual-aware training objective is proposed to enable more accurate token-level optimization. Extensive experimental results demonstrate the state-of-the-art performance of the proposed TPO. For example, building on top of LLaVA and Qwen, TPO yields absolute performance improvements on hallucination benchmarks.
2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision
Shilong Li | Yancheng He | Hui Huang | Xingyuan Bu | Jiaheng Liu | Hangyu Guo | Weixun Wang | Jihao Gu | Wenbo Su | Bo Zheng
Findings of the Association for Computational Linguistics: NAACL 2025
Recent advancements in Direct Preference Optimization (DPO) have significantly enhanced the alignment of Large Language Models (LLMs) with human preferences, owing to its simplicity and effectiveness. However, existing methods typically optimize a scalar score or ranking reward, thereby overlooking the multi-dimensional nature of human preferences. In this work, we propose to extend the preference of DPO to two dimensions: segments and aspects. We first introduce a 2D supervision dataset called HelpSteer-2D. For the segment dimension, we divide the response into sentences and assign scores to each segment. For the aspect dimension, we meticulously design several criteria covering the response quality rubrics. With the 2-dimensional signals as feedback, we develop a 2D-DPO framework, decomposing the overall objective into multi-segment and multi-aspect objectives. Extensive experiments on popular benchmarks demonstrate that 2D-DPO performs better than methods that optimize for scalar or 1-dimensional preferences.
See the World, Discover Knowledge: A Chinese Factuality Evaluation for Large Vision Language Models
Jihao Gu | Yingyao Wang | Pi Bu | Chen Wang | Ziming Wang | Tengtao Song | Donglai Wei | Jiale Yuan | Yingxiu Zhao | Yancheng He | Shilong Li | Jiaheng Liu | Meng Cao | Jun Song | Yingshui Tan | Xiang Li | Wenbo Su | Xiaoyong Zhu | Bo Zheng
Findings of the Association for Computational Linguistics: ACL 2025
The evaluation of factual accuracy in large vision language models (LVLMs) has lagged behind their rapid development, making it challenging to fully reflect these models’ knowledge capacity and reliability. In this paper, we introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA, aimed at assessing the visual factuality of LVLMs across 8 major topics and 56 subtopics. The key features of this benchmark include a focus on the Chinese language, diverse knowledge types, multi-hop question construction, high-quality data, static consistency, and easy evaluation through short answers. Moreover, we contribute a rigorous data construction pipeline and decouple visual factuality into two parts: seeing the world (i.e., object recognition) and discovering knowledge. This decoupling allows us to analyze the capability boundaries and execution mechanisms of LVLMs. Subsequently, we evaluate 34 advanced open-source and closed-source models, revealing critical performance gaps within this field.
2024
From Bottom to Top: Extending the Potential of Parameter Efficient Fine-Tuning
Jihao Gu | Zelin Wang | Yibo Zhang | Ziji Zhang | Ping Gong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
With the proliferation of large language models, Parameter Efficient Fine-Tuning (PEFT) methods, which freeze pre-trained parameters and only fine-tune a few task-specific parameters, are playing an increasingly important role. However, previous work primarily applied uniform operations across all layers of the model, overlooking the fact that different layers in a transformer store different information. In the process of exploration, we find that there are significant differences in fine-tuning strategies between different layers, and that fine-tuning only a subset of layers can even achieve comparable performance. Based on this, we propose the Hybrid LoRA-Prefix Tuning (HLPT) method, which uses enhanced LoRA and Prefix-tuning methods with a learnable adaptive mechanism separately for the bottom and top layers, and the Half Hybrid LoRA-Prefix Tuning (H2LPT) method, which goes a step further, reducing the parameter count to nearly half by omitting fine-tuning in the middle layers. Extensive experiments with large language models on various downstream tasks provide strong evidence for the potential of PEFT focusing on different layers’ interactions and the effectiveness of our methods. Furthermore, we validate the robustness of these methods and their advantages in speeding up training convergence and reducing inference time requirements.
Co-authors
- Pi Bu 4
- Yancheng He 4
- Shilong Li 4
- Jun Song 4
- Yingyao Wang 4
- Bo Zheng 4
- Jiaheng Liu 3
- Qihang Ai 2
- Xingyuan Bu 2
- Meng Cao 2
- Yue Cao 2
- Hangyu Guo 2
- Hui Huang 2
- Wei Jiang 2
- Yuning Jiang 2
- Wenbo Su 2
- Jingxuan Xing 2
- Yingxiu Zhao 2
- Zekun Zhu 2
- Gang Cao 1
- Peng Chen 1
- Wei Dai 1
- Ranjie Duan 1
- Ping Gong 1
- Chenchen Jing 1
- Kangqin 1
- Yangguang Li 1
- Xiang Li 1
- Jianyu Liu 1
- Yanming Liu 1
- Kyle Qiao 1
- Xingwei Qu 1
- Tengtao Song 1
- Yingshui Tan 1
- Yucheng Wang 1
- Pei Wang 1
- Zelin Wang 1
- Ziming Wang 1
- Weixun Wang 1
- Xinyi Wang 1
- Ke Wang 1
- Zhongpu Wang 1
- Chen Wang 1
- Ziming Wang 1
- Donglai Wei 1
- Yanan Wu 1
- Jingtao Xu 1
- Jiale Yuan 1
- Xiao Zhang 1
- Yibo Zhang 1
- Ziji Zhang 1
- Ming-Liang Zhang 1
- Xiaode Zhang 1
- Zhicheng Zheng 1
- Gong Zhi 1
- Jianke Zhu 1
- Xiaoyong Zhu 1