Yingyao Wang
2026
Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training
Jihao Gu | Qihang Ai | Yingyao Wang | Pi Bu | Jingxuan Xing | Yue Cao | Zekun Zhu | Wei Jiang | Ziming Wang | Yingxiu Zhao | Ming-Liang Zhang | Jun Song | Yuning Jiang | Bo Zheng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Vision-language model-based mobile agents have gained the ability to understand complex instructions and mobile screenshots, benefiting from reinforcement learning paradigms like Group Relative Policy Optimization (GRPO). However, existing approaches centered on offline training or local action-level rewards often trap agents in local optima, hindering effective exploration and error correction with the environment. Crucially, we find that directly applying task-level rewards often leads to convergence difficulties due to the sparse nature of GUI interactions. To address these challenges, we present Mobile-R1, a systematic training recipe that bridges atomic action execution and strategic task completion. We propose a hierarchical curriculum consisting of three stages: (1) format alignment for reasoning structure, (2) on-policy exploration with verifiable action feedback to ground basic execution, and (3) multi-turn task-level training in a realistic environment to unlock exploration and self-correction. This hierarchical strategy effectively bootstraps the agent, significantly enhancing its capability for exploration and self-correction (the “Eureka” moments). Furthermore, addressing the critical scarcity of diverse GUI data in non-English ecosystems, we contribute a comprehensive Chinese mobile dataset covering 28 applications with 24,521 high-quality manual annotations, and establish a rigorous benchmark with 500 trajectories. We will open source all resources, including the dataset, benchmark, model weights, and code: https://mobile-r1.github.io/Mobile-R1/.
Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language
Peijie Wang | Ming-Liang Zhang | Jun Cao | Chao Deng | Dekang Ran | Pi Bu | Hongda Sun | Xuan Zhang | Yingyao Wang | Jun Song | Bo Zheng | Fei Yin | Cheng-Lin Liu
Findings of the Association for Computational Linguistics: ACL 2026
Multimodal Large Language Models (MLLMs) have achieved remarkable progress but continue to struggle with geometric reasoning, primarily due to a perception bottleneck regarding fine-grained visual elements. While formal languages have aided plane geometry understanding, solid geometry, which requires spatial understanding, remains largely unexplored. In this paper, we address this challenge by designing a unified formal language that integrates plane and solid geometry, comprehensively covering geometric structures and semantic relations. We construct GDP-29K, a large-scale dataset comprising 20k plane and 9k solid geometry samples collected from diverse real-world sources, each paired with its ground-truth formal description. We propose a training paradigm combining Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards, which effectively enforces syntactic correctness and geometric consistency. Experiments show that our approach achieves state-of-the-art parsing performance. Furthermore, we demonstrate that our parsed formal descriptions serve as a critical cognitive scaffold, significantly boosting MLLMs’ capabilities for downstream geometry reasoning tasks.
InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning
Qihang Ai | Pi Bu | Yue Cao | Yingyao Wang | Jihao Gu | Jingxuan Xing | Zekun Zhu | Wei Jiang | Zhicheng Zheng | Jun Song | Yuning Jiang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in Vision-Language Models (VLMs) have enabled mobile agents to perceive and interact with real-world mobile environments based on human instructions. However, the current fully autonomous paradigm poses potential safety risks when model understanding or reasoning capabilities are insufficient. To address this challenge, we first introduce InquireBench, a comprehensive benchmark specifically designed to evaluate mobile agents’ capabilities in safe interaction and proactive inquiry with users, encompassing 5 categories and 22 sub-categories, where most existing VLM-based agents demonstrate near-zero performance. In this paper, we aim to develop an interactive system that actively seeks human confirmation at critical decision points. To achieve this, we propose InquireMobile, a novel model inspired by reinforcement learning, featuring a two-stage training strategy and an interactive pre-action reasoning mechanism. Finally, our model achieves a 46.8% improvement in inquiry success rate and the best overall success rate among existing baselines on InquireBench. The project page is available at https://bit-aqh.github.io/InquireMobile/homepage/.
2025
Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation
Jihao Gu | Yingyao Wang | Meng Cao | Pi Bu | Jun Song | Bo Zheng | Yancheng He | Shilong Li
Findings of the Association for Computational Linguistics: EMNLP 2025
Direct Preference Optimization (DPO) has been demonstrated to be highly effective in mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences. Despite recent progress, existing methods suffer from two drawbacks: 1) a lack of scalable token-level rewards; and 2) neglect of visual-anchored tokens. To this end, we propose a novel Token Preference Optimization model with self-calibrated rewards (dubbed TPO), which adaptively attends to visually correlated tokens without fine-grained annotations. Specifically, we introduce a token-level visual-anchored reward as the difference of the logistic distributions of generated tokens conditioned on the raw image and the corrupted one. In addition, to highlight the informative visual-anchored tokens, a visual-aware training objective is proposed to enable more accurate token-level optimization. Extensive experimental results demonstrate the state-of-the-art performance of the proposed TPO. For example, when built on top of LLaVA and Qwen, TPO yields absolute performance improvements on hallucination benchmarks.
See the World, Discover Knowledge: A Chinese Factuality Evaluation for Large Vision Language Models
Jihao Gu | Yingyao Wang | Pi Bu | Chen Wang | Ziming Wang | Tengtao Song | Donglai Wei | Jiale Yuan | Yingxiu Zhao | Yancheng He | Shilong Li | Jiaheng Liu | Meng Cao | Jun Song | Yingshui Tan | Xiang Li | Wenbo Su | Xiaoyong Zhu | Bo Zheng
Findings of the Association for Computational Linguistics: ACL 2025
The evaluation of factual accuracy in large vision language models (LVLMs) has lagged behind their rapid development, making it challenging to fully reflect these models’ knowledge capacity and reliability. In this paper, we introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA, aimed at assessing the visual factuality of LVLMs across 8 major topics and 56 subtopics. The key features of this benchmark include a focus on the Chinese language, diverse knowledge types, multi-hop question construction, high-quality data, static consistency, and easy evaluation through short answers. Moreover, we contribute a rigorous data construction pipeline and decouple visual factuality into two parts: seeing the world (i.e., object recognition) and discovering knowledge. This decoupling allows us to analyze the capability boundaries and execution mechanisms of LVLMs. Subsequently, we evaluate 34 advanced open-source and closed-source models, revealing critical performance gaps within this field.
2022
MuGER2: Multi-Granularity Evidence Retrieval and Reasoning for Hybrid Question Answering
Yingyao Wang | Junwei Bao | Chaoqun Duan | Youzheng Wu | Xiaodong He | Tiejun Zhao
Findings of the Association for Computational Linguistics: EMNLP 2022
Hybrid question answering (HQA) aims to answer questions over heterogeneous data, including tables and passages linked to table cells. The heterogeneous data can provide evidence at different granularities to HQA models, e.g., column, row, cell, and link. Conventional HQA models usually retrieve coarse- or fine-grained evidence to reason about the answer. Through comparison, we find that coarse-grained evidence is easier to retrieve but contributes less to the reasoner, while fine-grained evidence is the opposite. To preserve the advantages and eliminate the disadvantages of evidence at different granularities, we propose MuGER2, a Multi-Granularity Evidence Retrieval and Reasoning approach. In evidence retrieval, a unified retriever is designed to learn the multi-granularity evidence from the heterogeneous data. In answer reasoning, an evidence selector is proposed to navigate the fine-grained evidence for the answer reader based on the learned multi-granularity evidence. Experimental results on the HybridQA dataset show that MuGER2 significantly boosts HQA performance. Further ablation analysis verifies the effectiveness of both the retrieval and reasoning designs.
2020
Learning to Decouple Relations: Few-Shot Relation Classification with Entity-Guided Attention and Confusion-Aware Training
Yingyao Wang | Junwei Bao | Guangyi Liu | Youzheng Wu | Xiaodong He | Bowen Zhou | Tiejun Zhao
Proceedings of the 28th International Conference on Computational Linguistics
This paper aims to enhance few-shot relation classification, especially for sentences that jointly describe multiple relations. Because some relations frequently co-occur in the same context, previous few-shot relation classifiers struggle to distinguish them with few annotated instances. To alleviate this relation confusion problem, we propose CTEG, a model equipped with two novel mechanisms to learn to decouple these easily confused relations. On the one hand, an Entity-Guided Attention (EGA) mechanism, which leverages the syntactic relations and relative positions between each word and the specified entity pair, is introduced to guide the attention to filter out information causing confusion. On the other hand, a Confusion-Aware Training (CAT) method is proposed to explicitly learn to distinguish relations by playing a pushing-away game between classifying a sentence into a true relation and its confusing relation. Extensive experiments are conducted on the FewRel dataset, and the results show that our proposed model achieves results comparable to, and in some cases much better than, strong baselines in terms of accuracy. Furthermore, the ablation test and case study verify the effectiveness of our proposed EGA and CAT, especially in addressing the relation confusion problem.
Table Fact Verification with Structure-Aware Transformer
Hongzhi Zhang | Yingyao Wang | Sirui Wang | Xuezhi Cao | Fuzheng Zhang | Zhongyuan Wang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Verifying facts on semi-structured evidence like tables requires the ability to encode structural information and perform symbolic reasoning. Pre-trained language models trained on natural language cannot be directly applied to encode tables, because simply linearizing tables into sequences loses the cell alignment information. To better utilize pre-trained transformers for table representation, we propose a Structure-Aware Transformer (SAT), which injects the table structural information into the mask of the self-attention layer. A method to combine symbolic and linguistic reasoning is also explored for this task. Our method outperforms the baseline by 4.93% on TabFact, a large-scale table verification dataset.
Co-authors
- Pi Bu 5
- Jun Song 5
- Jihao Gu 4
- Bo Zheng 4
- Qihang Ai 2
- Junwei Bao 2
- Meng Cao 2
- Yue Cao 2
- Xiaodong He 2
- Yancheng He 2
- Wei Jiang 2
- Yuning Jiang 2
- Shilong Li 2
- Youzheng Wu 2
- Jingxuan Xing 2
- Ming-Liang Zhang 2
- Tiejun Zhao (赵铁军) 2
- Yingxiu Zhao 2
- Zekun Zhu 2
- Jun Cao 1
- Xuezhi Cao 1
- Chao Deng 1
- Chaoqun Duan 1
- Xiang Li 1
- Guangyi Liu 1
- Cheng-Lin Liu 1
- Jiaheng Liu 1
- Dekang Ran 1
- Tengtao Song 1
- Wenbo Su 1
- Hongda Sun 1
- Yingshui Tan 1
- Ziming Wang 1
- Peijie Wang 1
- Chen Wang 1
- Ziming Wang 1
- Sirui Wang 1
- Zhongyuan Wang 1
- Donglai Wei 1
- Fei Yin 1
- Jiale Yuan 1
- Xuan Zhang 1
- Hongzhi Zhang 1
- Fuzheng Zhang 1
- Zhicheng Zheng 1
- Bowen Zhou 1
- Xiaoyong Zhu 1