Jun Song
2026
Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training
Jihao Gu | Qihang Ai | Yingyao Wang | Pi Bu | Jingxuan Xing | Yue Cao | Zekun Zhu | Wei Jiang | Ziming Wang | Yingxiu Zhao | Ming-Liang Zhang | Jun Song | Yuning Jiang | Bo Zheng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Vision-language model-based mobile agents have gained the ability to understand complex instructions and mobile screenshots, benefiting from reinforcement learning paradigms like Group Relative Policy Optimization (GRPO). However, existing approaches that center on offline training or local action-level rewards often trap agents in local optima, hindering effective exploration and error correction with the environment. Crucially, we find that directly applying task-level rewards often leads to convergence difficulties due to the sparse nature of GUI interactions. To address these challenges, we present Mobile-R1, a systematic training recipe that bridges atomic action execution and strategic task completion. We propose a hierarchical curriculum consisting of three stages: (1) format alignment for reasoning structure, (2) on-policy exploration with verifiable action feedback to ground basic execution, and (3) multi-turn task-level training in a realistic environment to unlock exploration and self-correction. This hierarchical strategy effectively bootstraps the agent, significantly enhancing its capability for exploration and self-correction (the “Eureka” moments). Furthermore, addressing the critical scarcity of diverse GUI data in non-English ecosystems, we contribute a comprehensive Chinese mobile dataset covering 28 applications with 24,521 high-quality manual annotations, and establish a rigorous benchmark with 500 trajectories. We will open source all resources, including the dataset, benchmark, model weights, and code: https://mobile-r1.github.io/Mobile-R1/.
Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language
Peijie Wang | Ming-Liang Zhang | Jun Cao | Chao Deng | Dekang Ran | Pi Bu | Hongda Sun | Xuan Zhang | Yingyao Wang | Jun Song | Bo Zheng | Fei Yin | Cheng-Lin Liu
Findings of the Association for Computational Linguistics: ACL 2026
Multimodal Large Language Models (MLLMs) have achieved remarkable progress but continue to struggle with geometric reasoning, primarily due to a perception bottleneck in recognizing fine-grained visual elements. While formal languages have aided plane geometry understanding, solid geometry, which requires spatial understanding, remains largely unexplored. In this paper, we address this challenge by designing a unified formal language that integrates plane and solid geometry, comprehensively covering geometric structures and semantic relations. We construct GDP-29K, a large-scale dataset comprising 20k plane and 9k solid geometry samples collected from diverse real-world sources, each paired with its ground-truth formal description. We propose a training paradigm combining Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards, which effectively enforces syntactic correctness and geometric consistency. Experiments show that our approach achieves state-of-the-art parsing performance. Furthermore, we demonstrate that our parsed formal descriptions serve as a critical cognitive scaffold, significantly boosting MLLMs’ capabilities for downstream geometry reasoning tasks.
InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning
Qihang Ai | Pi Bu | Yue Cao | Yingyao Wang | Jihao Gu | Jingxuan Xing | Zekun Zhu | Wei Jiang | Zhicheng Zheng | Jun Song | Yuning Jiang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in Vision-Language Models (VLMs) have enabled mobile agents to perceive and interact with real-world mobile environments based on human instructions. However, the current fully autonomous paradigm poses potential safety risks when model understanding or reasoning capabilities are insufficient. To address this challenge, we first introduce InquireBench, a comprehensive benchmark specifically designed to evaluate mobile agents’ capabilities in safe interaction and proactive inquiry with users, encompassing 5 categories and 22 sub-categories, on which most existing VLM-based agents demonstrate near-zero performance. In this paper, we aim to develop an interactive system that actively seeks human confirmation at critical decision points. To achieve this, we propose InquireMobile, a novel model inspired by reinforcement learning, featuring a two-stage training strategy and an interactive pre-action reasoning mechanism. Finally, our model achieves a 46.8% improvement in inquiry success rate and the best overall success rate among existing baselines on InquireBench. The project page is available at https://bit-aqh.github.io/InquireMobile/homepage/.
Unified Thinker: A General Reasoning Core for Image Generation
Sashuai Zhou | Qiang Zhou | Jijin Hu | Hanqing Yang | Yue Cao | Junpeng Ma | Yinchao Ma | Jun Song | Tiezheng Ge | Cheng Yu | Bo Zheng | Zhou Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning–execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap to current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.
2025
See the World, Discover Knowledge: A Chinese Factuality Evaluation for Large Vision Language Models
Jihao Gu | Yingyao Wang | Pi Bu | Chen Wang | Ziming Wang | Tengtao Song | Donglai Wei | Jiale Yuan | Yingxiu Zhao | Yancheng He | Shilong Li | Jiaheng Liu | Meng Cao | Jun Song | Yingshui Tan | Xiang Li | Wenbo Su | Xiaoyong Zhu | Bo Zheng
Findings of the Association for Computational Linguistics: ACL 2025
The evaluation of factual accuracy in large vision language models (LVLMs) has lagged behind their rapid development, making it challenging to fully reflect these models’ knowledge capacity and reliability. In this paper, we introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA, aimed at assessing the visual factuality of LVLMs across 8 major topics and 56 subtopics. The key features of this benchmark include a focus on the Chinese language, diverse knowledge types, multi-hop question construction, high-quality data, static consistency, and easy evaluation through short answers. Moreover, we contribute a rigorous data construction pipeline and decouple visual factuality into two parts: seeing the world (i.e., object recognition) and discovering knowledge. This decoupling allows us to analyze the capability boundaries and execution mechanisms of LVLMs. Subsequently, we evaluate 34 advanced open-source and closed-source models, revealing critical performance gaps within this field.
LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating
Chao Deng | Jiale Yuan | Pi Bu | Peijie Wang | Zhong-Zhi Li | Jian Xu | Xiao-Hui Li | Yuan Gao | Jun Song | Bo Zheng | Cheng-Lin Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large vision language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks have been limited to handling only a small number of pages and fail to provide a comprehensive analysis of layout element locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating the above three primary tasks and comprising 20 sub-tasks categorized based on different primary tasks and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs, covering more than 33,000 pages of documents, significantly exceeding the scale of existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field. The code and data are available at https://github.com/dengc2023/LongDocURL.
Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation
Jihao Gu | Yingyao Wang | Meng Cao | Pi Bu | Jun Song | Bo Zheng | Yancheng He | Shilong Li
Findings of the Association for Computational Linguistics: EMNLP 2025
Direct Preference Optimization (DPO) has been demonstrated to be highly effective in mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences. Despite the recent progress, existing methods suffer from two drawbacks: 1) a lack of scalable token-level rewards; and 2) neglect of visual-anchored tokens. To this end, we propose a novel Token Preference Optimization model with self-calibrated rewards (dubbed TPO), which adaptively attends to visually correlated tokens without fine-grained annotations. Specifically, we introduce a token-level visual-anchored reward as the difference of the logistic distributions of generated tokens conditioned on the raw image and the corrupted one. In addition, to highlight the informative visual-anchored tokens, a visual-aware training objective is proposed to enable more accurate token-level optimization. Extensive experimental results demonstrate the state-of-the-art performance of the proposed TPO. For example, when built on top of LLaVA and Qwen, TPO yields absolute performance improvements on hallucination benchmarks.
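The token-level visual-anchored reward described in the abstract above can be illustrated with a minimal sketch. It rests on an assumption, not on the paper's exact formulation: the "difference of distributions" is read here as the difference in log-probability of each generated token when the model is conditioned on the raw image versus a corrupted one. The function names and the toy logits are hypothetical and serve only to show the idea that tokens whose likelihood drops when the image is corrupted are the visually anchored ones.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def visual_anchored_scores(logits_raw, logits_corrupted, token_ids):
    """Hypothetical per-token score: log p(token | raw image) minus
    log p(token | corrupted image). A larger score suggests the token
    depends more on the visual input. This is an assumed reading of the
    abstract, not the authors' implementation."""
    scores = []
    for raw, corrupted, tok in zip(logits_raw, logits_corrupted, token_ids):
        scores.append(log_softmax(raw)[tok] - log_softmax(corrupted)[tok])
    return scores

# Toy example with a 3-word vocabulary and 2 generated tokens.
# Token 0 is favored only when the raw image is present; token 1 is unaffected.
logits_raw = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
logits_corrupted = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
token_ids = [0, 1]
scores = visual_anchored_scores(logits_raw, logits_corrupted, token_ids)
```

In this toy case the first token receives a positive score and the second a score of zero, which matches the intuition that only the first token is anchored in the image.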
2024
GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation
Shihao Cai | Keqin Bao | Hangyu Guo | Jizhi Zhang | Jun Song | Bo Zheng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models have seen widespread adoption in math problem-solving, yet for geometry problems, which often necessitate visual aids even for humans, the most advanced multi-modal models still struggle to effectively utilize image information. High-quality data is crucial for enhancing the geometric capabilities of multi-modal models, yet existing open-source datasets and related efforts are either too challenging for direct model learning or suffer from misalignment between text and images. To overcome this issue, we introduce a novel pipeline that leverages GPT-4 and GPT-4V to generate relatively basic geometry problems with aligned text and images, facilitating model learning. We have produced a dataset of 4.9K geometry problems and combined it with 19K open-source data to form our GeoGPT4V dataset. Experimental results demonstrate that the GeoGPT4V dataset significantly improves the geometry performance of various models on the MathVista and MathVision benchmarks. The code is available at https://anonymous.4open.science/r/GeoGPT4V-08B2.
Co-authors
- Pi Bu 6
- Bo Zheng 6
- Yingyao Wang 5
- Jihao Gu 4
- Yue Cao 3
- Qihang Ai 2
- Meng Cao 2
- Yancheng He 2
- Wei Jiang 2
- Yuning Jiang 2
- Shilong Li 2
- Cheng-Lin Liu 2
- Peijie Wang 2
- Jingxuan Xing 2
- Jiale Yuan 2
- Ming-Liang Zhang 2
- Yingxiu Zhao 2
- Zekun Zhu 2
- Keqin Bao 1
- Shihao Cai 1
- Jun Cao 1
- Chao Deng 1
- Chao Deng 1
- Yuan Gao 1
- Tiezheng Ge 1
- Hangyu Guo 1
- Jijin Hu 1
- Xiang Li 1
- Zhong-Zhi Li 1
- Xiao-Hui Li 1
- Jiaheng Liu 1
- Junpeng Ma 1
- Yinchao Ma 1
- Dekang Ran 1
- Tengtao Song 1
- Wenbo Su 1
- Hongda Sun 1
- Yingshui Tan 1
- Chen Wang 1
- Ziming Wang 1
- Ziming Wang 1
- Donglai Wei 1
- Jian Xu 1
- Hanqing Yang 1
- Fei Yin 1
- Cheng Yu 1
- Xuan Zhang 1
- Jizhi Zhang 1
- Zhou Zhao 1
- Zhicheng Zheng 1
- Bo Zheng 1
- Sashuai Zhou 1
- Qiang Zhou (周强) 1
- Xiaoyong Zhu 1