Ming-Liang Zhang
2026
Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training
Jihao Gu | Qihang Ai | Yingyao Wang | Pi Bu | Jingxuan Xing | Yue Cao | Zekun Zhu | Wei Jiang | Ziming Wang | Yingxiu Zhao | Ming-Liang Zhang | Jun Song | Yuning Jiang | Bo Zheng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Vision-language model-based mobile agents have gained the ability to understand complex instructions and mobile screenshots, benefiting from reinforcement learning paradigms like Group Relative Policy Optimization (GRPO). However, existing approaches centered on offline training or local action-level rewards often trap agents in local optima, hindering effective exploration and error correction with the environment. Crucially, we find that directly applying task-level rewards often leads to convergence difficulties due to the sparse nature of GUI interactions. To address these challenges, we present Mobile-R1, a systematic training recipe that bridges atomic action execution and strategic task completion. We propose a hierarchical curriculum consisting of three stages: (1) format alignment for reasoning structure, (2) on-policy exploration with verifiable action feedback to ground basic execution, and (3) multi-turn task-level training in a realistic environment to unlock exploration and self-correction. This hierarchical strategy effectively bootstraps the agent, significantly enhancing its capability for exploration and self-correction (the “Eureka” moments). Furthermore, addressing the critical scarcity of diverse GUI data in non-English ecosystems, we contribute a comprehensive Chinese mobile dataset covering 28 applications with 24,521 high-quality manual annotations, and establish a rigorous benchmark with 500 trajectories. We will open source all resources, including the dataset, benchmark, model weights, and code: https://mobile-r1.github.io/Mobile-R1/.
2025
CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models
Zhongzhi Li | Ming-Liang Zhang | Pei-Jie Wang | Jian Xu | Rui-Song Zhang | Yin Fei | Zhi-Long Ji | Jin-Feng Bai | Zhen-Ru Pan | Jiaxin Zhang | Cheng-Lin Liu
Proceedings of the 31st International Conference on Computational Linguistics
With the rapid advancements in multimodal large language models, evaluating their multimodal mathematical capabilities continues to receive wide attention. Although datasets such as MathVista have been introduced for evaluating mathematical capabilities in multimodal scenarios, there remains a lack of evaluation tools and datasets tailored for fine-grained assessment in Chinese K12 education. To systematically evaluate the ability of multimodal large models to solve Chinese multimodal mathematical problems, we propose a Chinese Multi-modal Math Skill Evaluation Benchmark (CMMaTH), containing 23,856 multimodal K12 math-related questions, making it the largest Chinese multimodal mathematical problem benchmark to date. CMMaTH includes questions ranging from elementary to high school levels, offering greater diversity in problem types, solution goals, visual elements, detailed knowledge points, and standard solution annotations. To facilitate stable, fast, and cost-free model evaluation, we have developed an open-source tool called GradeGPT, which is integrated with the CMMaTH dataset. Our data and code are available at https://github.com/zzli2022/CMMaTH.
2024
GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving
Jiaxin Zhang | Zhong-Zhi Li | Ming-Liang Zhang | Fei Yin | Cheng-Lin Liu | Yashar Moshfeghi
Findings of the Association for Computational Linguistics: ACL 2024
Recent advancements in large language models (LLMs) and multi-modal models (MMs) have demonstrated their remarkable capabilities in problem-solving. Yet, their proficiency in tackling geometry math problems, which necessitates an integrated understanding of both textual and visual information, has not been thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark, a comprehensive collection that includes a main subset of 2,000 problems, a 750-problem subset focusing on backward reasoning, an augmented subset of 2,000 problems, and a hard subset of 300 problems. This benchmark facilitates a deeper investigation into the performance of LLMs and MMs in solving geometry math problems. Our evaluation of ten LLMs and MMs across these varied subsets reveals that the WizardMath model excels, achieving a 55.67% accuracy rate on the main subset but only a 6.00% accuracy on the hard subset. This highlights the critical need for testing models against datasets on which they have not been pre-trained. Additionally, our findings indicate that GPT-series models perform more effectively on problems they have rephrased, suggesting a promising method for enhancing model capabilities.
LANS: A Layout-Aware Neural Solver for Plane Geometry Problem
Zhong-Zhi Li | Ming-Liang Zhang | Fei Yin | Cheng-Lin Liu
Findings of the Association for Computational Linguistics: ACL 2024
Geometry problem solving (GPS) is a challenging mathematical reasoning task requiring multi-modal understanding, fusion, and reasoning. Existing neural solvers take GPS as a vision-language task but fall short in representing geometry diagrams, which carry rich and complex layout information. In this paper, we propose a layout-aware neural solver named LANS, integrated with two new modules: a multimodal layout-aware pre-trained language module (MLA-PLM) and layout-aware fusion attention (LA-FA). MLA-PLM adopts structural-semantic pre-training (SSP) to implement global relationship modeling, and point-match pre-training (PMP) to achieve alignment between visual points and textual points. LA-FA employs a layout-aware attention mask to realize point-guided cross-modal fusion, further boosting the layout awareness of LANS. Extensive experiments on the Geometry3K and PGPS9K datasets validate the effectiveness of the layout-aware modules and the superior problem-solving performance of our LANS solver over existing symbolic and neural solvers. We have made our code and data publicly available.