Chenglin Cai
2026
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
Siwei Wu | JinCheng Ren | Xeron Du | Shuyue Guo | Xingwei Qu | Yiming Liang | Jie Liu | Yunwen Li | Tyler Loakman | Tianyu Zheng | Boyu Feng | Huaqing Yuan | Zili Wang | Jiaheng Liu | Wenhao Huang | Chenglin Cai | Haoran Que | Jian Yang | Yuelin Bai | Zekun Moore Wang | Zhouliang Yu | Qunshu Lin | Ding Pan | Yuchen Eleanor Jiang | Tiannan Wang | Wangchunshu Zhou | Shenzhi Wang | Xingyuan Bu | Minghao Liu | Guoyin Wang | Ge Zhang | Chenghua Lin
Findings of the Association for Computational Linguistics: EACL 2026
Siwei Wu | JinCheng Ren | Xeron Du | Shuyue Guo | Xingwei Qu | Yiming Liang | Jie Liu | Yunwen Li | Tyler Loakman | Tianyu Zheng | Boyu Feng | Huaqing Yuan | Zili Wang | Jiaheng Liu | Wenhao Huang | Chenglin Cai | Haoran Que | Jian Yang | Yuelin Bai | Zekun Moore Wang | Zhouliang Yu | Qunshu Lin | Ding Pan | Yuchen Eleanor Jiang | Tiannan Wang | Wangchunshu Zhou | Shenzhi Wang | Xingyuan Bu | Minghao Liu | Guoyin Wang | Ge Zhang | Chenghua Lin
Findings of the Association for Computational Linguistics: EACL 2026
Existing Chinese preference datasets suffer from limited scale, restricted domain coverage, and insufficiently rigorous data validation. Human annotation significantly limits the scalability of human preference datasets. As a result, Chinese Alignment and Chinese Reward Models (CRM) have not yet been thoroughly explored. To address these challenges, we design an LLM-based data annotation pipeline with no human intervention. Based on this pipeline, we curate COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset consisting of 1M Chinese preference pairs and 92k carefully curated Chinese queries across diverse domains, including Chat, Coding, Maths, and others. We conduct experiments to verify the quality of COIG-P from two perspectives. (1) COIG-P brings significant performance improvements for the Qwen2/2.5 and Infinity-Instruct model series on AlignBench through DPO, with gains ranging from 2% to 12%. Furthermore, it significantly outperforms other existing Chinese preference datasets. (2) We train an 8B-sized CRM and manually annotate a Chinese Reward Benchmark (CRBench). Our CRM demonstrates robust scoring ability on CRBench. In addition, in practical data construction experiments, the quality of the data constructed by our CRM is comparable to that produced by GPT-4o.
MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments
Quyu Kong | Xu Zhang | Zhenyu Yang | Nolan Gao | Chen Liu | Panrong Tong | Chenglin Cai | Hanzhang Zhou | Jianan Zhang | Liangyu Chen | Zhidan Liu | Steven Hoi | Yue Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Quyu Kong | Xu Zhang | Zhenyu Yang | Nolan Gao | Chen Liu | Panrong Tong | Chenglin Cai | Hanzhang Zhou | Jianan Zhang | Liangyu Chen | Zhidan Liu | Steven Hoi | Yue Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While AndroidWorld has become the dominant mobile-use benchmark due to its reproducible environment and deterministic evaluation, recent agents achieving over 90% success rates indicate saturation and motivate the need for greater challenge. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. We introduce MobileWorld, a substantially more challenging benchmark with 201 tasks across 20 applications that reflects real-world usage through long-horizon, cross-application workflows requiring nearly twice as many steps (27.8 vs. 14.3) and featuring significantly more multi-app tasks (62.2% vs. 9.5%) than AndroidWorld. MobileWorld balances production-grade utility and reproducible evaluation using open-source alternatives to industry standards (e.g., Mattermost for Slack), enabling full observability through source code modification and direct database access. Beyond standard GUI manipulation, MobileWorld introduces novel task categories including agent-user interaction and Model Context Protocol (MCP)-augmented tasks for evaluating agents in user-aware, hybrid-tool scenarios. We develop a planner-executor framework with extended action spaces supporting user interactions and MCP calls. Results show a sharp performance drop from AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively, highlighting substantial room for future research.
Search
Fix author
Co-authors
- Yuelin Bai 1
- Xingyuan Bu 1
- Liangyu Chen 1
- Xeron Du 1
- Boyu Feng 1
- Nolan Gao 1
- Shuyue Guo 1
- Steven Hoi 1
- Wenhao Huang 1
- Yuchen Eleanor Jiang 1
- Quyu Kong 1
- Yunwen Li 1
- Yiming Liang 1
- Chenghua Lin 1
- Qunshu Lin 1
- Chen Liu 1
- Jiaheng Liu 1
- Jie Liu 1
- Minghao Liu 1
- Zhidan Liu 1
- Tyler Loakman 1
- Ding Pan 1
- Xingwei Qu 1
- Haoran Que 1
- JinCheng Ren 1
- Panrong Tong 1
- Guoyin Wang 1
- Shenzhi Wang 1
- Tiannan Wang 1
- Yue Wang 1
- Zekun Moore Wang 1
- Zili Wang 1
- Siwei Wu 1
- Jian Yang 1
- Zhenyu Yang 1
- Zhouliang Yu 1
- Huaqing Yuan 1
- Ge Zhang 1
- Jianan Zhang 1
- Xu Zhang 1
- Tianyu Zheng 1
- Hanzhang Zhou 1
- Wangchunshu Zhou 1