Jiazheng Zhang
2026
AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments
Zhiheng Xi | Dingwen Yang | Jiaqi Liu | Jixuan Huang | Honglin Guo | Baodai Huang | Tinggang Chen | Qi Zhang | Zhonghang Lu | Chenyu Liu | Jiajun Sun | Jiazheng Zhang | Dingwei Zhu | Xin Guo | Junzhe Wang | Zhihao Zhang | Yuming Yang | Junjie Ye | Minghe Gao | Dongrui Liu | Jiaming Ji | Guohao Li | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language agents, i.e., LLM agents, progress rapidly and are increasingly deployed in production environments. This trend underscores the urgent need for rigorous and realistic evaluations. However, most existing benchmarks evaluate agents in simplified, idealized settings. They typically rely on pre-packaged tool interfaces, overlook critical steps, and assume inputs are clean and fully specified. Consequently, they understate the difficulty of real deployments, where uncertainty and noise are ubiquitous and agents must proactively explore the environment to uncover new tools. To bridge this gap, we present AgentGym2, a new evaluation framework with task instances grounded in real-world end-to-end working demands. Beyond reasoning and planning, it measures agents’ ability to execute end-to-end procedures, discover tools via exploration, compose tools for unseen tasks, and remain robust to noisy and underspecified information. Experiments on 15 proprietary and open-source models show that even SOTA systems like Gemini and GPT-5 struggle on AgentGym2, revealing a substantial gap between the capability of current agents and the demands of real-world applications.
DARM: Distribution-Aware Reward Modeling by Alleviating Biases from Low Preference-Context Dependency Data
Shaofan Liu | Guoqiang Zhang | Shihan Dou | Huiyuan Zheng | Yiming Zhou | Junjie Ye | Shaowen Wang | Shichun Liu | Jiazheng Zhang | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reward models (RMs) are the surrogate objectives in reinforcement learning from human feedback (RLHF), and their scores directly steer policy optimization. We show that standard RM training is vulnerable in data subsets where response quality depends only weakly on the context: such instances encourage the RM to ignore the context, leading to context neglect and degraded accuracy. To address this failure mode, we propose Distribution-Aware Reward Modeling (DARM), which augments the RM objective with a conditional mutual information regularizer that maximizes the mutual information between the context and the predicted reward, conditioned on the response. By explicitly preserving the sensitivity of reward signals to the prompting context, DARM reduces over-reliance on response-only features and improves robustness to contextual variation. Extensive experiments across in-distribution and out-of-distribution settings show that DARM-trained RMs deliver more accurate and consistent scoring than strong baselines. We further evaluate its downstream impact in RLHF, where DARM produces better-aligned policies. We also demonstrate the necessity of each DARM design component and the impact of key parameters on performance through ablation experiments.
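As a reading aid, the regularized objective described in this abstract can be sketched as follows. This is an assumption-laden sketch, not the paper's formulation: the symbols C (context), Y (response), R_θ (predicted reward), the base loss L_RM, and the weight λ are all introduced here for illustration. Only the definition of conditional mutual information in the second line is standard.

```latex
% Hypothetical form of a CMI-regularized reward-model objective
% (\mathcal{L}_{\mathrm{RM}} and \lambda are illustrative assumptions)
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{RM}}(\theta) - \lambda \, I(C; R_\theta \mid Y)

% Standard definition of conditional mutual information
I(C; R \mid Y) = \mathbb{E}_{p(c, r, y)} \left[ \log \frac{p(c, r \mid y)}{p(c \mid y)\, p(r \mid y)} \right]
```

Under this reading, the regularizer is zero when the predicted reward is independent of the context given the response, which is exactly the context-neglect case the abstract describes.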
VRPO: Rethinking Value Modeling for Robust RL under Noisy Supervision in LLM Post-Training
Dingwei Zhu | Shihan Dou | Zhiheng Xi | Senjie Jin | Guoqiang Zhang | Jiazheng Zhang | Junjie Ye | Mingxu Chai | Enyu Zhou | Ming Zhang | Yuhui Wang | Caishuang Huang | Chenhao Huang | Yunke Zhang | Yuran Wang | Tao Gui | Qi Zhang | Xipeng Qiu | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement Learning (RL) in real-world environments often suffers from ambiguous or incomplete reward supervision, which undermines policy stability and generalization. Such noise may cause models to ignore key information or even collapse in advantage estimation. We find that a strong value model is essential for absorbing unstable signals and producing reliable advantages, offering denser and more robust supervision than the reward model. To better optimize under noisy supervision, we propose VRPO, a framework that enhances value modeling for robust RL in LLM post-training. VRPO integrates (1) auxiliary losses guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck, enabling the value model to filter noise and capture key words. This design allows the value model to correct noisy rewards and generate more reliable advantage estimates, transforming it from a passive predictor into an active noise regulator. Experiments on multi-turn dialogue, math reasoning, and science QA with both rule-based and model-based rewards show that VRPO consistently outperforms baselines such as PPO and GRPO. Our work highlights the central role of the value model in robust RL and provides a principled and practical approach to policy optimization under noisy supervision.
2025
Better Process Supervision with Bi-directional Rewarding Signals
Wenxiang Chen | Wei He | Zhiheng Xi | Honglin Guo | Boyang Hong | Jiazheng Zhang | Nijun Li | Tao Gui | Yun Li | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2025
Process supervision, i.e., evaluating each step, is critical for complex large language model (LLM) reasoning and test-time searching with increased inference compute. Existing approaches, represented by process reward models (PRMs), primarily focus on rewarding signals up to the current step, exhibiting a one-directional nature and lacking a mechanism to model the distance to the final target. To address this problem, we draw inspiration from the A* algorithm, which states that an effective supervisory signal should simultaneously consider the incurred cost and the estimated cost for reaching the target. Building on this key insight, we introduce BiRM, a novel process supervision model that not only evaluates the correctness of previous steps but also models the probability of future success. We conduct extensive experiments on mathematical reasoning tasks and demonstrate that BiRM provides more precise evaluations of LLM reasoning steps, achieving an improvement of 3.1% on Gaokao2023 over PRM under the Best-of-N sampling method. Besides, in search-based strategies, BiRM provides more comprehensive guidance and outperforms ORM by 5.0% and PRM by 3.8% respectively on MATH-500.
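For reference, the A* evaluation function that this abstract draws on can be written as below. The first line is the standard A* definition. The mapping in the comments to reasoning steps is an informal paraphrase of the abstract, and how BiRM actually combines the two signals is not specified here.

```latex
% Standard A* evaluation function for a node n
f(n) = g(n) + h(n)

% g(n): cost already incurred from the start to n
%       (analogue in the abstract: correctness of the previous steps)
% h(n): estimated cost from n to the target
%       (analogue in the abstract: probability of future success)
```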
DocFusion: A Unified Framework for Document Parsing Tasks
Mingxu Chai | Ziyu Shen | Chong Zhang | Yue Zhang | Xiao Wang | Shihan Dou | Jihua Kang | Jiazheng Zhang | Qi Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Document parsing involves layout element detection and recognition, essential for extracting information. However, existing methods often employ multiple models for these tasks, leading to increased system complexity and maintenance overhead. While some models attempt to unify detection and recognition, they often fail to address the intrinsic differences in data representations, thereby limiting performance in document processing. Our research reveals that recognition relies on discrete tokens, whereas detection relies on continuous coordinates, leading to challenges in gradient updates and optimization. To bridge this gap, we propose the Gaussian-Kernel Cross-Entropy Loss (GK-CEL), enabling generative frameworks to handle both tasks simultaneously. Building upon GK-CEL, we propose DocFusion, a unified document parsing model with only 0.28B parameters. Additionally, we construct the DocLatex-1.6M dataset to provide high-quality training support. Experimental results show that DocFusion, equipped with GK-CEL, performs competitively across four core document parsing tasks, validating the effectiveness of our unified approach.
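The tension between discrete tokens and continuous coordinates can be illustrated with a small sketch. It assumes one plausible reading of a Gaussian-kernel cross-entropy: coordinates are quantized into bins, and the one-hot target is replaced by a Gaussian-shaped distribution centered on the true bin, so that near misses cost less than far misses. The function names, the binning, and `sigma` are hypothetical and are not taken from the paper's definition of GK-CEL.

```python
import math

def gaussian_soft_target(true_bin, num_bins, sigma):
    """Normalized Gaussian weights over coordinate bins, peaked at true_bin."""
    weights = [math.exp(-((b - true_bin) ** 2) / (2 * sigma ** 2))
               for b in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]

def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(t * (l - log_z) for t, l in zip(target, logits))

def peaked_logits(peak_bin, num_bins, height=5.0):
    """Toy prediction: one confident bin, all other logits zero."""
    return [height if b == peak_bin else 0.0 for b in range(num_bins)]

# Toy setup (hypothetical numbers): 10 bins, true coordinate falls in bin 4.
target = gaussian_soft_target(true_bin=4, num_bins=10, sigma=1.0)
loss_hit = soft_cross_entropy(peaked_logits(4, 10), target)
loss_near = soft_cross_entropy(peaked_logits(5, 10), target)
loss_far = soft_cross_entropy(peaked_logits(9, 10), target)
```

With a one-hot target, the near and far predictions would receive the same loss; with the Gaussian target, the loss grows with the distance from the true bin, which gives the coordinate tokens an ordering that plain cross-entropy ignores.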
Governance in Motion: Co-evolution of Constitutions and AI models for Scalable Safety
Chenhao Huang | Ziyu Shen | Yicong Ren | Huiyuan Zheng | Jiazheng Zhang | Mingxu Chai | Ming Zhang | Shihan Dou | Fan Mo | Jie Shi | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Aligning large language models (LLMs) with human preferences is a central challenge for building reliable AI systems. Most existing alignment approaches rely on static signals, such as predefined principles or offline human annotations, to guide model behavior toward a fixed approximation of human preferences. However, LLMs can exhibit distributional drift during training, and static alignment mechanisms lack the capacity to adaptively correct misaligned behaviors as they emerge. To address this limitation, we develop a two-stage framework that enables dynamic and continuous alignment. In the first stage, a constitution is continually revised based on observed model behaviors, and models are trained to comply with these evolving principles. In the second stage, this learned constitution is used to guide reinforcement learning, encouraging the model to align with the updated normative signals. We refer to this framework as COCOA: Co-evolution of Constitutions and AI Models. We show that COCOA enables a 7B model to greatly improve safety, raising its StrongReject score from 0.741 to 0.935 and its Safe-RLHF accuracy from 77.76% to 90.64% without human annotations, and reaching performance close to much larger state-of-the-art models.
Multi-Programming Language Sandbox for LLMs
Shihan Dou | Jiazheng Zhang | Jianxiang Zang | Yunbo Tao | Weikang Zhou | Haoxiang Jia | Shichun Liu | Yuming Yang | Shenxi Wu | Zhiheng Xi | Muling Wu | Rui Zheng | Changze Lv | Limao Xiong | Shaoqing Zhang | Lin Zhang | Wenyu Zhan | Rongxiang Weng | Jingang Wang | Xunliang Cai | Yueming Wu | Ming Wen | Yixin Cao | Tao Gui | Xipeng Qiu | Qi Zhang | Xuanjing Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
We introduce MPLSandbox, an out-of-the-box multi-programming language sandbox designed to provide unified and comprehensive feedback from compiler and analysis tools for Large Language Models (LLMs). It can automatically identify the programming language of the code, then compile and execute it within an isolated sub-sandbox to ensure safety and stability. In addition, MPLSandbox integrates both traditional and LLM-based code analysis tools, providing a comprehensive analysis of generated code. It can also be effortlessly integrated into the training and deployment of LLMs to improve the quality and correctness of generated code, and it helps researchers streamline their workflows for various LLM-based code-related tasks, reducing the development cost. To validate the effectiveness of MPLSandbox, we conduct extensive experiments by integrating it into several training and deployment scenarios, and employing it to optimize workflows for a wide range of downstream code tasks. Our goal is to enhance researcher productivity on LLM-based code tasks by simplifying and automating workflows through delegation to MPLSandbox.
2023
Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents
Yanfei Dong | Lambert Deng | Jiazheng Zhang | Xiaodong Yu | Ting Lin | Francesco Gelli | Soujanya Poria | Wee Sun Lee
Findings of the Association for Computational Linguistics: EACL 2023
Documents that consist of diverse templates and exhibit complex spatial structures pose a challenge for document entity classification. We propose KNN-Former, which incorporates a new kind of spatial bias in attention calculation based on the K-nearest-neighbor (KNN) graph of document entities. We limit entities’ attention only to their local radius defined by the KNN graph. We also use combinatorial matching to address the one-to-one mapping property that exists in many documents, where one field has only one corresponding entity. Moreover, our method is highly parameter-efficient compared to existing approaches in terms of the number of trainable parameters. Despite this, experiments across various datasets show our method outperforms baselines in most entity types. Many real-world documents exhibit combinatorial properties which can be leveraged as inductive biases to improve extraction accuracy, but existing datasets do not cover these documents. To facilitate future research into these types of documents, we release a new ID document dataset that covers diverse templates and languages. We also release enhanced annotations for an existing dataset.
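The idea of limiting each entity's attention to its KNN neighborhood can be sketched as a boolean attention mask. This is a minimal illustration under stated assumptions, not the KNN-Former implementation: entities are reduced to hypothetical 2D center points, distance is Euclidean, each entity may also attend to itself, and the mask is hard rather than a learned spatial bias.

```python
import math

def knn_attention_mask(centers, k):
    """mask[i][j] is True if j is i itself or one of i's k nearest neighbors."""
    n = len(centers)
    mask = [[False] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(centers):
        # Rank all other entities by Euclidean distance to entity i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.hypot(centers[j][0] - xi, centers[j][1] - yi),
        )
        mask[i][i] = True
        for j in others[:k]:
            mask[i][j] = True
    return mask

def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get zero weight."""
    m = max(s for s, a in zip(scores, allowed) if a)
    exps = [math.exp(s - m) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Toy layout (hypothetical): two pairs of entities far apart on one line.
centers = [(0, 0), (1, 0), (5, 0), (6, 0)]
mask = knn_attention_mask(centers, k=1)
weights = masked_softmax([1.0, 1.0, 1.0, 1.0], mask[0])
```

In this toy layout each entity attends only to itself and its single nearest neighbor, so attention never crosses between the two pairs. Note that a KNN graph is directed in general: j being among i's nearest neighbors does not imply the reverse.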
Co-authors
- Tao Gui 6
- Xuan-Jing Huang (黄萱菁) 6
- Shihan Dou 5
- Zhiheng Xi 4
- Qi Zhang 4
- Mingxu Chai 3
- Junjie Ye (叶俊杰) 3
- Qi Zhang 3
- Honglin Guo 2
- Chenhao Huang 2
- Shichun Liu 2
- Xipeng Qiu (邱锡鹏) 2
- Ziyu Shen 2
- Yuming Yang 2
- Guoqiang Zhang 2
- Huiyuan Zheng 2
- Dingwei Zhu 2
- Xunliang Cai 1
- Yixin Cao 1
- Tinggang Chen 1
- Wenxiang Chen 1
- Lambert Deng 1
- Yanfei Dong 1
- Minghe Gao 1
- Francesco Gelli 1
- Xin Guo 1
- Wei He 1
- Boyang Hong 1
- Baodai Huang 1
- Caishuang Huang 1
- Jixuan Huang 1
- Jiaming Ji 1
- Haoxiang Jia 1
- Senjie Jin 1
- Jihua Kang 1
- Wee Sun Lee 1
- Guohao Li 1
- Nijun Li 1
- Yun Li 1
- Ting Lin 1
- Chenyu Liu 1
- Dongrui Liu 1
- Jiaqi Liu 1
- Shaofan Liu 1
- Zhonghang Lu 1
- Changze Lv 1
- Fan Mo 1
- Soujanya Poria 1
- Yicong Ren 1
- Jie Shi 1
- Jiajun Sun 1
- Yunbo Tao 1
- Jingang Wang 1
- Junzhe Wang 1
- Shaowen Wang 1
- Xiao Wang 1
- Yuhui Wang 1
- Yuran Wang 1
- Ming Wen 1
- Rongxiang Weng 1
- Muling Wu 1
- Shenxi Wu 1
- Yueming Wu 1
- Limao Xiong 1
- Dingwen Yang 1
- Xiaodong Yu 1
- Jianxiang Zang 1
- Wenyu Zhan 1
- Chong Zhang 1
- Lin Zhang 1
- Ming Zhang 1
- Ming Zhang 1
- Qi Zhang 1
- Shaoqing Zhang 1
- Yue Zhang 1
- Yunke Zhang 1
- Zhihao Zhang 1
- Rui Zheng 1
- Enyu Zhou 1
- Weikang Zhou 1
- Yiming Zhou 1