Weinan Zhang
University College London
2026
Agent-Dice: Disentangling Knowledge Updates via Geometric Consensus for Agent Continual Learning
Zheng Wu | Xingyu Lou | Xinbei Ma | Yansi Li | Weiwen Liu | Weinan Zhang | Jun Wang | Zhuosheng Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Model (LLM)-based agents significantly extend the utility of LLMs by interacting with dynamic environments. However, enabling agents to continually learn new tasks without catastrophic forgetting remains a critical challenge, known as the stability–plasticity dilemma. In this work, we argue that this dilemma fundamentally arises from the failure to explicitly distinguish between common knowledge shared across tasks and conflicting knowledge introduced by task-specific interference. To address this, we propose Agent-Dice, a parameter fusion framework based on directional consensus evaluation. Concretely, Agent-Dice disentangles knowledge updates through a two-stage process: geometric consensus filtering to prune conflicting gradients, and curvature-based importance weighting to amplify shared semantics. We provide a rigorous theoretical analysis that establishes the validity of the proposed fusion scheme and offers insight into the origins of the stability–plasticity dilemma. Extensive experiments on GUI agents and tool-use agent domains demonstrate that Agent-Dice exhibits outstanding continual learning performance with minimal computational overhead and parameter updates.
Attribution-Based Analysis and Optimization of Modular Agentic Workflows
Yingxuan Yang | Bo Huang | Siyuan Qi | Chao Feng | Haoyi Hu | Yuxuan Zhu | Jinbo Hu | Haoran Zhao | Ziyi He | Xiao Liu | ZongYu Wang | Muning Wen | Lin Qiu | Xuezhi Cao | Xunliang Cai | Yong Yu | Weinan Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Agentic workflows solve complex tasks by orchestrating modular components (e.g., planning, reasoning, action, reflection) built on top of LLM backbones. A practical but underexplored question is model allocation: given a fixed workflow decomposition and a pool of candidate LLMs, which components should be upgraded (and with which models) to improve task performance, and how can we attribute gains to individual upgrades and their interactions? We present ShapleyFlow, a cooperative game theoretic framework that models component upgrades as players and evaluates component coalitions to compute Shapley values. This yields interaction-aware attribution and supports Shapley-guided configuration recommendation for model allocation under a fixed workflow structure. We further introduce CapaBench, a benchmark of 1,500+ tasks across seven domains (shopping, navigation, ticketing, mathematics, operating systems, robotic coordination, and automated theorem proving). Across 9 representative LLMs and all 2^4 = 16 upgrade coalitions in a 4-component workflow, ShapleyFlow provides (i) principled, interaction-aware attribution for modular workflows and (ii) actionable model-allocation recommendations that improve over strong single-model baselines.
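The attribution idea in this abstract can be illustrated with the textbook Shapley value computed over all coalitions of four components. The following is a minimal sketch, not the ShapleyFlow implementation: the function `shapley_values`, the scoring function `toy_score`, and all numbers are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a score."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when it joins coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical toy numbers, NOT results from the paper.
COMPONENTS = ["planning", "reasoning", "action", "reflection"]
GAIN = {"planning": 0.10, "reasoning": 0.20, "action": 0.05, "reflection": 0.02}


def toy_score(coalition):
    """Assumed task score when the components in `coalition` are upgraded."""
    score = 0.40 + sum(GAIN[c] for c in coalition)
    if {"planning", "reasoning"} <= coalition:
        score += 0.06  # assumed interaction bonus between two upgrades
    return score


phi = shapley_values(COMPONENTS, toy_score)
for name in COMPONENTS:
    print(f"{name}: {phi[name]:.3f}")
```

In this toy game the interaction bonus is split evenly between the two interacting components, and the values sum to the gain of upgrading everything over upgrading nothing, which is the efficiency property of Shapley values.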
Progra: Progress-Aware Reinforcement Learning for Multi-Turn Function Calling
Huacan Chai | Zijie Cao | Maolin Ran | Yingxuan Yang | Jianghao Lin | Xin Peng | Hairui Wang | Renjie Ding | Ziyu Wan | Muning Wen | Weiwen Liu | Weinan Zhang | Fei Huang | Ying Wen
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) have achieved impressive success in single-turn function calling, yet real-world applications such as travel planning or multi-stage data analysis typically unfold across multi-turn conversations. In these settings, LLMs must not only issue accurate function calls at each step but also maintain progress awareness, the ability to summarize past interactions and plan future actions to ensure coherent, long-horizon task execution. Existing approaches, however, either reduce multi-turn training to isolated single-turn samples, which neglects task-level planning, or employ end-to-end reinforcement learning (RL) that struggles with redundancy and lacks explicit integration of progress awareness. To overcome these limitations, we introduce Progra, a framework that explicitly incorporates progress awareness into LLM training for multi-turn function calling. Progra combines (i) a Progress Awareness Generation (PAG) pipeline, which automatically constructs datasets coupling conversation summaries with future task planning, and (ii) a Progress Awareness-Guided Reinforcement Learning (PAG-RL) algorithm, which integrates progress awareness into RL training to reduce contextual redundancy and improve alignment between local actions and global task completion. Empirical results on two public benchmarks demonstrate that Progra significantly outperforms existing methods, highlighting the effectiveness of progress awareness in enabling robust and efficient multi-turn function calling. Our code is available at https://github.com/FatCatCHC/Progra.
LoopTool: Closing the Data–Training Loop for Robust LLM Tool Calls
Kangning Zhang | Weiwen Liu | Wenxiang Jiao | Kounianhua Du | Yuan Lu | Weinan Zhang | Yong Yu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Augmenting Large Language Models (LLMs) with external tools enables them to execute complex, multi-step tasks. However, tool learning is hampered by the static synthetic data pipelines, where data generation and model training are executed as two separate, non-interactive processes. This approach fails to focus on the model’s specific weaknesses adaptively and allows noisy labels to persist, degrading training efficiency. We introduce LoopTool, a fully automated, model-aware data evolution framework that closes this loop by tightly integrating data synthesis and model training. LoopTool iteratively evolves both the data and the model through three synergistic modules: (1) Greedy Capability Probing (GCP) diagnoses the model’s mastered and failed capabilities; (2) Judgement-Guided Label Verification (JGLV) uses an open-source judge model to find and correct annotation errors, progressively purifying the dataset; and (3) Error-Driven Data Expansion (EDDE) generates new, challenging samples based on identified failures. This closed-loop process is tightly integrated with reinforcement learning training and operates within a cost-efficient, open-source ecosystem, thereby eliminating reliance on costly APIs. Experiments show that LoopTool-8B significantly surpasses its 32B data generator and achieves new state-of-the-art results on the BFCL-v3 and ACEBench benchmarks for its scale. Our work demonstrates that closed-loop, self-refining data pipelines can dramatically enhance the tool-use capabilities of LLMs.
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
Jianghao Lin | Yuanyuan Shi | Xin Peng | Renjie Ding | Hairui Wang | Yuxuan Peng | Bizhe Bai | Weixi Song | Fengshuo Bai | Huacan Chai | Weinan Zhang | Fei Huang | Ying Wen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) excel at function calling, but inference scaling has been explored mainly for unstructured generation. We propose an inference-scaling framework for structured outputs that combines fine-grained beam search with ToolPRM, a process reward model scoring each intra-call decision (function name and argument filling). We build the first fine-grained intra-call supervision dataset via function masking, rollout collection, and step-level annotation. ToolPRM outperforms outcome and coarse-grained reward models in predictive accuracy and yields consistent test-time gains on multiple function-calling benchmarks. We further show that structured generation follows “explore more but retain less”, since early JSON errors are unrecoverable.
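The combination of step-level scoring and beam search described in this abstract can be sketched as follows. This is an illustrative toy under stated assumptions, not the ToolPRM method: `step_beam_search`, `toy_scorer`, the candidate decisions, and the target call are all hypothetical, and the scorer is a hand-written stand-in for a learned process reward model.

```python
from typing import Callable, List, Sequence, Tuple

Partial = Tuple[str, ...]


def step_beam_search(
    steps: Sequence[Sequence[str]],
    score: Callable[[Partial], float],
    keep: int,
) -> List[Partial]:
    """Extend partial calls one decision at a time, keeping the `keep` best."""
    beams: List[Partial] = [()]
    for options in steps:
        # Explore: expand every surviving beam with every candidate decision.
        candidates = [beam + (option,) for beam in beams for option in options]
        # Retain: keep only the highest-scoring partial calls.
        candidates.sort(key=score, reverse=True)
        beams = candidates[:keep]
    return beams


# Hypothetical target call, used only to build a toy scorer.
TARGET: Partial = ("get_weather", "city=Paris", "unit=celsius")


def toy_scorer(partial: Partial) -> float:
    """Stand-in for a process reward model: counts decisions matching TARGET."""
    return sum(1.0 for got, want in zip(partial, TARGET) if got == want)


steps = [
    ["get_time", "get_weather"],      # decision 1: function name
    ["city=Paris", "city=London"],    # decision 2: first argument
    ["unit=celsius", "unit=kelvin"],  # decision 3: second argument
]
best = step_beam_search(steps, toy_scorer, keep=2)[0]
print(best)
```

Expanding every option at each decision while keeping only a small `keep` mirrors the "explore more but retain less" observation: a wrong early decision cannot be repaired later, so low-scoring prefixes are dropped immediately.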
A Survey of Large Language Model-Based Search Agents
Yunjia Xi | Jianghao Lin | Yongzhao Xiao | Zheli Zhou | Rong Shan | Te Gao | Jiachen Zhu | Weiwen Liu | Yong Yu | Weinan Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The advent of Large Language Models (LLMs) has significantly revolutionized web search. The emergence of LLM-based Search Agents marks a pivotal shift towards deeper, dynamic, autonomous information seeking. These agents can comprehend user intentions and environment context and execute multi-turn retrieval with dynamic planning, extending search capabilities far beyond the web. Leading examples like OpenAI’s Deep Research highlight their potential for deep information mining and real-world applications. This survey provides the first systematic analysis of search agents. We comprehensively analyze and categorize existing works from the perspectives of architecture, optimization, application, and evaluation, ultimately identifying critical open challenges and outlining promising future research directions in this rapidly evolving field.
ACE-Router: Generalizing History-Aware Routing from MCP Tools to the Agent Web
Zhiyuan Yao | Zishan Xu | Yifu Guo | Zhiguang Han | Cheng Yang | Shuo Zhang | Weinan Zhang | Xingshan Zeng | Weiwen Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the rise of the Agent Web and Model Context Protocol (MCP), the agent ecosystem is evolving into an open collaborative network, exponentially increasing accessible tools. However, current architectures face severe scalability and generality bottlenecks. To address this, we propose ACE-Router, a pipeline for training history-aware routers to empower precise navigation in large-scale ecosystems. By leveraging a dependency-rich candidate Graph to synthesize multi-turn trajectories, we effectively train routers with dynamic context understanding to create the plug-and-play Light Routing Agent. Experiments on the real-world benchmarks MCP-Universe and MCP-Mark demonstrate superior performance. Notably, ACE-Router exhibits critical properties for the future Agent Web: it not only generalizes to multi-agent collaboration with minimal adaptation but also maintains exceptional robustness against noise and scales effectively to massive candidate spaces. These findings provide a strong empirical foundation for universal orchestration in open-ended ecosystems. Our code is available at https://github.com/euyis1019/ACE-Router.
A Comprehensive Survey of Process Reward Models: Data Generation, Model Construction, and Usage
Congmin Zheng | Jiachen Zhu | Zhuoying Ou | Yuxiang Chen | Kangning Zhang | Rong Shan | Zeyu Zheng | Mengyue Yang | Jianghao Lin | Yong Yu | Weinan Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have advanced reasoning ability, yet conventional alignment remains dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
CoreCodeBench: Decoupling Code Intelligence via Fine-Grained Repository-Level Tasks
Lingyue Fu | Hao Guan | Bolun Zhang | Haowei Yuan | Yaoming Zhu | Lin Qiu | ZongYu Wang | Xuezhi Cao | Xunliang Cai | Weiwen Liu | Weinan Zhang | Yong Yu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The evaluation of Large Language Models (LLMs) for software engineering has shifted towards complex, repository-level tasks. However, existing benchmarks predominantly rely on coarse-grained pass rates that treat programming proficiency as a monolithic capability, obscuring specific cognitive bottlenecks. Furthermore, the static nature of these benchmarks renders them vulnerable to data contamination and performance saturation. To address these limitations, we introduce CoreCodeBench, a configurable repository-level benchmark designed to dissect coding capabilities through atomized tasks. Leveraging our automated framework, CorePipe, we extract and transform Python repositories into a comprehensive suite of tasks that isolate distinct cognitive demands within identical code contexts. Unlike static evaluations, CoreCodeBench supports controllable difficulty scaling to prevent saturation and ensures superior data quality. It achieves a 78.55% validity yield, significantly surpassing the 31.7% retention rate of SWE-bench-Verified. Extensive experiments with state-of-the-art LLMs reveal a significant capability misalignment, evidenced by distinct ranking shifts across cognitive dimensions. This indicates that coding proficiency is non-monolithic, as strength in one aspect does not necessarily translate to others. These findings underscore the necessity of our fine-grained taxonomy in diagnosing model deficiencies and offer a sustainable, rigorous framework for evolving code intelligence. Code of CorePipe framework and data of CoreCodeBench are available in https://github.com/AGI-Eval-Official/CoreCodeBench and https://huggingface.co/collections/tubehhh/corecodebench.
ColorBrowserAgent: Complex Long-Horizon Browser Agent with Adaptive Knowledge Evolution
Jihong Wang | Jiamu Zhou | Weiming Zhang | Teng Wang | Weiwen Liu | Zhuosheng Zhang | Xingyu Lou | Weinan Zhang | Huarong Deng | Jun Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
With the advancement of vision-language models, web automation has made significant progress. However, deploying autonomous agents in real-world settings remains challenging, primarily due to site heterogeneity, where generalist models lack domain-specific priors for diverse interfaces, and long-horizon instability, characterized by the accumulation of decision drift over extended interactions. To address these challenges, we introduce ColorBrowserAgent (Complex Long-Horizon Browser Agent), a knowledge-evolving agent for robust web automation. Our approach relies on two synergistic mechanisms: human-in-the-loop knowledge adaptation that transforms sparse human feedback into reusable domain knowledge, and knowledge-aligned progressive summarization that stabilizes long interactions through memory compression. Extensive experiments on WebArena, WebChoreArena, and an industrial deployment show that ColorBrowserAgent consistently outperforms strong baselines. It achieves a state-of-the-art success rate of 71.2% on WebArena and maintains 47.4% performance under the zero-shot transfer setting on WebChoreArena. In commercial deployment, it yields a 19.3% relative improvement in user satisfaction, verifying its robustness in real-world scenarios.
2025
Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation
Kounianhua Du | Hanjing Wang | Jianxing Liu | Jizheng Chen | Xinyi Dai | Yasheng Wang | Ruiming Tang | Yong Yu | Jun Wang | Weinan Zhang
Findings of the Association for Computational Linguistics: ACL 2025
To address these limitations, we propose BDC, a novel framework that Boosts reasoning exploration via multi-agent collaboration, Disentangles heterogeneous data into specialized experts, and Customizes solutions through dynamic model composition. BDC integrates a Monte Carlo Tree-of-Agents algorithm, where multiple LLMs mutually verify and refine reasoning paths through reflection-guided pruning, enabling efficient exploration of high-quality solutions. To handle data diversity, we cluster problems by latent semantics, train composable LoRA experts on each cluster, and deploy an input-aware hypernetwork to dynamically merge these experts into tailored solvers. Experiments on APPS and CodeContest benchmarks demonstrate BDC’s superiority: it achieves up to 73.8% accuracy on hard problems, outperforming state-of-the-art methods like LATS and RethinkMCTS by 9–15%. This work lays the groundwork for advancing LLM capabilities in complex reasoning tasks, offering a novel System2-to-System1 solution.
Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning
Jiachen Zhu | Congmin Zheng | Jianghao Lin | Kounianhua Du | Ying Wen | Yong Yu | Jun Wang | Weinan Zhang
Findings of the Association for Computational Linguistics: ACL 2025
While large language models (LLMs) have significantly advanced mathematical reasoning, Process Reward Models (PRMs) have been developed to evaluate the logical validity of reasoning steps. However, PRMs still struggle with out-of-distribution (OOD) challenges. This paper identifies two OOD issues: step OOD, arising from differences in reasoning patterns across model types and sizes, and question OOD, due to dataset shifts between training and real-world problems. To address these issues, we introduce the Retrieval-Augmented Process Reward Model (RetrievalPRM), a novel framework built on a two-stage retrieval-enhanced mechanism. RetrievalPRM retrieves semantically similar questions and steps for the PRM as a warmup to stimulate its potential to judge target steps, improving generalization and reasoning consistency across different models and problem types. Our extensive experiments demonstrate that RetrievalPRM outperforms existing baselines across multiple real-world datasets. Our open-source contributions include a retrieval-enhanced dataset, a tuning framework for PRM training, and the RetrievalPRM model, establishing a new standard for PRM performance.
CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation
Qingyao Li | Xinyi Dai | Xiangyang Li | Weinan Zhang | Yasheng Wang | Ruiming Tang | Yong Yu
Findings of the Association for Computational Linguistics: ACL 2025
Code generation is a critical reasoning task for large language models (LLMs). Recent advancements have focused on optimizing the thought process of code generation, achieving significant improvements. However, this thought process lacks effective process supervision, making the thoughts hard to optimize. Although Process Reward Models (PRMs) have been widely established in mathematical reasoning, building a code PRM remains nontrivial because of the gap between thoughts and code. In this paper, we propose CodePRM, a novel approach that leverages code execution feedback to build a code PRM. Specifically, we first collect a large dataset of thought traces, where each thought step is labeled with the pass rate of its derived code and accompanied by the corresponding code snippets and execution feedback. During training, we train a PRM to take both the reasoning process and code execution feedback as input to score individual thought steps, enabling it to leverage code execution results to distinguish between high-quality and low-quality thought steps. Finally, to use the PRM during inference, we develop a Generate-Verify-Refine (GVR) pipeline where CodePRM serves as a process verifier to dynamically identify and correct errors in the thought process during code search. Experimental results demonstrate that CodePRM with the inference algorithm outperforms strong baselines, significantly enhancing code generation performance. Further analysis reveals the key factors for building a code PRM.
HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Assistant Scenarios
Jun Wang | Jiamu Zhou | Xihuai Wang | Xiaoyun Mo | Haoyu Zhang | Qiqiang Lin | Cheng Jin | Muning Wen | Weinan Zhang | Qiuying Peng | Jun Wang
Findings of the Association for Computational Linguistics: ACL 2025
Evaluating the performance of LLMs in multi-turn human-agent interactions presents significant challenges, particularly due to the complexity and variability of user behavior. In this paper, we introduce HammerBench, a novel benchmark framework for assessing LLMs’ function-calling capabilities in real-world, multi-turn dialogues. HammerBench simulates diverse mobile assistant use cases, incorporating imperfect instructions, dynamic question-answer trajectories, intent and argument shifts, and the indirect use of external information through pronouns. To construct this benchmark, we curate a comprehensive dataset derived from popular mobile app functionalities and anonymized user logs, complemented by a cost-effective data generation pipeline leveraging open-source models. HammerBench is further augmented with fine-grained interaction snapshots and metrics, enabling detailed evaluation of function-calling performance across individual conversational turns. We demonstrate the effectiveness of HammerBench by evaluating several leading LLMs and uncovering key performance trends. Our experiments reveal that different types of parameter name errors are a significant source of failure across different interaction scenarios, highlighting critical areas for further improvement in LLM robustness for mobile assistant applications.
NL-Debugging: Exploiting Natural Language as an Intermediate Representation for Code Debugging
Weiming Zhang | Qingyao Li | Xinyi Dai | Jizheng Chen | Kounianhua Du | Weiwen Liu | Yasheng Wang | Ruiming Tang | Yong Yu | Weinan Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Debugging is a critical aspect of LLMs’ coding ability. Early debugging efforts primarily focused on code-level analysis, which often falls short when addressing complex programming errors that require a deeper understanding of algorithmic logic. Recent advancements in large language models (LLMs) have shifted attention toward leveraging natural language reasoning to enhance code-related tasks. However, two fundamental questions remain unanswered: What type of natural language format is most effective for debugging tasks? And what specific benefits does natural language reasoning bring to the debugging process? In this paper, we introduce NL-DEBUGGING, a novel framework that employs natural language as an intermediate representation to improve code debugging. By debugging at a natural language level, we demonstrate that NL-DEBUGGING outperforms traditional debugging methods and enables a broader modification space through direct refinement guided by execution feedback. Our findings highlight the potential of natural language reasoning to advance automated code debugging and address complex programming challenges.
RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation
Qingyao Li | Wei Xia | Xinyi Dai | Kounianhua Du | Weiwen Liu | Yasheng Wang | Ruiming Tang | Yong Yu | Weinan Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Tree search methods have demonstrated impressive performance in code generation. Previous methods combine tree search with reflection that summarizes past mistakes to achieve iterative improvement. However, these methods face significant challenges. First, they search directly within the code language space, neglecting the underlying reasoning process critical for effective code generation. Second, reflection-based approaches merely accumulate historical errors in memory without providing correct reasoning pathways, making it difficult for subsequent search iterations to identify optimal solutions, resulting in decreased search quality. In this work, we propose RethinkMCTS, a framework that systematically explores and refines the reasoning process for code generation. Specifically, we employ MCTS to search for thoughts before code generation and integrate MCTS with a refinement mechanism called rethink, which incorporates fine-grained code execution feedback to refine erroneous thoughts during the search. It ensures the search path aligns with better reasoning, improving overall search quality. Through extensive experiments, we demonstrate that RethinkMCTS outperforms previous search-based and feedback-enhanced code generation baselines.
DebateCoder: Towards Collective Intelligence of LLMs via Test Case Driven LLM Debate for Code Generation
Jizheng Chen | Kounianhua Du | Xinyi Dai | Weiming Zhang | Xihuai Wang | Yasheng Wang | Ruiming Tang | Weinan Zhang | Yong Yu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the impressive reasoning and text generation capabilities of large language models (LLMs), methods leveraging multiple LLMs to debate each other have garnered increasing attention. However, existing debate-based approaches remain limited in structured, detail-sensitive domains such as code generation for several reasons: 1) reliance on different instances of the same LLM for debate, neglecting the potential benefits of integrating diverse models with varied internal knowledge for more comprehensive code generation, 2) under-utilization of test cases, and 3) reliance on third-party LLM moderators for result consolidation and decision-making, which may introduce hallucinations and judgment errors. To address these challenges, we propose DebateCoder to harness the collective intelligence of LLMs via test case-driven debate for code generation. In DebateCoder, test cases serve as a medium for models to analyze code and identify bugs, while opposing models generate test cases to challenge each other’s code during the debate process. These test cases, along with their execution results, are used to refine and enhance the code through a novel contrastive analysis process. Furthermore, DebateCoder leverages test case outcomes to assess code quality and determine convergence criteria. Unlike previous approaches, DebateCoder emphasizes the collaborative improvement of both models through competitive debate and interactive analysis. Extensive experimental results on two datasets demonstrate the effectiveness of DebateCoder.
Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration
Shao Zhang | Xihuai Wang | Wenhao Zhang | Chaoran Li | Junru Song | Tingyu Li | Lin Qiu | Xuezhi Cao | Xunliang Cai | Wen Yao | Weinan Zhang | Xinbing Wang | Ying Wen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Agents built on large language models (LLMs) have excelled in turn-by-turn human-AI collaboration but struggle with simultaneous tasks requiring real-time interaction. Latency issues and the challenge of inferring variable human strategies hinder their ability to make autonomous decisions without explicit instructions. Through experiments with current independent *System 1* and *System 2* methods, we validate the necessity of using Dual Process Theory (DPT) in real-time tasks. We propose DPT-Agent, a novel language agent framework that integrates *System 1* and *System 2* for efficient real-time simultaneous human-AI collaboration. DPT-Agent’s *System 1* uses a Finite-state Machine (FSM) and code-as-policy for fast, intuitive, and controllable decision-making. DPT-Agent’s *System 2* integrates Theory of Mind (ToM) and asynchronous reflection to infer human intentions and perform reasoning-based autonomous decisions. We demonstrate the effectiveness of DPT-Agent through further experiments with rule-based agents and human collaborators, showing significant improvements over mainstream LLM-based frameworks. To the best of our knowledge, DPT-Agent is the first language agent framework that achieves successful real-time simultaneous human-AI collaboration autonomously. Code of DPT-Agent can be found in https://github.com/sjtu-marl/DPT-Agent.
Co-authors
- Yong Yu 11
- Weiwen Liu 9
- Kounianhua Du 6
- Xinyi Dai 5
- Jianghao Lin 5
- Ruiming Tang 5
- Yasheng Wang 5
- Ying Wen 4
- Xunliang Cai 3
- Xuezhi Cao 3
- Jizheng Chen 3
- Qingyao Li 3
- Lin Qiu 3
- Jun Wang 3
- Jun Wang 3
- Xihuai Wang 3
- Muning Wen 3
- Weiming Zhang 3
- Jiachen Zhu 3
- Huacan Chai 2
- Renjie Ding 2
- Fei Huang 2
- Xingyu Lou 2
- Xin Peng 2
- Rong Shan 2
- Hairui Wang 2
- ZongYu Wang 2
- Yingxuan Yang 2
- Kangning Zhang 2
- Zhuosheng Zhang 2
- Congmin Zheng 2
- Jiamu Zhou 2
- Bizhe Bai 1
- Fengshuo Bai 1
- Zijie Cao 1
- Yuxiang Chen 1
- Huarong Deng 1
- Chao Feng (冯超) 1
- Lingyue Fu 1
- Te Gao 1
- Hao Guan 1
- Yifu Guo 1
- Zhiguang Han 1
- Ziyi He 1
- Haoyi Hu 1
- Jinbo Hu 1
- Bo Huang 1
- Wenxiang Jiao 1
- Cheng Jin 1
- Chaoran Li 1
- Tingyu Li 1
- Xiangyang Li 1
- Yansi Li 1
- Qiqiang Lin 1
- Jianxing Liu 1
- Xiao Liu 1
- Yuan Lu 1
- Xinbei Ma 1
- Xiaoyun Mo 1
- Zhuoying Ou 1
- Qiuying Peng 1
- Yuxuan Peng 1
- Siyuan Qi 1
- Maolin Ran 1
- Yuanyuan Shi 1
- Junru Song 1
- Weixi Song 1
- Ziyu Wan 1
- Hanjing Wang 1
- Jihong Wang 1
- Teng Wang 1
- Xinbing Wang 1
- Zheng Wu 1
- Yunjia Xi 1
- Wei Xia 1
- Yongzhao Xiao 1
- Zishan Xu 1
- Cheng Yang 1
- Mengyue Yang 1
- Wen Yao 1
- Zhiyuan Yao 1
- Haowei Yuan 1
- Xingshan Zeng 1
- Bolun Zhang 1
- Haoyu Zhang 1
- Shao Zhang 1
- Shuo Zhang 1
- Wenhao Zhang 1
- Haoran Zhao 1
- Zeyu Zheng 1
- Zheli Zhou 1
- Yaoming Zhu 1
- Yuxuan Zhu 1