Kechi Zhang
2026
RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
Yihong Dong | Xue Jiang | Yongding Tao | Huanyu Liu | Kechi Zhang | Lili Mou | Rongyu Cao | Yingwei MA | Jue Chen | Binhua Li | Zhi Jin | Fei Huang | Yongbin Li | Ge Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, due to its essentially on-policy strategy coupled with LLM’s immense action space and sparse reward. Critically, RLVR can lead to capability boundary collapse, narrowing the LLM’s problem-solving scope. To address this problem, we propose RL-PLUS, a novel hybrid-policy optimization approach for LLMs that synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components, i.e., Multiple Importance Sampling to address distributional mismatch from external data, and Exploration-Based Advantage Function to guide the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. Compared with existing RLVR methods, RL-PLUS achieves 1) state-of-the-art performance on six math reasoning benchmarks; 2) superior performance on six out-of-distribution reasoning tasks; 3) consistent and significant gains across diverse model families, with average relative improvements up to 69.2%. Moreover, the analysis of Pass@k curves indicates that RL-PLUS effectively resolves the capability boundary collapse problem.
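The Pass@k curves mentioned in the abstract above are commonly computed with the unbiased estimator 1 - C(n-c, k)/C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below implements that standard estimator; it is not code from the paper, and the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples drawn, c: number of correct samples, k: budget.
    This is the widely used combinatorial estimator, assumed here;
    the abstract does not say which estimator the authors used.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Average over problems to get one point of a Pass@k curve.
    correct_counts = [0, 1, 5, 10]  # hypothetical per-problem counts, n = 10
    for k in (1, 5, 10):
        score = sum(pass_at_k(10, c, k) for c in correct_counts) / len(correct_counts)
        print(k, round(score, 3))
```

Plotting this value against k for a base model and its RL-trained variant is one way to see whether the trained model's curve falls below the base model's at large k, which is the symptom the abstract calls capability boundary collapse.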
SWE-Swiss: A Multi-Task Fine-Tuning and RL Recipe for High-Performance Issue Resolution
Zhenyu He | Qingping Yang | Wei Shen | Xiaojian Zhong | Kechi Zhang | Chenxin An | Wenlei Shi | Tianle Cai | Di He | Jiaze Chen | Jingjing Xu
Findings of the Association for Computational Linguistics: ACL 2026
Automated software engineering, particularly resolving real-world issues on benchmarks like SWE-bench, remains a significant challenge for Large Language Models (LLMs). To address this, we introduce SWE-Swiss, a two-phase training recipe that systematically develops these capabilities. Our approach first decomposes issue resolution into three core skills: Localization, Repair, and Unit Test Generation. In the first phase, we perform multi-task Supervised Fine-Tuning (SFT) on three new, meticulously curated datasets to build a versatile foundation. The second phase applies targeted Reinforcement Learning (RL), using direct feedback from test execution to boost the critical skill of code repair. The resulting model, SWE-Swiss-32B, establishes a new state-of-the-art for open-source models in its size class, achieving a 60.2% score on the SWE-bench Verified benchmark and placing it in the same top-tier performance bracket as much larger models. Finally, we show that despite its specialized training, SWE-Swiss-32B demonstrates strong generalization to other common LLM benchmarks. To accelerate research in the community, we are open-sourcing the models and our complete training datasets.
KoCo-Bench: Can Large Language Models Leverage Domain Knowledge in Software Development?
Xue Jiang | Ge Li | Jiaru Qian | Xianjie Shi | Chenjie Li | Hao Zhu | Ziyu Wang | Jielun Zhang | Zeyu Zhao | Kechi Zhang | Jia Li | Wenpin Jiao | Zhi Jin | Yihong Dong
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) excel at general programming but struggle with domain-specific software development. This gap motivates research into domain specialization methods that enable LLMs to learn and utilize domain knowledge and data. However, existing domain-specific code benchmarks focus on assessing what knowledge LLMs possess rather than how they acquire and apply new knowledge, lacking explicit knowledge corpora for developing domain specialization methods. To this end, we present KOCO-bench, a novel benchmark designed for evaluating domain specialization methods in real-world software development. KOCO-bench contains 6 emerging domains with 11 software frameworks and 25 projects, featuring curated knowledge corpora alongside multi-granularity evaluation tasks including domain code generation (from function-level to project-level with rigorous test suites) and domain knowledge understanding (via multiple-choice Q&A). Unlike previous benchmarks that only provide test sets for direct evaluation, KOCO-bench requires acquiring and applying diverse domain knowledge (APIs, rules, constraints, etc.) from the corpora to solve evaluation tasks. Our evaluations reveal that KOCO-bench poses significant challenges to state-of-the-art LLMs. Even with domain specialization methods (e.g., SFT, RAG, kNN-LM) applied, improvements remain marginal. The best-performing coding agent, Claude Code, achieves only 34.2%, highlighting the urgent need for more effective domain specialization methods. We release KOCO-bench, evaluation code, and baselines to advance further research at https://github.com/jiangxxxue/KOCO-bench.
2025
Revisit Self-Debugging with Self-Generated Tests for Code Generation
Xiancai Chen | Zhengwei Tao | Kechi Zhang | Changzhi Zhou | Xinyu Zhang | Wanli Gu | Yuanpeng He | Mengdi Zhang | Xunliang Cai | Haiyan Zhao | Zhi Jin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have demonstrated significant advancements in code generation, yet they still face challenges when tackling tasks that extend beyond their basic capabilities. Recently, the concept of self-debugging has been proposed as a way to enhance code generation performance by leveraging execution feedback from tests. However, the availability of high-quality tests in real-world scenarios is often limited. In this context, self-debugging with self-generated tests emerges as a promising solution, though its limitations and practical potential have not been fully explored. To address this gap, we investigate the efficacy of self-debugging in code generation tasks. We propose and analyze two distinct paradigms for the self-debugging process: post-execution and in-execution self-debugging. Our findings reveal that post-execution self-debugging struggles with the test bias introduced by self-generated tests, which can lead to misleading feedback. In contrast, in-execution self-debugging enables LLMs to mitigate this bias and leverage intermediate states during program execution. By focusing on runtime information rather than relying solely on potentially flawed self-generated tests, this approach demonstrates significant promise for improving the robustness and accuracy of LLMs in code generation tasks.
CodeDPO: Aligning Code Models with Self Generated and Verified Source Code
Kechi Zhang | Ge Li | Yihong Dong | Jingjing Xu | Jun Zhang | Jing Su | Yongfei Liu | Zhi Jin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code generation models have shown significant potential for programming tasks. However, existing training methods like supervised fine-tuning face key limitations: they do not effectively teach models to prioritize correct over incorrect solutions in ambiguous situations, nor do they effectively optimize the runtime efficiency of the generated code. To address these challenges, we propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency. CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases. The underlying assumption is that test cases executable by multiple code snippets provide more reliable validation, and code that passes more tests is more likely to be correct. Through this self-validation process, our PageRank-inspired algorithm iteratively updates the ranking score of each code snippet, ultimately creating a code preference optimization dataset based on correctness and efficiency. CodeDPO is flexible and scalable, generating diverse preference optimization data without depending on powerful models such as GPT-4. Through comprehensive evaluations of five widely used benchmarks, CodeDPO demonstrates significant improvements in correctness and efficiency compared to existing methods. Our experiments prove that CodeDPO enhances the capabilities of LLMs in code generation and provides a robust foundation for conducting code preference optimization in more complex and challenging real-world scenarios.
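The mutual-reinforcement idea in the abstract above (tests passed by many snippets are more trustworthy, and snippets passing trustworthy tests rank higher) can be sketched as an iterative score update. This is only an illustration under stated assumptions: the update rule, the `damping` value, the sum-to-one normalization, and the name `rank_code_and_tests` are all hypothetical and are not taken from the paper.

```python
def _normalize(values):
    """Scale scores to sum to 1 so they stay bounded across iterations."""
    total = sum(values)
    return [v / total for v in values] if total else values


def rank_code_and_tests(passes, damping=0.85, iterations=20):
    """Iteratively score code snippets and tests from a pass matrix.

    passes[i][j] is True when snippet i passes test j.
    Assumed rule: each score is a damped sum of the scores on the
    other side of every passing (snippet, test) pair.
    """
    n_code = len(passes)
    n_test = len(passes[0]) if passes else 0
    code_scores = [1.0] * n_code
    test_scores = [1.0] * n_test
    for _ in range(iterations):
        new_code = [
            (1 - damping)
            + damping * sum(test_scores[j] for j in range(n_test) if passes[i][j])
            for i in range(n_code)
        ]
        new_test = [
            (1 - damping)
            + damping * sum(code_scores[i] for i in range(n_code) if passes[i][j])
            for j in range(n_test)
        ]
        code_scores = _normalize(new_code)
        test_scores = _normalize(new_test)
    return code_scores, test_scores


if __name__ == "__main__":
    # Hypothetical pass matrix: 3 snippets, 3 tests.
    matrix = [
        [True, True, True],
        [True, False, False],
        [False, False, False],
    ]
    code, tests = rank_code_and_tests(matrix)
    print(code, tests)
```

Under these assumptions, preference pairs could then be formed by pairing a high-scoring snippet with a low-scoring one for the same prompt.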
Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points
Kechi Zhang | Ge Li | Jia Li | Yihong Dong | Jia Li | Zhi Jin
Findings of the Association for Computational Linguistics: ACL 2025
Code generation models have shown significant potential for automating programming tasks. However, the challenge of generating accurate and reliable code persists due to the highly complex and long-reasoning nature of the task. Even state-of-the-art models often fail in code generation due to small errors, which can drastically affect the overall functionality of code. Our study identifies that current models tend to produce errors concentrated at specific error-prone points, which significantly impacts the accuracy of the generated code. To address this issue, we introduce Focused-DPO, a framework that enhances code generation by directing preference optimization towards these critical error-prone areas. This approach builds on Direct Preference Optimization, emphasizing accuracy in parts prone to errors. Additionally, we develop a method called Error-Point Identification, which constructs a dataset that targets these problematic points without requiring costly human annotations. Our experiments on benchmarks such as HumanEval(+), MBPP(+), and LiveCodeBench demonstrate that Focused-DPO significantly improves the precision and reliability of code generation, reducing common errors and enhancing overall code quality. By focusing on error-prone points, Focused-DPO advances the accuracy and functionality of model-generated code.
Benchmarking Long-Context Language Models on Long Code Understanding
Jia Li | Xuyuan Guo | Lei Li | Kechi Zhang | Ge Li | Jia Li | Zhengwei Tao | Fang Liu | Chongyang Tao | Yuqi Zhu | Zhi Jin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Current advanced long-context language models offer great potential for real-world software engineering applications. However, progress in this critical domain remains hampered by a fundamental limitation: the absence of a rigorous evaluation framework for long code understanding. To bridge this gap, we propose a long code understanding benchmark LongCodeU from four aspects (8 tasks) to evaluate LCLMs’ long code understanding ability required for practical applications, including code unit perception, intra-code unit understanding, inter-code unit relation understanding, and long code documentation understanding. We evaluate 9 popular LCLMs on LongCodeU (i.e., 6 general models and 3 code models). Our experimental results reveal key limitations in current LCLMs’ capabilities for long code understanding. Particularly, the performance of LCLMs drops dramatically when the long code length is greater than 32K, falling far short of their claimed 128K to 1M context windows. In the four aspects, inter-code unit relation understanding is the most challenging for LCLMs. Our study provides valuable insights for optimizing LCLMs and driving advancements in software engineering.
2024
HiRoPE: Length Extrapolation for Code Models Using Hierarchical Position
Kechi Zhang | Ge Li | Huangzhao Zhang | Zhi Jin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Addressing the limitation of context length in large language models for code-related tasks is the primary focus of this paper. Existing LLMs are constrained by their pre-trained context lengths, leading to performance issues in handling long complex code sequences. Inspired by how human programmers navigate code, we introduce Hierarchical Rotary Position Embedding (HiRoPE), a novel approach that enhances the traditional rotary position embedding into a hierarchical format based on the hierarchical structure of source code. HiRoPE offers easy integration into existing LLMs without extra training costs. Our method is extensively evaluated with various LLMs, demonstrating stable performance in tasks such as language modeling and long code completion. We also introduce a new long code understanding task with real-world code projects, in hopes of promoting further development in this code-related field. Theoretically and experimentally, we find that HiRoPE also addresses the out-of-distribution issue in position encoding. Our HiRoPE significantly expands the context length capabilities of LLMs, enabling inference at lengths exponentially greater than the training length.
CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges
Kechi Zhang | Jia Li | Ge Li | Xianjie Shi | Zhi Jin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have shown promise in automated code generation but typically excel only in simpler tasks such as generating standalone code units. However, real-world software development often involves complex code repositories with complex dependencies and extensive documentation. To enable LLMs to handle real-world repo-level code generation, we present CodeAgent, a novel LLM-based agent framework that employs external tools for effective repo-level code generation. CodeAgent integrates five programming tools, enabling interaction with software artifacts for information retrieval, code implementation, and code testing. We implement four agent strategies to optimize these tools’ usage. To the best of our knowledge, CodeAgent is the first agent tool framework specifically for repo-level code generation. In order to measure the effectiveness of our method at the repository level, we have introduced a benchmark dataset CodeAgentBench. The performance on this dataset shows a significant improvement brought by our method, with improvements of pass rate ranging from 2.0 to 15.8. Further tests on the HumanEval benchmark confirm CodeAgent’s adaptability and efficacy across various code generation tasks. Notably, CodeAgent outperforms commercial products like GitHub Copilot, showcasing superior accuracy and efficiency. These results demonstrate CodeAgent’s robust capabilities in code generation, highlighting its potential for real-world repo-level coding challenges.
2023
Self-Edit: Fault-Aware Code Editor for Code Generation
Kechi Zhang | Zhuo Li | Jia Li | Ge Li | Zhi Jin
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have demonstrated an impressive ability to generate code on competitive programming tasks. However, with limited sample numbers, LLMs still suffer from poor accuracy. Inspired by the process of human programming, we propose a generate-and-edit approach named Self-Edit that utilizes execution results of the generated code from LLMs to improve the code quality on the competitive programming task. We execute the generated code on the example test case provided in the question and wrap execution results into a supplementary comment. Utilizing this comment as guidance, our fault-aware code editor is employed to correct errors in the generated code. We perform extensive evaluations across two competitive programming datasets with nine different LLMs. Compared to directly generating from LLMs, our approach can improve the average of pass@1 by 89% on APPS-dev, 31% on APPS-test, and 48% on HumanEval over nine popular code generation LLMs with parameter sizes ranging from 110M to 175B. Compared to other post-processing methods, our method demonstrates superior accuracy and efficiency.
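The step described in the abstract above, running generated code on the example test case and wrapping the result into a supplementary comment, might look roughly like the following. This is a minimal sketch assuming the generated program is Python and reads from standard input; the function name `build_feedback_comment` and the comment wording are hypothetical, since the abstract does not give the paper's actual template.

```python
import subprocess
import sys


def build_feedback_comment(code: str, test_input: str, expected_output: str,
                           timeout: float = 5.0) -> str:
    """Run `code` on one example test and summarize the outcome as a comment.

    The comment format is an assumption made for illustration only.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "# Feedback: time limit exceeded on the example test case."

    if result.returncode != 0:
        # Keep only the last traceback line, e.g. "ValueError: boom".
        stderr = result.stderr.strip()
        last_line = stderr.splitlines()[-1] if stderr else "unknown error"
        return f"# Feedback: runtime error on the example test case: {last_line}"

    actual = result.stdout.strip()
    expected = expected_output.strip()
    if actual == expected:
        return "# Feedback: passed the example test case."
    return (
        f"# Feedback: wrong answer. Input: {test_input!r}, "
        f"expected: {expected!r}, got: {actual!r}"
    )


if __name__ == "__main__":
    generated = "print(int(input()) + 1)"  # hypothetical model output
    print(build_feedback_comment(generated, "3\n", "6"))
```

The returned comment would then be attached to the generated program and passed to the editor model as guidance for the repair step.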
Co-authors
- Zhi Jin 9
- Ge Li 8
- Yihong Dong 4
- Jia Li 4
- Xue Jiang 2
- Jia Li 2
- Xianjie Shi 2
- Zhengwei Tao 2
- Jingjing Xu 2
- Chenxin An 1
- Xunliang Cai 1
- Tianle Cai 1
- Rongyu Cao 1
- Xiancai Chen 1
- Jue Chen 1
- Jiaze Chen 1
- Wanli Gu 1
- Xuyuan Guo 1
- Yuanpeng He 1
- Zhenyu He 1
- Di He 1
- Fei Huang 1
- Wenpin Jiao 1
- Binhua Li 1
- Yongbin Li 1
- Chenjie Li 1
- Jia Li 1
- Lei Li 1
- Zhuo Li 1
- Huanyu Liu 1
- Yongfei Liu 1
- Fang Liu (刘芳) 1
- Yingwei MA 1
- Lili Mou 1
- Jiaru Qian 1
- Wei Shen 1
- Wenlei Shi 1
- Jing Su 1
- Yongding Tao 1
- Chongyang Tao 1
- Ziyu Wang 1
- Qingping Yang 1
- Xinyu Zhang 1
- Mengdi Zhang 1
- Huangzhao Zhang 1
- Jun Zhang 1
- Jielun Zhang 1
- Haiyan Zhao 1
- Zeyu Zhao 1
- Xiaojian Zhong 1
- Changzhi Zhou 1
- Hao Zhu 1
- Yuqi Zhu 1