Di Huang
2026
QiMeng-PRepair: Precise Code Repair via Edit-Aware Reward Optimization
Changxin Ke | Rui Zhang | Jiaming Guo | Yuanbo Wen | Li Ding | Shuo Wang | Xuyuan Zhu | Xiong Peng | Di Huang | Zidong Du | Xing Hu | Qi Guo | Yunji Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) achieve strong program repair performance but often suffer from over-editing, where excessive modifications overwrite correct code and hinder bug localization. We systematically quantify the impact of over-editing and introduce the precise repair task, which maximizes reuse of correct code while fixing only the buggy parts. Building on this insight, we propose PRepair, a framework that mitigates over-editing and improves repair accuracy. PRepair has two components: Self-Breaking, which generates diverse buggy programs via controlled bug injection and min–max sampling, and Self-Repairing, which trains models with Edit-Aware Group Relative Policy Optimization (EA-GRPO) using an edit-aware reward to encourage minimal yet correct edits. Experiments show that PRepair improves repair precision by up to 31.4% under fix1@1, a metric that jointly considers repair correctness and edit extent, and significantly increases decoding throughput when combined with speculative editing, demonstrating its potential for precise and practical code repair.
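The abstract does not give the reward formula, so the following is only a minimal sketch of what an edit-aware reward could look like: a repair that fails the tests earns nothing, and a passing repair is discounted by how much of the buggy program it rewrote. The function names, the line-level similarity measure from `difflib`, and the `penalty` weight are all assumptions for illustration, not the paper's definition.

```python
import difflib


def edit_ratio(buggy: str, repaired: str) -> float:
    """Rough fraction of the program that changed, in [0, 1].

    Assumption: line-level similarity from difflib stands in for
    whatever edit measure the paper actually uses.
    """
    a = buggy.splitlines()
    b = repaired.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return 1.0 - matcher.ratio()


def edit_aware_reward(buggy: str, repaired: str, passes_tests: bool,
                      penalty: float = 0.5) -> float:
    """Hypothetical reward: correctness gated, then discounted by edit size."""
    if not passes_tests:
        return 0.0  # an incorrect repair earns nothing, however small the edit
    return 1.0 - penalty * edit_ratio(buggy, repaired)


buggy = "a = 1\nb = 2\nreturn a - b"
minimal_fix = "a = 1\nb = 2\nreturn a + b"   # one line changed
rewrite = "x = 1\ny = 2\nreturn x + y"       # every line changed

small_edit_reward = edit_aware_reward(buggy, minimal_fix, passes_tests=True)
large_edit_reward = edit_aware_reward(buggy, rewrite, passes_tests=True)
failed_reward = edit_aware_reward(buggy, minimal_fix, passes_tests=False)
```

Under these assumptions a correct one-line fix scores higher than a correct full rewrite, which is the ordering an edit-aware reward is meant to produce.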
Rhombus: Incentivizing Coordination in Parallel Thinking through Reinforcement Learning
Ziyuan Nan | Qi Yi | Di Huang | Yutong Wu | Guanhua Huang | Xue Gong | Kejiao Li | Yuhao Jiang | Chenchen Zhang | Zenan Xu | Xing Hu | Bo Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Parallel thinking offers a promising avenue for scaling test-time compute in Large Language Models (LLMs), enabling them to explore diverse solution paths simultaneously before aggregating them into a final answer. However, coordinating the exploration and aggregation stages remains challenging, as simple aggregation techniques often incur information loss, failing to preserve the subtle, decision-relevant signals generated during exploration. To overcome this, we propose Rhombus, a parallel thinking framework that explicitly incentivizes coordination between components via end-to-end reinforcement learning. Rhombus employs multiple parallel Proposers to generate compact, decision-focused reasoning cues and a central Synthesizer to integrate them into final predictions, utilizing co-training under a shared task reward to align their interaction. Across challenging mathematical reasoning benchmarks, Rhombus improves accuracy by 6.0% over long chain-of-thought baselines while reducing wall-clock latency by 39.4% under matched token budgets. Our work demonstrates that explicit communication optimization is essential for realizing the accuracy and efficiency gains of parallel reasoning.
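As a rough illustration of the Proposer and Synthesizer structure described in this abstract, the sketch below runs several proposers in parallel, hands their cues to a synthesizer, and assigns one shared task reward to every component. The callables are plain stand-ins for language models, and `parallel_think`, `shared_rewards`, and the majority-style synthesizer are hypothetical names chosen for illustration, not the paper's implementation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Proposer = Callable[[str], str]
Synthesizer = Callable[[str, List[str]], str]


def parallel_think(question: str, proposers: List[Proposer],
                   synthesizer: Synthesizer) -> Tuple[str, List[str]]:
    """Run all proposers concurrently, then let the synthesizer combine the cues."""
    with ThreadPoolExecutor(max_workers=len(proposers)) as pool:
        cues = list(pool.map(lambda p: p(question), proposers))
    return synthesizer(question, cues), cues


def shared_rewards(answer: str, reference: str, n_proposers: int) -> List[float]:
    """Assumption: every proposer and the synthesizer get the same task reward."""
    reward = 1.0 if answer == reference else 0.0
    return [reward] * (n_proposers + 1)


# Stand-in components: each "model" is just a function returning a fixed cue.
toy_proposers = [lambda q: "4", lambda q: "4", lambda q: "5"]


def toy_synthesizer(question: str, cues: List[str]) -> str:
    # Picks the most frequent cue; a trained synthesizer would reason over them.
    return Counter(cues).most_common(1)[0][0]


answer, cues = parallel_think("2 + 2 = ?", toy_proposers, toy_synthesizer)
rewards = shared_rewards(answer, reference="4", n_proposers=len(toy_proposers))
```

Because the reward is shared, a proposer is credited only when its cue helps the synthesizer reach the right final answer, which is one simple way to tie the two stages together during training.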
Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
Xiaoyun Zhang | Xiaojian Yuan | Di Huang | Wang You | Chen Hu | Jingqing Ruan | Kejiang Chen | Xing Hu
Findings of the Association for Computational Linguistics: ACL 2026
Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER), a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability. Code is available at https://anonymous.4open.science/r/AER-ACL.
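The three components named in this abstract can be sketched as three small functions. This is an illustrative reading only: the abstract gives no update rules, so the fixed `ratio` of the initial entropy, the additive `step` update, and the choice to scale the coefficient by `1 - pass_rate` (more exploration for harder prompts) are assumptions, as are all function and parameter names.

```python
def target_entropy(initial_entropy: float, ratio: float = 0.5) -> float:
    """Initial-anchored target: a fixed fraction of the starting policy entropy.

    Assumption: `ratio` < 1 reflects the claim that the entropy should sit
    in a moderate range below its initial level.
    """
    return ratio * initial_entropy


def update_global_coef(coef: float, current_entropy: float, target: float,
                       step: float = 0.01, lo: float = 0.0, hi: float = 1.0) -> float:
    """Dynamic global adjustment: raise the coefficient when entropy is below
    the target, lower it otherwise, and clamp to [lo, hi]."""
    coef = coef + step if current_entropy < target else coef - step
    return min(hi, max(lo, coef))


def per_sample_coef(global_coef: float, pass_rate: float) -> float:
    """Difficulty-aware allocation (assumed direction): prompts the policy
    rarely solves get a larger entropy bonus than prompts it always solves."""
    return global_coef * (1.0 - pass_rate)


target = target_entropy(initial_entropy=2.0)             # anchored to the start
raised = update_global_coef(0.10, current_entropy=0.5, target=target)
lowered = update_global_coef(0.10, current_entropy=1.5, target=target)
hard_prompt_coef = per_sample_coef(raised, pass_rate=0.0)
easy_prompt_coef = per_sample_coef(raised, pass_rate=1.0)
```

In a training loop under these assumptions, `update_global_coef` would be called once per step with the measured policy entropy, and `per_sample_coef` would weight the entropy term for each prompt using its observed pass rate.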