Process-Supervised Reinforcement Learning for Code Generation

Yufan Ye, Ting Zhang, Wenbin Jiang, Hua Huang


Abstract
Existing reinforcement learning (RL) strategies based on outcome supervision have proven effective in enhancing the performance of large language models (LLMs) for code generation. While RL based on process supervision shows great potential for multi-step reasoning tasks, its effectiveness in code generation remains insufficiently explored and verified. The primary obstacle is the resource-intensive nature of constructing a high-quality process-supervised reward dataset, which requires substantial human expertise and computational resources. To overcome this challenge, this paper proposes a “mutation/refactoring-execution verification” strategy: a teacher model mutates and refactors statement lines or blocks, and compiler execution results automatically label them, yielding a process-supervised reward dataset. Based on this dataset, we carry out a series of RL experiments. The results show that, compared with methods relying only on outcome supervision, RL based on process supervision performs better on complex code generation tasks. In addition, this paper is the first to confirm the advantages of Direct Preference Optimization (DPO) for process-supervised RL in code generation, providing new ideas and directions for code generation research.
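To make the labeling scheme in the abstract concrete, the sketch below illustrates one plausible reading of "mutation/refactoring-execution verification": perturb one statement of a known-correct solution, re-run it against unit tests, and let the execution result label that step, producing step-level preference pairs that could feed DPO training. All names here (teacher_mutate, run_tests, solve) and the pair format are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code) of mutation/refactoring-execution verification.
# A teacher model would normally rewrite a statement; here a trivial stand-in replaces it
# with a no-op so the example runs end to end.

def teacher_mutate(lines, idx):
    """Stand-in for a teacher LLM that mutates or refactors line `idx`."""
    indent = lines[idx][: len(lines[idx]) - len(lines[idx].lstrip())]
    mutated = list(lines)
    mutated[idx] = indent + "pass"  # placeholder mutation preserving indentation
    return mutated

def run_tests(code, test_cases):
    """Stand-in for compiler/interpreter verification: execute the program and its tests."""
    env = {}
    try:
        exec(code, env)  # compile and run the candidate program
        return all(env["solve"](x) == y for x, y in test_cases)
    except Exception:
        return False

def build_preference_pairs(solution, test_cases):
    """Pair each original (passing) step with a mutated step whose execution fails;
    such (prefix, chosen_step, rejected_step) triples form a process-supervised dataset."""
    lines = solution.splitlines()
    pairs = []
    for idx in range(len(lines)):
        mutated = teacher_mutate(lines, idx)
        if not run_tests("\n".join(mutated), test_cases):
            prefix = "\n".join(lines[:idx])
            pairs.append((prefix, lines[idx], mutated[idx]))
    return pairs

# Toy usage: a correct solution plus (input, expected output) test cases.
solution = "def solve(x):\n    y = x * 2\n    return y + 1"
tests = [(1, 3), (2, 5)]
print(build_preference_pairs(solution, tests))
```

Under this assumed setup, mutations that still pass all tests are simply discarded, so the compiler's verdict, rather than human annotation, decides which intermediate steps receive reward labels.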
Anthology ID:
2025.emnlp-main.719
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
14224–14237
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.719/
Cite (ACL):
Yufan Ye, Ting Zhang, Wenbin Jiang, and Hua Huang. 2025. Process-Supervised Reinforcement Learning for Code Generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14224–14237, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Process-Supervised Reinforcement Learning for Code Generation (Ye et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.719.pdf
Checklist:
 2025.emnlp-main.719.checklist.pdf