Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint
Zhipeng Chen, Kun Zhou, Xin Zhao, Junchen Wan, Fuzheng Zhang, Di Zhang, Ji-Rong Wen
Abstract
Reinforcement learning (RL) has been widely used in training large language models (LLMs) to prevent unexpected outputs, e.g., to reduce harmfulness and errors. However, existing RL methods mainly adopt instance-level rewards, which cannot provide fine-grained supervision for complex reasoning tasks; as a result, RL training cannot identify the specific part or step that actually causes the incorrectness of a model response. To address this, we propose a new RL method named RLMEC that incorporates a generative model as the reward model. The reward model is trained on an erroneous-solution-rewriting task under a minimum editing constraint, and can thus produce token-level supervision for RL training. Based on the generative reward model, we design a token-level RL objective for training and an imitation-based regularization for stabilizing the RL process. These two objectives focus on revising the key tokens of an erroneous solution, reducing the influence of other, unimportant tokens. Experimental results on 8 tasks demonstrate the effectiveness of our approach. Our code and data will be publicly released.
- Anthology ID: 2024.findings-acl.338
- Volume: Findings of the Association for Computational Linguistics: ACL 2024
- Month: August
- Year: 2024
- Address: Bangkok, Thailand
- Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 5694–5711
- URL: https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-acl.338/
- DOI: 10.18653/v1/2024.findings-acl.338
- Cite (ACL): Zhipeng Chen, Kun Zhou, Xin Zhao, Junchen Wan, Fuzheng Zhang, Di Zhang, and Ji-Rong Wen. 2024. Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint. In Findings of the Association for Computational Linguistics: ACL 2024, pages 5694–5711, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal): Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint (Chen et al., Findings 2024)
- PDF: https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-acl.338.pdf
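To make the abstract's token-level objective concrete, below is a minimal sketch of the general technique it describes: a policy-gradient loss in which each token carries its own reward weight (e.g., high where a reward model would edit the output), plus an imitation (cross-entropy) term toward the rewritten solution that stabilizes training. This is an illustrative reconstruction under stated assumptions, not the paper's exact formulation: the function name `token_level_rl_loss`, the tensor shapes, the source of `token_rewards`, and the mixing weight `beta` are all hypothetical.

```python
import torch
import torch.nn.functional as F

def token_level_rl_loss(logits, target_ids, token_rewards, beta=0.1):
    """Token-weighted policy-gradient loss with an imitation regularizer.

    logits:        (batch, seq_len, vocab)  policy logits at each position
    target_ids:    (batch, seq_len)         tokens of the rewritten solution
    token_rewards: (batch, seq_len)         per-token rewards, e.g. high where
                                            the reward model edited the output
    beta:          weight of the imitation (cross-entropy) term (assumed value)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability the policy assigns to each target token.
    tgt_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    # REINFORCE-style term: weight each token's log-prob by its own reward,
    # so erroneous spans dominate the gradient and correct spans barely move.
    pg_loss = -(token_rewards.detach() * tgt_logp).mean()
    # Imitation term: plain cross-entropy toward the rewritten solution,
    # included here as a stabilizing regularizer.
    im_loss = -tgt_logp.mean()
    return pg_loss + beta * im_loss

# Toy usage with random tensors, just to show the shapes involved.
if __name__ == "__main__":
    B, T, V = 2, 8, 100
    logits = torch.randn(B, T, V, requires_grad=True)
    targets = torch.randint(0, V, (B, T))
    rewards = torch.rand(B, T)  # stand-in for generative-reward-model output
    loss = token_level_rl_loss(logits, targets, rewards)
    loss.backward()
    print(loss.item())
```

The key design point the sketch illustrates is the contrast with instance-level RL: instead of one scalar reward scaling the whole sequence's log-probability, each position is scaled independently, so gradient mass concentrates on the tokens that need revision.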