Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint

Zhipeng Chen, Kun Zhou, Xin Zhao, Junchen Wan, Fuzheng Zhang, Di Zhang, Ji-Rong Wen


Abstract
Reinforcement learning (RL) has been widely used in training large language models (LLMs) to prevent unexpected outputs, e.g., reducing harmfulness and errors. However, existing RL methods mainly adopt instance-level rewards, which cannot provide fine-grained supervision for complex reasoning tasks; as a result, RL training cannot identify the specific part or step that actually causes the incorrectness of a model response. To address this issue, we propose a new RL method named RLMEC, which incorporates a generative model as the reward model. The reward model is trained on an erroneous-solution-rewriting task under a minimum editing constraint and can produce token-level supervision for RL training. Based on the generative reward model, we design a token-level RL objective for training and an imitation-based regularization for stabilizing the RL process. These two objectives focus on revising the key tokens of an erroneous solution, reducing the effect of other unimportant tokens. Experimental results on 8 tasks demonstrate the effectiveness of our approach. Our code and data will be publicly released.
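The abstract's core mechanism, weighting each response token's contribution to the RL objective by a reward from a generative reward model, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' implementation; the function name token_level_rl_loss, the tensor shapes, and the way token_rewards is derived (e.g., from the rewriter's per-token keep/edit decisions) are all assumptions made here for illustration.

```python
import torch
import torch.nn.functional as F

def token_level_rl_loss(policy_logits, response_ids, token_rewards,
                        rewrite_ids=None, imitation_coef=0.1):
    """Hypothetical sketch of a token-level RL objective (not the paper's code).

    policy_logits: (batch, seq, vocab) logits of the policy LLM on its response
    response_ids:  (batch, seq) sampled response tokens
    token_rewards: (batch, seq) per-token rewards from a generative reward
                   model, e.g. high where the rewriter keeps a token and low
                   where it edits one (assumed pre-computed)
    rewrite_ids:   (batch, seq) optional tokens of the rewritten (corrected)
                   solution, used for imitation-based regularization
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    # Log-probability the policy assigns to each token it actually generated.
    token_log_probs = log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)

    # Token-level policy gradient: tokens the reward model marks as erroneous
    # (low/negative reward) are pushed down; correct tokens are reinforced.
    pg_loss = -(token_rewards * token_log_probs).mean()

    loss = pg_loss
    if rewrite_ids is not None:
        # Imitation-based regularization: on tokens the rewriter changed
        # (negative reward, by assumption), imitate the corrected token.
        edited = (token_rewards < 0).float()
        imitation = F.cross_entropy(
            policy_logits.transpose(1, 2), rewrite_ids, reduction="none")
        loss = loss + imitation_coef * (edited * imitation).mean()
    return loss

if __name__ == "__main__":
    batch, seq, vocab = 2, 8, 50
    logits = torch.randn(batch, seq, vocab, requires_grad=True)
    resp = torch.randint(vocab, (batch, seq))
    rewards = torch.randn(batch, seq).clamp(-1, 1)  # stand-in rewards
    rewrite = torch.randint(vocab, (batch, seq))
    print(token_level_rl_loss(logits, resp, rewards, rewrite).item())
```

In the actual method, the per-token rewards and the edited-token mask would come from the trained rewriter under the minimum editing constraint; here they are random placeholders purely to exercise the function.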
Anthology ID:
2024.findings-acl.338
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5694–5711
URL:
https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-acl.338/
DOI:
10.18653/v1/2024.findings-acl.338
Cite (ACL):
Zhipeng Chen, Kun Zhou, Xin Zhao, Junchen Wan, Fuzheng Zhang, Di Zhang, and Ji-Rong Wen. 2024. Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint. In Findings of the Association for Computational Linguistics: ACL 2024, pages 5694–5711, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint (Chen et al., Findings 2024)
PDF:
https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-acl.338.pdf