Influence-based Online Experience Selection for Effective RLHF
Yifan Gong, Jing Yao, Xiting Wang, Xunlong Wang, Xiaoyuan Yi, Xing Xie
Abstract
Reinforcement Learning from Human Feedback (RLHF) has emerged as a crucial technique for aligning large language models (LLMs) with human preferences. However, existing RLHF methods face key challenges, including poor sample efficiency, high computational overhead, and slow convergence. Recent studies highlight the importance of data selection in RL, but how to effectively select the most beneficial experiences for RL training remains an open problem. Existing data selection methods for RL rely on heuristic metrics, failing to establish an interpretable connection between data and optimization objectives. To address this problem, we propose InfOES (Influence-based Online Experience Selection), a novel data selection method for RLHF that dynamically estimates the influence of individual training samples on policy optimization. By incorporating data attribution into the policy gradient, InfOES can identify and filter out detrimental samples on the fly, ensuring effective convergence toward alignment objectives. Our approach is compatible with various RL algorithms (e.g., PPO, GRPO, REINFORCE++). Extensive experiments demonstrate that InfOES significantly enhances training effectiveness, achieving superior alignment performance with fewer optimization steps.
- Anthology ID: 2026.acl-long.2206
- Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month: July
- Year: 2026
- Address: San Diego, California, United States
- Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 47755–47771
- URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2206/
- Cite (ACL): Yifan Gong, Jing Yao, Xiting Wang, Xunlong Wang, Xiaoyuan Yi, and Xing Xie. 2026. Influence-based Online Experience Selection for Effective RLHF. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 47755–47771, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal): Influence-based Online Experience Selection for Effective RLHF (Gong et al., ACL 2026)
- PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2206.pdf
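
Illustrative note: the abstract describes estimating each training sample's influence on policy optimization and filtering out detrimental samples online. The abstract does not give the scoring rule, so the sketch below is not the InfOES algorithm. It shows one generic first-order heuristic that could serve the same purpose, under stated assumptions: each sample is scored by the dot product between its loss gradient and the gradient of a reference (target) loss, and only samples with a positive score are kept. The names `influence_scores`, `select_experiences`, `sample_grads`, and `target_grad` are hypothetical, and the gradients are toy vectors rather than real policy gradients.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def influence_scores(
    sample_grads: Sequence[Sequence[float]],
    target_grad: Sequence[float],
) -> List[float]:
    """First-order influence proxy (an assumption, not the paper's estimator).

    A gradient-descent step on sample i moves parameters by -lr * g_i, which
    changes the target loss by roughly -lr * dot(g_i, g_target). A positive
    dot product therefore suggests the sample reduces the target loss.
    """
    return [dot(g, target_grad) for g in sample_grads]


def select_experiences(
    sample_grads: Sequence[Sequence[float]],
    target_grad: Sequence[float],
    threshold: float = 0.0,
) -> List[int]:
    """Return indices of samples whose score exceeds the threshold."""
    scores = influence_scores(sample_grads, target_grad)
    return [i for i, s in enumerate(scores) if s > threshold]


# Toy batch: three per-sample loss gradients and one target-loss gradient.
sample_grads = [[1.0, 0.5], [-1.0, 0.2], [0.3, 0.3]]
target_grad = [1.0, 0.0]

scores = influence_scores(sample_grads, target_grad)
kept = select_experiences(sample_grads, target_grad)
print(scores, kept)
```

In this toy batch the second sample points against the target gradient and is dropped, while the other two are retained for the update. A practical system would need real per-sample gradients (or a cheaper approximation) and a choice of reference objective, neither of which is specified on this page.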