Influence-based Online Experience Selection for Effective RLHF

Yifan Gong, Jing Yao, Xiting Wang, Xunlong Wang, Xiaoyuan Yi, Xing Xie


Abstract
Reinforcement Learning from Human Feedback (RLHF) has emerged as a crucial technique for aligning large language models (LLMs) with human preferences. However, existing RLHF methods face key challenges, including poor sample efficiency, high computational overhead, and slow convergence. Recent studies highlight the importance of data selection in RL, but how to effectively select the most beneficial experiences for RL training remains an open problem. Existing data selection methods for RL rely on heuristic metrics, failing to establish an interpretable connection between data and optimization objectives. To address this problem, we propose InfOES (Influence-based Online Experience Selection), a novel data selection method for RLHF that dynamically estimates the influence of individual training samples on policy optimization. By incorporating data attribution into the policy gradient, InfOES can identify and filter out detrimental samples on the fly, ensuring effective convergence toward alignment objectives. Our approach is compatible with various RL algorithms (e.g., PPO, GRPO, REINFORCE++). Extensive experiments demonstrate that InfOES significantly enhances training effectiveness, achieving superior alignment performance with fewer optimization steps.
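The abstract does not say how InfOES computes influence, so the following is only a minimal sketch of the general idea of online, influence-based filtering. It assumes a common first-order proxy: a sample's influence is the dot product between its policy-gradient contribution and a reference gradient for the alignment objective. The function names, the zero threshold, and the toy gradients are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def influence_scores(sample_grads: Sequence[Sequence[float]],
                     reference_grad: Sequence[float]) -> List[float]:
    # ASSUMPTION: first-order proxy, not the paper's estimator.
    # A positive score means the sample's gradient points in a
    # direction that also improves the reference objective.
    return [dot(g, reference_grad) for g in sample_grads]


def select_experiences(sample_grads: Sequence[Sequence[float]],
                       reference_grad: Sequence[float],
                       threshold: float = 0.0) -> List[int]:
    # Keep indices of samples whose estimated influence exceeds the
    # (hypothetical) threshold; the rest are treated as detrimental.
    scores = influence_scores(sample_grads, reference_grad)
    return [i for i, s in enumerate(scores) if s > threshold]


def filtered_mean_gradient(sample_grads: Sequence[Sequence[float]],
                           keep: Sequence[int]) -> List[float]:
    # Average only the retained per-sample gradients; return zeros
    # if every sample in the batch was filtered out.
    dim = len(sample_grads[0])
    if not keep:
        return [0.0] * dim
    return [sum(sample_grads[i][d] for i in keep) / len(keep)
            for d in range(dim)]


# Toy batch of three per-sample gradients in a 2-parameter model.
grads = [[1.0, 0.0], [-1.0, 0.5], [0.5, 0.5]]
reference = [1.0, 1.0]
kept = select_experiences(grads, reference)
update = filtered_mean_gradient(grads, kept)
```

In this toy batch the second sample has a negative score and is dropped before the update is averaged. A real RLHF pipeline would compute per-sample gradients (or a cheaper approximation of them) inside the PPO, GRPO, or REINFORCE++ update step; how InfOES does this is described in the paper itself, not here.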
Anthology ID:
2026.acl-long.2206
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
47755–47771
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2206/
Cite (ACL):
Yifan Gong, Jing Yao, Xiting Wang, Xunlong Wang, Xiaoyuan Yi, and Xing Xie. 2026. Influence-based Online Experience Selection for Effective RLHF. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 47755–47771, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Influence-based Online Experience Selection for Effective RLHF (Gong et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2206.pdf
Checklist:
 2026.acl-long.2206.checklist.pdf