Taywon Min
2025
Understanding Impact of Human Feedback via Influence Functions
Taywon Min
|
Haeone Lee
|
Yongchan Kwon
|
Kimin Lee
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback to align large language models (LLMs) with human intentions. However, human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses. Such feedback can lead to misaligned reward signals, potentially causing unintended side effects during the RLHF process. To address these challenges, we explore the use of influence functions to measure the impact of human feedback on the performance of reward models. We propose a compute-efficient approximation method that enables the application of influence functions to LLM-based reward models and large-scale preference datasets. Our experiments showcase two key applications of influence functions: (1) detecting common labeler biases in human feedback datasets and (2) guiding labelers in refining their strategies to better align with expert feedback. By quantifying the impact of human feedback, we believe that influence functions can enhance feedback interpretability and contribute to scalable oversight in RLHF, helping labelers provide more accurate and consistent feedback. Source code is available at https://github.com/mintaywon/IF_RLHF.