Dale Johnson


2025

RLHF Algorithms Ranked: An Extensive Evaluation Across Diverse Tasks, Rewards, and Hyperparameters
Lucas Spangher | Rama Kumar Pasumarthi | Nick Masiewicki | William F. Arnold | Aditi Kaushal | Dale Johnson | Peter Grabowski | Eugene Ie
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Large Language Models (LLMs) have demonstrated impressive text generation capabilities, yet their outputs often misalign with human preferences. To address this challenge, Reinforcement Learning from Human Feedback (RLHF) has become an essential component of modern LLM training pipelines. Although Proximal Policy Optimization (PPO) initially emerged as a favored RLHF strategy, its complexity and inefficiency have spurred the investigation of simpler alternatives. This work presents, to the authors’ knowledge, the most comprehensive benchmark to date of seventeen state-of-the-art RLHF algorithms. We evaluate these algorithms on two different benchmarks, OpenAI’s TL;DR Summarization and Anthropic’s Helpfulness / Harmlessness, with two different reward models: a Gemma 2B reward model and a rules-based reward model. We incorporate extensive hyperparameter sweeps for each algorithm. With this expanded analysis, we report the consistently top-performing RLHF algorithms, IPO, DPO, Reinforce, GRPO, and Best-of-N, and list the highest-performing hyperparameter combinations for each. This work aims to guide practitioners in selecting the most effective RLHF algorithm while promoting a culture of thorough and impartial benchmarking in the field.
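As a concrete illustration of one of the top-performing algorithms named in the abstract, below is a minimal Python sketch of the DPO preference loss. It is not the paper's implementation: the function name, tensor arguments, and beta value are illustrative assumptions, and it presumes that summed per-token log-probabilities for the chosen and rejected responses have already been computed by the policy and a frozen reference model.

    # Minimal DPO loss sketch (hypothetical helper, not the authors' code).
    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        """Direct Preference Optimization loss over a batch of preference pairs."""
        # Log-ratio of policy to reference model for each response.
        chosen_logratio = policy_chosen_logps - ref_chosen_logps
        rejected_logratio = policy_rejected_logps - ref_rejected_logps
        # Push the policy to prefer the chosen response more strongly than
        # the reference model does, scaled by beta.
        logits = beta * (chosen_logratio - rejected_logratio)
        return -F.logsigmoid(logits).mean()

    # Example usage with dummy log-probabilities for a batch of 4 pairs.
    if __name__ == "__main__":
        b = 4
        loss = dpo_loss(torch.randn(b), torch.randn(b),
                        torch.randn(b), torch.randn(b))
        print(loss.item())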

1986

Formal Specification of Natural Language Syntax Using Two-Level Grammar
Barrett R. Bryant | Dale Johnson | Balanjaninath Edupuganty
Coling 1986 Volume 1: The 11th International Conference on Computational Linguistics