RLHF Algorithms Ranked: An Extensive Evaluation Across Diverse Tasks, Rewards, and Hyperparameters
Lucas Spangher, Rama Kumar Pasumarthi, Nick Masiewicki, William F. Arnold, Aditi Kaushal, Dale Johnson, Peter Grabowski, Eugene Ie
Abstract
Large Language Models (LLMs) have demonstrated impressive text generation capabilities, yet their outputs often misalign with human preferences. To address this challenge, Reinforcement Learning from Human Feedback (RLHF) has become an essential component of modern LLM training pipelines. Although Proximal Policy Optimization (PPO) initially emerged as a favored RLHF strategy, its complexity and inefficiency have spurred the investigation of simpler alternatives. This work presents, to the authors' knowledge, the most comprehensive benchmark to date of seventeen state-of-the-art RLHF algorithms. We evaluate these algorithms on two benchmarks, OpenAI's TL;DR Summarization and Anthropic's Helpfulness/Harmlessness, with two different reward models: a Gemma 2B reward model and a rules-based reward model. We incorporate extensive hyperparameter sweeps for each algorithm. With this expanded analysis, we report the consistently top-performing RLHF algorithms, IPO, DPO, Reinforce, GRPO, and Best-of-N, and list the highest-performing hyperparameter combinations for each. This work aims to guide practitioners in selecting the most effective RLHF algorithm while promoting a culture of thorough and impartial benchmarking in the field.
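For orientation, a minimal sketch of the Direct Preference Optimization (DPO) objective, one of the consistently top-performing algorithms named above, is given below. The per-sequence log-probability inputs and the default `beta` value are illustrative assumptions, not settings reported in the paper's hyperparameter sweeps.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective (Rafailov et al., 2023).

    Inputs are per-sequence log-probabilities (summed token log-probs)
    of the preferred ("chosen") and dispreferred ("rejected") completions
    under the trainable policy and a frozen reference model. `beta`
    controls the implicit KL penalty; 0.1 is an illustrative default,
    not a value taken from this paper's sweeps.
    """
    # Margin by which the policy prefers the chosen completion ...
    policy_margin = policy_chosen_logps - policy_rejected_logps
    # ... compared with the same margin under the frozen reference model.
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Logistic loss on the difference: the policy is pushed to widen its
    # preference margin relative to the reference model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```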
- Anthology ID: 2025.emnlp-industry.35
- Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
- Month: November
- Year: 2025
- Address: Suzhou (China)
- Editors: Saloni Potdar, Lina Rojas-Barahona, Sebastien Montella
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 518–529
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-industry.35/
- Cite (ACL): Lucas Spangher, Rama Kumar Pasumarthi, Nick Masiewicki, William F. Arnold, Aditi Kaushal, Dale Johnson, Peter Grabowski, and Eugene Ie. 2025. RLHF Algorithms Ranked: An Extensive Evaluation Across Diverse Tasks, Rewards, and Hyperparameters. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 518–529, Suzhou (China). Association for Computational Linguistics.
- Cite (Informal): RLHF Algorithms Ranked: An Extensive Evaluation Across Diverse Tasks, Rewards, and Hyperparameters (Spangher et al., EMNLP 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-industry.35.pdf