2025
RLHF Algorithms Ranked: An Extensive Evaluation Across Diverse Tasks, Rewards, and Hyperparameters
Lucas Spangher | Rama Kumar Pasumarthi | Nick Masiewicki | William F. Arnold | Aditi Kaushal | Dale Johnson | Peter Grabowski | Eugene Ie
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large Language Models (LLMs) have demonstrated impressive text generation capabilities, yet their outputs often misalign with human preferences. To address this challenge, Reinforcement Learning from Human Feedback (RLHF) has become an essential component of modern LLM training pipelines. Although Proximal Policy Optimization (PPO) initially emerged as a favored RLHF strategy, its complexity and inefficiency have spurred the investigation of simpler alternatives. This work presents, to the authors’ knowledge, the most comprehensive benchmark to date of seventeen state-of-the-art RLHF algorithms. We evaluate these algorithms on two benchmarks, OpenAI’s TL;DR Summarization and Anthropic’s Helpfulness / Harmlessness, with two different reward models: a Gemma 2B reward model and a rules-based reward model. We incorporate extensive hyperparameter sweeps for each algorithm. With this expanded analysis, we report the consistently top-performing RLHF algorithms: IPO, DPO, Reinforce, GRPO, and Best-of-N, and list the highest-performing hyperparameter combinations for each. This work aims to guide practitioners in selecting the most effective RLHF algorithm while promoting a culture of thorough and impartial benchmarking in the field.
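As a point of reference for readers unfamiliar with the evaluated methods, below is a minimal sketch of the standard DPO objective, one of the consistently top-performing algorithms reported above. The function name, tensor layout, and beta value are illustrative assumptions, not the paper’s implementation.

```python
# Minimal DPO loss sketch (illustrative only; not the paper's code).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed token log-probabilities for the
    chosen / rejected completions under the policy or the frozen reference.
    """
    # Implicit reward: beta-scaled log-ratio of policy to reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry style logistic loss on the reward margin
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```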
A Novel Multi-Document Retrieval Benchmark: Journalist Source-Selection in Newswriting
Alexander Spangher | Tenghao Huang | Yiqin Huang | Lucas Spangher | Sewon Min | Mark Dredze
Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing
Multi-document retrieval approaches often overlook the ways different retrievals complement each other when addressing complex queries. In this work, we study journalist source selection in news article writing and examine the discourse roles that different sources serve when paired together, finding that discourse function (not simply informational content) is an important component of source usage. Then, we introduce a novel IR task to benchmark how well language models can reason about this narrative process. We extract a journalist’s initial query and the sources they used from news articles and aim to recover the sources that support this query. We demonstrate that large language models (LLMs) can be employed in multi-step query planning, identifying informational gaps and enhancing retrieval performance, but current approaches to interleaving queries fall short. By training auxiliary discourse planners and incorporating this information into LLMs, we enhance query planning, achieving a significant 5% improvement in precision and a 2% increase in F1 score over the previous SOTA, all while maintaining recall.
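To make the query-planning setting concrete, here is a minimal sketch of an interleaved retrieve-then-plan loop under assumed interfaces; `llm` and `retriever` are hypothetical callables, and this is not the authors’ system.

```python
# Interleaved multi-step query planning sketch (assumed interfaces, not the
# authors' system). `llm(prompt) -> str` and `retriever(query, k) -> list[dict]`
# are hypothetical callables supplied by the caller.
def plan_and_retrieve(initial_query, llm, retriever, max_steps=3):
    retrieved, query = [], initial_query
    for _ in range(max_steps):
        retrieved.extend(retriever(query, k=5))
        # Ask the planner what information (or discourse role) is still missing.
        gap = llm(
            "Given the query and the sources retrieved so far, state the single "
            "most important missing piece of information, or 'none'.\n"
            f"Query: {initial_query}\n"
            f"Sources: {[doc.get('title', '') for doc in retrieved]}"
        )
        if gap.strip().lower() == "none":
            break
        query = gap  # use the stated gap as the next retrieval query
    return retrieved
```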
Chatbot Arena Estimate: towards a generalized performance benchmark for LLM capabilities
Lucas Spangher | Tianle Li | William F. Arnold | Nick Masiewicki | Xerxes Dotiwalla | Rama Kumar Pasumarthi | Peter Grabowski | Eugene Ie | Daniel Gruhl
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
In industrial LLM development, evaluating large language models (LLMs) is critical for tasks like benchmarking internal models and detecting regressions during fine-tuning, but existing benchmark aggregation methods, such as Elo-based systems, can be resource-intensive, public-facing, and time-consuming. Here, we describe Chatbot Arena Estimate (CAE), a practical framework for aggregating performance across diverse benchmarks. The framework, developed and widely adopted within our organization, addresses the need for quick, accurate, and cost-efficient evaluations of LLMs. CAE generates two primary metrics: a “Goodness” score (answer accuracy) and a “Fastness” score (cost, or queries per second, QPS). These metrics allow for model ranking both overall and within specific subdomains, enabling informed decisions during model iteration and deployment. We demonstrate CAE’s effectiveness by comparing it with existing benchmarks, including the full Chatbot Arena and the MMLU leaderboard. Notably, our approach achieves a higher Pearson correlation with Chatbot Arena Elo scores than MMLU does, validating its reliability for real-world LLM evaluation.
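The validation step described above reduces to a rank-level comparison: compute an aggregated score per model and correlate it with public Arena Elo. A minimal sketch, with hypothetical placeholder scores, is shown below; it is not the CAE implementation.

```python
# Correlating an aggregated benchmark score with Chatbot Arena Elo
# (hypothetical placeholder numbers; not CAE's actual data or code).
from scipy.stats import pearsonr

goodness = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.68}         # aggregated "Goodness"
arena_elo = {"model_a": 1250.0, "model_b": 1180.0, "model_c": 1120.0}  # public Arena Elo

models = sorted(goodness)
r, p_value = pearsonr([goodness[m] for m in models],
                      [arena_elo[m] for m in models])
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```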