A Comprehensive Survey on Learning from Rewards for Large Language Models: Reward Models and Learning Strategies

Xiaobao Wu


Abstract
Recent developments in Large Language Models (LLMs) have shifted the focus from pre-training scaling to post-training and test-time scaling. Across these developments, a key unifying paradigm has emerged: Learning from Rewards, in which reward signals act as guiding stars to steer LLM behavior. This paradigm underpins a wide range of prevalent techniques, such as reinforcement learning (RLHF, RLAIF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Crucially, it enables the transition from passive learning on static data to active learning from dynamic feedback, endowing LLMs with aligned preferences and deep reasoning capabilities across diverse tasks. In this survey, we present a comprehensive overview of learning from rewards from the perspective of reward models and learning strategies across the training, inference, and post-inference stages. We further discuss benchmarks for reward models and their primary applications. Finally, we highlight open challenges and future directions.
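Among the techniques the abstract names, Direct Preference Optimization (DPO) gives perhaps the most compact illustration of learning from rewards: the reward model is folded into a single contrastive objective over a preferred response y_w and a dispreferred response y_l. The formulation below is the standard published DPO objective (Rafailov et al., 2023), reproduced here only for orientation; it is not notation drawn from this survey:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

Here \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} is a frozen reference policy, \sigma is the logistic sigmoid, and \beta controls how far the policy may deviate from the reference.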
Anthology ID:
2025.findings-emnlp.970
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
17847–17875
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.970/
DOI:
10.18653/v1/2025.findings-emnlp.970
Cite (ACL):
Xiaobao Wu. 2025. A Comprehensive Survey on Learning from Rewards for Large Language Models: Reward Models and Learning Strategies. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 17847–17875, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
A Comprehensive Survey on Learning from Rewards for Large Language Models: Reward Models and Learning Strategies (Wu, Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.970.pdf
Checklist:
2025.findings-emnlp.970.checklist.pdf