Direct Judgement Preference Optimization
PeiFeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, Shafiq Joty
Abstract
To meet the increasing need for timely and accurate evaluation of large language model (LLM) responses, training LLM-as-judges to evaluate and critique other model responses has emerged as a popular paradigm. However, existing judge models are largely trained with supervised finetuning (SFT) on small data scales to perform limited types of evaluation tasks, fundamentally limiting generalization. To meet the need for strong, generalized judge models, we explore training foundational judge models at large data scales (680K examples) with direct preference optimization (DPO). Using four training tasks, we form three types of DPO preference pairs targeting different aspects of evaluation: generating meaningful critiques, making accurate judgements, and understanding what comprises good and bad responses. To demonstrate the effectiveness of our method, we train judge models of three sizes (8B, 12B, and 70B parameters) and evaluate on a comprehensive suite of 13 benchmarks (7 pairwise, 4 single rating, and 2 classification). Our models achieve the best aggregate performance, with even our 8B model outperforming GPT-4o on pairwise benchmarks. Further analysis shows that our judge models produce factual and actionable critiques and serve as strong foundational judges for continued finetuning.
- Anthology ID:
- 2025.emnlp-main.103
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1979–2009
- URL:
- https://preview.aclanthology.org/lei-li-partial-disambiguation/2025.emnlp-main.103/
- Cite (ACL):
- PeiFeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, and Shafiq Joty. 2025. Direct Judgement Preference Optimization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1979–2009, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Direct Judgement Preference Optimization (Wang et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/lei-li-partial-disambiguation/2025.emnlp-main.103.pdf
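As background for the abstract's method: the paper trains judge models with direct preference optimization (DPO) on chosen/rejected pairs. Below is a minimal sketch of the *standard* DPO loss (Rafailov et al., 2023) on a single preference pair; the paper's exact training setup, tasks, and pair construction may differ, and all function and variable names here are illustrative, not from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically plain logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(
    logp_chosen_policy: float,
    logp_rejected_policy: float,
    logp_chosen_ref: float,
    logp_rejected_ref: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under a frozen reference model. The loss is
    -log sigmoid(beta * (policy_margin - reference_margin)): it shrinks as
    the policy prefers the chosen response more strongly than the
    reference model does.
    """
    chosen_margin = logp_chosen_policy - logp_chosen_ref
    rejected_margin = logp_rejected_policy - logp_rejected_ref
    return -math.log(sigmoid(beta * (chosen_margin - rejected_margin)))


# Illustrative numbers (not from the paper): the policy has shifted
# probability mass toward the chosen judgement relative to the reference,
# so the loss falls below log(2) (the value at a zero margin).
loss = dpo_loss(
    logp_chosen_policy=-10.0,
    logp_rejected_policy=-15.0,
    logp_chosen_ref=-12.0,
    logp_rejected_ref=-13.0,
    beta=0.1,
)
print(f"{loss:.4f}")
```

The `beta` hyperparameter controls how strongly the loss penalizes deviation from the reference model; larger values sharpen the preference margin.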