Austin Xu
2025
Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings
Austin Xu | Srijan Bansal | Yifei Ming | Semih Yavuz | Shafiq Joty
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The large language model (LLM)-as-judge paradigm has been used to meet the demand for cheap, reliable, and fast evaluation of model outputs during AI system development and post-deployment monitoring. While judge models—LLMs finetuned to specialize in assessing and critiquing model outputs—have been touted as general-purpose evaluators, they are typically evaluated only on non-contextual scenarios, such as instruction following. The omission of contextual settings—those where external information is used as context to generate an output—is surprising given the increasing prevalence of retrieval-augmented generation (RAG) and summarization use cases. Contextual assessment is uniquely challenging, as evaluation often depends on practitioner priorities, leading to conditional evaluation criteria (e.g., comparing responses based on factuality and then considering completeness only if they are equally factual). To address this gap, we propose ContextualJudgeBench, a judge benchmark with 2,000 challenging response pairs across eight splits inspired by real-world contextual evaluation scenarios. We build our benchmark with a multi-pronged data construction pipeline that leverages both existing human annotations and model-based perturbations. Our comprehensive study across 11 judge models and 7 general-purpose models reveals that contextual information and assessment criteria present a significant challenge even to state-of-the-art models. For example, o1, the best-performing model, barely reaches 55% consistent accuracy.
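As a concrete illustration of two ideas from this abstract, the sketch below shows a conditional pairwise criterion (prefer the more factual response, falling back to completeness only when factuality ties) and a consistent-accuracy computation, where a pair counts as correct only if the judge prefers the better response under both presentation orders. The function names, scoring interfaces, and data layout are illustrative assumptions, not the benchmark's actual implementation.

```python
def conditional_preference(resp_a, resp_b, factuality, completeness):
    """Return True if resp_a is preferred under a conditional criterion:
    compare on factuality first, and use completeness only as a tie-break.
    `factuality` and `completeness` are assumed scoring functions."""
    if factuality(resp_a) != factuality(resp_b):
        return factuality(resp_a) > factuality(resp_b)
    return completeness(resp_a) >= completeness(resp_b)


def consistent_accuracy(pairs, judge_prefers_first):
    """pairs: list of (better_response, worse_response) tuples.
    judge_prefers_first(a, b) -> True if the judge prefers a over b.
    A pair is counted as correct only if the judge picks the better
    response in both presentation orders (controls for position bias)."""
    correct = sum(
        1
        for better, worse in pairs
        if judge_prefers_first(better, worse) and not judge_prefers_first(worse, better)
    )
    return correct / len(pairs) if pairs else 0.0
```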
Direct Judgement Preference Optimization
PeiFeng Wang | Austin Xu | Yilun Zhou | Caiming Xiong | Shafiq Joty
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
To meet the increasing need for timely and accurate evaluation of large language model (LLM) responses, training LLM-as-judges to evaluate and critique other model responses has emerged as a popular paradigm. However, existing judge models are largely trained with supervised finetuning (SFT) on small data scales to perform limited types of evaluation tasks, fundamentally limiting generalization. To meet the need for strong, generalized judge models, we explore training foundational judge models at large data scales (680K) with direct preference optimization (DPO). Using four training tasks, we form three types of DPO preference pairs targeting different aspects of evaluation: generating meaningful critiques, making accurate judgements, and understanding what comprises good and bad responses. To demonstrate the effectiveness of our method, we train judge models at three sizes (8B, 12B, and 70B parameters) and evaluate on a comprehensive suite of 13 benchmarks (7 pairwise, 4 single rating, and 2 classification). Our models achieve the best aggregate performance, with even our 8B model outperforming GPT-4o on pairwise benchmarks. Further analysis shows that our judge models produce factual and actionable critiques and serve as strong foundational judges for continued finetuning.
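To make the training setup more concrete, here is a minimal sketch of how DPO-style preference pairs for judge training might be represented, together with the standard per-pair DPO loss computed from sequence log-probabilities under the policy and a frozen reference model. The field names and pairing layout are illustrative assumptions rather than the paper's exact data recipe.

```python
import math
from dataclasses import dataclass


@dataclass
class JudgePreferencePair:
    """One DPO training example for a judge model (illustrative layout)."""
    prompt: str    # evaluation instruction plus the response(s) to be assessed
    chosen: str    # preferred judge output, e.g. a correct verdict with a sound critique
    rejected: str  # dispreferred judge output, e.g. a flipped verdict or a weak critique


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective for a single pair: -log(sigmoid(beta * margin)),
    where the margin compares policy vs. reference log-probabilities of the
    chosen and rejected judge outputs."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```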
Co-authors
- Shafiq Joty 2
- Srijan Bansal 1
- Yifei Ming 1
- Peifeng Wang 1
- Caiming Xiong 1