Hongli Zhou
2025
An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4
Hui Huang | Xingyuan Bu | Hongli Zhou | Yingqi Qu | Jing Liu | Muyun Yang | Bing Xu | Tiejun Zhao
Findings of the Association for Computational Linguistics: ACL 2025
Recently, there has been a growing trend of using Large Language Models (LLMs) to evaluate the quality of other LLMs. Many studies have fine-tuned judge models based on open-source LLMs for evaluation. Although these fine-tuned judge models are claimed to achieve evaluation capability comparable to GPT-4, in this work we conduct an empirical study of LLM-as-a-Judge. Our findings indicate that although the fine-tuned judge models achieve high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness, and adaptability. We also reveal that a fine-tuned judge model inherently operates as a task-specific classifier, which imposes these limitations.
2024
Mitigating the Bias of Large Language Model Evaluation
Hongli Zhou | Hui Huang | Yunfei Long | Bing Xu | Conghui Zhu | Hailong Cao | Muyun Yang | Tiejun Zhao
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
Recently, there has been a trend of evaluating Large Language Model (LLM) quality in the LLM-as-a-Judge fashion, namely using another LLM to evaluate the quality of the current output. However, existing judges have been shown to be biased: they favor answers with better superficial quality (such as verbosity and fluency) while ignoring instruction-following ability. In this work, we present a systematic study of the bias of LLM-as-a-Judge. Specifically, for closed-source judge models, we apply calibration to reduce the influence of superficial quality, at both the probability level and the prompt level. For open-source judge models, we propose to mitigate the bias through contrastive training, with curated negative samples that deviate from the instruction but present better superficial quality. We apply our methods to the bias evaluation benchmark, and experimental results show that our methods mitigate the bias by a large margin while maintaining satisfactory evaluation accuracy.
Co-authors
- Hui Huang 2
- Bing Xu 2
- Muyun Yang (杨沐昀) 2
- Tiejun Zhao (赵铁军) 2
- Xingyuan Bu 1