@inproceedings{wu-etal-2025-co,
title = "Co-Eval: Augmenting {LLM}-based Evaluation with Machine Metrics",
author = "Wu, Ling-I and
Wu, Weijie and
Chen, Minyu and
Xue, Jianxin and
Li, Guoqiang",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1307/",
pages = "25765--25787",
ISBN = "979-8-89176-332-6",
abstract = "Large language models (LLMs) are increasingly used as evaluators in natural language generation tasks, offering advantages in scalability and interpretability over traditional evaluation methods. However, existing LLM-based evaluations often suffer from biases and misalignment, particularly in domain-specific tasks, due to limited functional understanding and knowledge gaps. To address these challenges, we first investigate the relationship between an LLM-based evaluator{'}s familiarity with the target task and its evaluation performance. We then introduce the Co-Eval framework, which leverages a criteria planner model and optimized machine metrics to enhance the scalability and fairness of LLM-based evaluation. Experimental results on both general and domain-specific tasks demonstrate that Co-Eval reduces biases, achieving up to a 0.4903 reduction in self-preference bias, and improves alignment with human preferences, with gains of up to 0.324 in Spearman correlation."
}