Zuhao Yang
2025
QAEval: Mixture of Evaluators for Question-Answering Task Evaluation
Tan Yue
|
Rui Mao
|
Xuzhao Shi
|
Shuo Zhan
|
Zuhao Yang
|
Dongyan Zhao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Question answering (QA) tasks serve as a key benchmark for evaluating generation systems. Traditional rule-based metrics, such as accuracy and relaxed-accuracy, struggle with open-ended and unstructured responses. LLM-based evaluation methods offer greater flexibility but suffer from sensitivity to instructions, robustness issues, and high computational costs. To overcome these challenges, we introduce QAEval, a hybrid framework combining rule-based reliability with LLM-based adaptability. QAEval utilizes two high-quality datasets: QAExtract for short-answer extraction and QAScore for scoring model training. By integrating a Mixture of Evaluators model with Dynamic Load Balancing Optimization, QAEval enables accurate, cost-effective QA evaluation. Experimental results show it outperforms models like GPT-4o and Claude-3, achieving 92.3% accuracy with only 0.6B parameters.
Evaluating Text Generation Quality Using Spectral Distances of Surprisal
Zhichen Liu
|
Yongyuan Li
|
Yang Xu
|
Yu Wang
|
Yingfang Yuan
|
Zuhao Yang
Findings of the Association for Computational Linguistics: EMNLP 2025
We propose a novel automatic evaluation metric for open-ended text generation, which is a substantial improvement of the recently developed method, Fourier analysis of cross-entropy (FACE), hence, FACE-2. FACE-2 is a psycholinguistically inspired metric that extracts the dynamic patterns (spectrum) of text surprisal. Examined with open-ended text generation tasks, FACE-2 significantly outperforms a broad set of baseline metrics in revealing the model scaling effect, which scales up to models of 70B parameters, while many other existing metrics fail to capture this effect. We have also confirmed the advantage of FACE-2 in producing stronger agreement with human preferences from a large human-annotated dataset. We advocate for including metrics that mine the dynamics of likelihood in evaluating open-ended text generation, which covers broader aspects of human language than only using static likelihood-based or semantic-based metrics. Code repository: https://github.com/CLCS-SUSTech/FACEScore.