Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition

Kehua Feng, Keyan Ding, Tan Hongzhi, Kede Ma, Zhihua Wang, Shuangquan Guo, Cheng Yuzhou, Ge Sun, Guozhou Zheng, Qiang Zhang, Huajun Chen


Abstract
The past years have witnessed a proliferation of large language models (LLMs). Yet, reliable evaluation of LLMs remains challenging: standard automatic metrics align poorly with human perception of text quality, and sampling informative test examples for human evaluation is inefficient. This paper presents a sample-efficient human evaluation method for LLMs based on the principle of MAximum Discrepancy (MAD) competition. MAD automatically selects a small set of informative input instructions, each of which maximizes the discrepancy between the responses of two LLMs; the responses are then subjected to a three-alternative forced choice by human subjects. The pairwise comparison results over multiple LLMs are aggregated into a global ranking using the Elo rating system. We compare eight representative LLMs in terms of four skills: knowledge understanding, mathematical reasoning, writing, and coding. Experimental results show that the proposed method reliably recovers the “golden” ranking of LLMs with a minimal set of input instructions, which in turn reveals their relative strengths and weaknesses and offers valuable insights for further LLM advancement.
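To make the aggregation step concrete, below is a minimal Python sketch of turning pairwise three-alternative forced choice (3-AFC) outcomes into a global Elo ranking. The K-factor, initial rating, single-pass update order, and model names are illustrative assumptions, not the paper's exact configuration; likewise, the MAD discrepancy measure used to select instructions is not reproduced here.

```python
# Minimal sketch: aggregate pairwise 3-AFC judgments into a global
# ranking with the standard Elo rating system (assumed parameters).
from collections import defaultdict

K = 32              # assumed update step size (K-factor)
INIT_RATING = 1500.0  # assumed initial rating for every model

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_rank(comparisons):
    """comparisons: iterable of (model_a, model_b, s_a), where s_a is the
    human 3-AFC outcome for model_a: 1.0 (win), 0.5 (tie), 0.0 (loss).
    Returns models sorted from strongest to weakest."""
    ratings = defaultdict(lambda: INIT_RATING)
    for a, b, s_a in comparisons:
        e_a = expected_score(ratings[a], ratings[b])
        ratings[a] += K * (s_a - e_a)                  # winner gains
        ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))  # symmetric update
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Usage with three judgments over three hypothetical models:
print(elo_rank([("model-a", "model-b", 1.0),
                ("model-b", "model-c", 0.5),
                ("model-a", "model-c", 1.0)]))
```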
Anthology ID:
2025.acl-long.535
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
10913–10947
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.535/
Cite (ACL):
Kehua Feng, Keyan Ding, Tan Hongzhi, Kede Ma, Zhihua Wang, Shuangquan Guo, Cheng Yuzhou, Ge Sun, Guozhou Zheng, Qiang Zhang, and Huajun Chen. 2025. Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10913–10947, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition (Feng et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.535.pdf