Challenges in Trustworthy Human Evaluation of Chatbots

Wenting Zhao, Alexander M Rush, Tanya Goyal


Abstract
Recently, open community-driven platforms like Chatbot Arena that collect user preference data from site visitors have gained a reputation as trustworthy, publicly available benchmarks for LLM performance. While human evaluation is the gold standard, it is often tricky to implement the guardrails required to collect high-quality annotations. In this paper, we demonstrate that different sources of bad annotations, both malicious and otherwise, can corrupt the reliability of open leaderboard rankings. In particular, we show that only 10% of poor-quality votes by apathetic (site visitors not appropriately incentivized to give correct votes) or adversarial (bad actors seeking to inflate the ranking of a target model) annotators can change the rankings of models by up to 5 places on the leaderboard. Finally, we discuss open challenges in ensuring high-quality human annotations.
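To make the vote-corruption scenario concrete, below is a minimal simulation sketch (not the authors' code or data): it samples pairwise battles between hypothetical models with fixed latent skills, injects a fraction of adversarial votes that always favor a target model, and compares the target's leaderboard position under Elo-style ratings before and after corruption. All model names, skill values, and the choice of online Elo updates are illustrative assumptions; Chatbot Arena itself fits a Bradley-Terry model to its battle logs.

```python
import numpy as np

rng = np.random.default_rng(0)
models = [f"model_{i}" for i in range(10)]        # hypothetical models
true_skill = np.linspace(0.0, 3.0, len(models))   # hypothetical latent quality

def simulate_votes(n_battles, adversarial_frac=0.0, target=0):
    """Sample pairwise battles; a fraction of votes always favors `target`."""
    votes = []
    for _ in range(n_battles):
        if rng.random() < adversarial_frac:
            # Adversarial annotator: pit the target against a random opponent
            # and always vote for the target.
            opp = int(rng.choice([m for m in range(len(models)) if m != target]))
            votes.append((target, opp, target))
        else:
            a, b = (int(x) for x in rng.choice(len(models), size=2, replace=False))
            p_a = 1.0 / (1.0 + np.exp(-(true_skill[a] - true_skill[b])))
            votes.append((a, b, a if rng.random() < p_a else b))
    return votes

def elo_ratings(votes, k=4.0):
    """Run standard online Elo updates over the vote stream."""
    r = np.zeros(len(models))
    for a, b, winner in votes:
        expected_a = 1.0 / (1.0 + 10 ** ((r[b] - r[a]) / 400))
        score_a = 1.0 if winner == a else 0.0
        r[a] += k * (score_a - expected_a)
        r[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r

def rank_of(model_idx, ratings):
    """1-indexed leaderboard position (1 = best)."""
    return int(np.argsort(-ratings).tolist().index(model_idx)) + 1

target = 2                                        # a mid-tier model to inflate
clean = elo_ratings(simulate_votes(20_000))
noisy = elo_ratings(simulate_votes(20_000, adversarial_frac=0.10, target=target))
print("rank with clean votes:  ", rank_of(target, clean))
print("rank with 10% bad votes:", rank_of(target, noisy))
```

Under these assumptions, the target model typically climbs several leaderboard positions, qualitatively mirroring the kind of rank shifts the abstract describes.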
Anthology ID:
2025.findings-naacl.186
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3359–3365
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.186/
Cite (ACL):
Wenting Zhao, Alexander M Rush, and Tanya Goyal. 2025. Challenges in Trustworthy Human Evaluation of Chatbots. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3359–3365, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Challenges in Trustworthy Human Evaluation of Chatbots (Zhao et al., Findings 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.186.pdf