SPHERE: An Evaluation Card for Human-AI Systems
Dora Zhao | Qianou Ma | Xinran Zhao | Chenglei Si | Chenyang Yang | Ryan Louie | Ehud Reiter | Diyi Yang | Tongshuang Wu
Findings of the Association for Computational Linguistics: ACL 2025
In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and facilitate discussion of human-AI system evaluation design options, we present an evaluation card, SPHERE, which encompasses five key dimensions: 1) What is being evaluated? 2) How is the evaluation conducted? 3) Who is participating in the evaluation? 4) When is the evaluation conducted? 5) How is the evaluation validated? We conduct a review of 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement, and provide three recommendations for improving the validity and rigor of evaluation practices.