@inproceedings{san-joaquin-etal-2026-scorecard,
title = "Scorecard of {AI} Benchmark Quality",
author = "San Joaquin, Ayrton and
Gipi{\v{s}}kis, Rokas and
Chin, Ze Shen",
editor = "Akhtar, Mubashara and
Batzner, Jan and
Choshen, Leshem and
Ghosh, Avijit and
Gohar, Usman and
Mickel, Jennifer and
Pant, Ichhya and
Talat, Zeerak and
Lin, Michelle",
booktitle = "Proceedings of the Workshop on Evaluating Evaluations ({E}val{E}val)",
month = jul,
year = "2026",
address = "San Diego, CA",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.25/",
pages = "128--160",
ISBN = "979-8-89176-429-3",
abstract = "Effective AI risk assessment relies on the quality of evaluations. Currently, there are large quality differences, such as in construct validity and annotation, between existing benchmarks. In this work, we propose a quality scorecard for benchmarks designed to make this diversity easier to navigate. The scorecard employs two main components: dimensions, which provide granular scores of an evaluation under that dimension, and classifications, which correspond to concrete use-cases ranging from research to post-deployment. By establishing a common language and objective methods, this framework aims to aid in transparency and raise the baseline quality of benchmarks used across the ecosystem."
}