Rokas Gipiškis
2026
Evaluation Cards for XAI Metrics
Rokas Gipiškis | Olga Kurasova
Proceedings of the Workshop on Evaluating Evaluations (EvalEval)
The evaluation of explainable AI (XAI) methods is affected by a lack of standardization. Metrics are inconsistently defined, incompletely reported, and rarely validated against common baselines. In this paper, we identify transparency of evaluation reporting as a central, under-addressed problem. We propose the XAI Evaluation Card, a documentation template analogous to model cards, designed to accompany any study that introduces an XAI evaluation metric. The card covers explicit declaration of target properties, grounding levels, metric assumptions, validation evidence, gaming risks, and known failure cases. We argue that adopting this template as a community norm would reduce evaluation fragmentation, support meta-analysis, and improve accountability in XAI research.
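The sections listed in the abstract above can be pictured as a simple record. The following is a minimal sketch only, not the authors' template: the class name `XAIEvaluationCard`, the field names, the `missing_sections` helper, and the example values are all assumptions made for illustration, derived solely from the six items the abstract names.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class XAIEvaluationCard:
    """Hypothetical record mirroring the card sections named in the abstract."""

    metric_name: str
    target_properties: List[str] = field(default_factory=list)
    grounding_levels: List[str] = field(default_factory=list)
    metric_assumptions: List[str] = field(default_factory=list)
    validation_evidence: List[str] = field(default_factory=list)
    gaming_risks: List[str] = field(default_factory=list)
    known_failure_cases: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty, in field order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Placeholder values for illustration; they do not describe any real metric.
card = XAIEvaluationCard(
    metric_name="example-metric",
    target_properties=["faithfulness"],
)
print(card.missing_sections())
```

A check such as `missing_sections` is one way incomplete reporting could be flagged before a card is published; the actual template may organize or validate its sections differently.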
Scorecard of AI Benchmark Quality
Ayrton San Joaquin | Rokas Gipiškis | Ze Shen Chin
Proceedings of the Workshop on Evaluating Evaluations (EvalEval)
Effective AI risk assessment relies on the quality of evaluations. Currently, existing benchmarks differ widely in quality, for example in construct validity and annotation. In this work, we propose a quality scorecard for benchmarks designed to make this diversity easier to navigate. The scorecard has two main components: dimensions, each of which provides a granular score for an evaluation, and classifications, which correspond to concrete use cases ranging from research to post-deployment. By establishing a common language and objective methods, this framework aims to improve transparency and raise the baseline quality of benchmarks used across the ecosystem.