BenchNavigator: A Discovery Interface for Comparing LLM Benchmarks

Anna Sokol; Inge Vejsbjerg; Elizabeth M. Daly; David Piorkowski; Michael Hind; Nuno Moniz; Nitesh V. Chawla

Abstract

Evaluating large language models (LLMs) requires selecting benchmarks that fit the intended use case. However, the rapid growth of benchmarks has made discovery and comparison difficult, because practitioners must assemble information across papers, repositories, and dataset cards with heterogeneous metadata, inconsistent terminology, and uneven documentation. Prior work improves individual benchmark documentation and quality assessment, but does not provide a uniform way to compare benchmarks during discovery. We survey practitioners, analyze multi-source benchmark metadata, and identify the fields needed for effective benchmark discovery. We introduce BenchNavigator, a prototype that organizes heterogeneous metadata into a coherent, provenance-preserving interface aligned with practitioner priorities. Our results show that benchmark metadata can be presented in a comparable form without imposing new reporting burdens on benchmark producers. We frame this contribution as discovery infrastructure, not as a method for scoring benchmark quality or replacing contextual evaluation.

Anthology ID: 2026.evaleval-1.29
Volume: Proceedings of the Workshop on Evaluating Evaluations (EvalEval)
Month: July
Year: 2026
Address: San Diego, CA
Editors: Mubashara Akhtar, Jan Batzner, Leshem Choshen, Avijit Ghosh, Usman Gohar, Jennifer Mickel, Ichhya Pant, Zeerak Talat, Michelle Lin
Venues: EvalEval | WS
Publisher: Association for Computational Linguistics
Pages: 174–200
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.29/
Cite (ACL): Anna Sokol, Inge Vejsbjerg, Elizabeth M. Daly, David Piorkowski, Michael Hind, Nuno Moniz, and Nitesh V. Chawla. 2026. BenchNavigator: A Discovery Interface for Comparing LLM Benchmarks. In Proceedings of the Workshop on Evaluating Evaluations (EvalEval), pages 174–200, San Diego, CA. Association for Computational Linguistics.
Cite (Informal): BenchNavigator: A Discovery Interface for Comparing LLM Benchmarks (Sokol et al., EvalEval 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.evaleval-1.29.pdf
