Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic

Finnur Ágúst Ingimundarson, Steinunn Rut Fridriksdottir, Bjarki Ármannsson, Iris Nowenstein, Steinþór Steingrímsson


Abstract
This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods, particularly for low- and medium-resource languages. We show that benchmarks which include unverified synthetic or machine-translated data commonly contain severely flawed test examples that are likely to skew results and undermine the tests’ validity. We warn against using such methods without verification in low- and medium-resource settings, as the translation quality can, at best, only be as good as the MT quality for a given language at a given time. Indeed, the results of our quantitative error analysis of existing benchmarks for Icelandic show clear differences between human-authored or human-translated benchmarks and synthetic or machine-translated ones.
Anthology ID:
2026.lrec-main.369
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resources Association
Pages:
4702–4715
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.369/
Cite (ACL):
Finnur Ágúst Ingimundarson, Steinunn Rut Fridriksdottir, Bjarki Ármannsson, Iris Nowenstein, and Steinþór Steingrímsson. 2026. Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 4702–4715, Palma de Mallorca, Spain. ELRA Language Resources Association.
Cite (Informal):
Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic (Ingimundarson et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.369.pdf