Finnur Ágúst Ingimundarson
2026
Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic
Finnur Ágúst Ingimundarson | Steinunn Rut Fridriksdottir | Bjarki Ármannsson | Iris Nowenstein | Steinþór Steingrímsson
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods, particularly in low/medium-resource languages. We show that benchmarks that include synthetic or machine-translated data that has not been verified in any way commonly contain severely flawed test examples that are likely to skew the results and undermine the tests’ validity. We warn against the use of such methods without verification in low/medium-resource settings, as the translation quality can, at best, only be as good as MT quality for a given language at any given time. Indeed, the results of our quantitative error analysis on existing benchmarks for Icelandic show clear differences between human-authored/-translated benchmarks and synthetic or machine-translated benchmarks.
2025
An Icelandic Linguistic Benchmark for Large Language Models
Bjarki Ármannsson | Finnur Ágúst Ingimundarson | Einar Freyr Sigurðsson
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
This paper introduces a linguistic benchmark for Icelandic-language LLMs, the first of its kind manually constructed by native speakers. We report on the scores obtained by current state-of-the-art models, which indicate room for improvement, and discuss the theoretical problems involved in creating such a benchmark and scoring a model’s performance.