WikiQA-IS: Assisted Benchmark Generation and Automated Evaluation of Icelandic Cultural Knowledge in LLMs
Þórunn Arnardóttir, Elías Bjartur Einarsson, Garðar Ingvarsson Juto, Þorvaldur Páll Helgason, Hafsteinn Einarsson
Abstract
This paper presents WikiQA-IS, a novel question-answering dataset focusing on Icelandic culture and history, along with an automated pipeline for dataset generation and evaluation. Leveraging GPT-4 to create questions and answers based on Icelandic Wikipedia articles and news sources, we produced a high-quality corpus of 2,000 question-answer pairs. We introduce an automatic evaluation method using GPT-4o as a judge, which shows strong agreement with human evaluations. Our benchmark reveals varying performances across different language models, with closed-source models generally outperforming open-weights alternatives. This work contributes a resource for evaluating language models’ knowledge of Icelandic culture and offers a replicable framework for creating similar datasets in other cultural contexts.
- Anthology ID:
- 2025.resourceful-1.13
- Volume:
- Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)
- Month:
- March
- Year:
- 2025
- Address:
- Tallinn, Estonia
- Editors:
- Špela Arhar Holdt, Nikolai Ilinykh, Barbara Scalvini, Micaella Bruton, Iben Nyholm Debess, Crina Madalina Tudor
- Venues:
- RESOURCEFUL | WS
- Publisher:
- University of Tartu Library, Estonia
- Pages:
- 64–73
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2025.resourceful-1.13/
- Cite (ACL):
- Þórunn Arnardóttir, Elías Bjartur Einarsson, Garðar Ingvarsson Juto, Þorvaldur Páll Helgason, and Hafsteinn Einarsson. 2025. WikiQA-IS: Assisted Benchmark Generation and Automated Evaluation of Icelandic Cultural Knowledge in LLMs. In Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025), pages 64–73, Tallinn, Estonia. University of Tartu Library, Estonia.
- Cite (Informal):
- WikiQA-IS: Assisted Benchmark Generation and Automated Evaluation of Icelandic Cultural Knowledge in LLMs (Þórunn Arnardóttir et al., RESOURCEFUL 2025)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2025.resourceful-1.13.pdf