Judging Instruction Responses in a Low-Resource Language: A Case Study on Basque

David Ponce; Harritxu Gete; Thierry Etchegoyhen; Irune Zubiaga; Aitor Soroa

Abstract

Evaluating the quality of answers to a given instruction is a demanding and time-consuming task, which limits the scalability of human assessment. Large language models (LLMs) have been proposed as automatic judges to reduce this effort, but their reliability in low-resource contexts remains uncertain. The premise that humans are reliable judges of fine-grained response quality also needs to be assessed, if correlation between human and automated judges on this task is to be treated as a gold standard. In this work, we investigate the performance of various LLM-as-a-judge models in a low-resource scenario, namely Basque, and evaluate their correlation with human judgments. We also measure the agreement among the human judgments themselves, to assess their viability as a valid reference. To perform our experiments, we translated and manually post-edited the Just-Eval benchmark, a suite of benchmarks tackling fine-grained aspects of response quality. We also extend the evaluation with a novel category aimed at judging both language consistency and grammaticality. Our results show that state-of-the-art models exhibit fairly poor correlations with humans and amongst themselves, calling for the development of dedicated LLM-as-a-judge models for this language.

Anthology ID: 2026.lrec-main.19
Volume: Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month: May
Year: 2026
Address: Palma de Mallorca, Spain
Editors: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue: LREC
Publisher: ELRA Language Resource Association
Pages: 281–298
URL: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.19/
Cite (ACL): David Ponce, Harritxu Gete, Thierry Etchegoyhen, Irune Zubiaga, and Aitor Soroa. 2026. Judging Instruction Responses in a Low-Resource Language: A Case Study on Basque. International Conference on Language Resources and Evaluation, main:281–298.
Cite (Informal): Judging Instruction Responses in a Low-Resource Language: A Case Study on Basque (Ponce et al., LREC 2026)
PDF: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.19.pdf
