@inproceedings{siro-etal-2026-learning,
title = "Learning to Judge: {LLM}s Designing and Applying Evaluation Rubrics",
author = "Siro, Clemencia and
Aliannejadi, Pourya and
Aliannejadi, Mohammad",
editor = "Demberg, Vera and
Inui, Kentaro and
Marquez, Llu{\'i}s",
booktitle = "Findings of the {A}ssociation for {C}omputational {L}inguistics: {EACL} 2026",
month = mar,
year = "2026",
address = "Rabat, Morocco",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.335/",
pages = "6371--6389",
ISBN = "979-8-89176-386-9",
abstract = "Large language models (LLMs) are increasingly used as evaluators for natural language generation, applying human-defined rubrics to assess system outputs. However, human rubrics are often static and misaligned with how models internally represent language quality. We introduce GER-Eval (Generating Evaluation Rubrics for Evaluation) to investigate whether LLMs can design and use their own evaluation rubrics. We evaluate the semantic coherence and scoring reliability of LLM-defined criteria and their alignment with human criteria. LLMs reliably generate interpretable and task-aware evaluation dimensions and apply them within models, but their scoring reliability degrades in factual and knowledge-intensive settings. Closed-source models such as GPT-4o achieve higher agreement and cross-model generalization than open-weight models such as Llama. Our findings position evaluation as a learned linguistic capability of LLMs{---}consistent within models but fragmented across them{---}and call for new methods that jointly model human and LLM evaluative language to improve reliability and interpretability."
}