(Towards) Scalable Reliable Automated Evaluation with Large Language Models

Bertil Braun, Martin Forell


Abstract
Evaluating the quality and relevance of textual outputs from Large Language Models (LLMs) remains challenging and resource-intensive.Existing automated metrics often fail to capture the complexity and variability inherent in LLM-generated outputs.Moreover, these metrics typically rely on explicit reference standards, limiting their use mostly to domains with objective benchmarks.This work introduces a novel evaluation framework designed to approximate expert-level assessments of LLM-generated content.The proposed method employs pairwise comparisons of outputs by multiple LLMs, reducing biases from individual models.An Elo rating system is used to generate stable and interpretable rankings.Adjustable agreement thresholds—from full unanimity to majority voting—allow flexible control over evaluation confidence and coverage.The method’s effectiveness is demonstrated through evaluating competency profiles extracted from scientific abstracts.Preliminary results show that automatically derived rankings correlate well with expert judgments, significantly reducing the need for extensive human intervention.By offering a scalable, consistent, and domain-agnostic evaluation layer, the framework supports more efficient and reliable quality assessments of LLM outputs across diverse applications.
Anthology ID:
2025.gem-1.28
Volume:
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Month:
July
Year:
2025
Address:
Vienna, Austria and virtual meeting
Editors:
Kaustubh Dhole, Miruna Clinciu
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
320–336
URL:
https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.28/
Cite (ACL):
Bertil Braun and Martin Forell. 2025. (Towards) Scalable Reliable Automated Evaluation with Large Language Models. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 320–336, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
(Towards) Scalable Reliable Automated Evaluation with Large Language Models (Braun & Forell, GEM 2025)
PDF:
https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.28.pdf