Statistical Multicriteria Evaluation of LLM-Generated Text

Esteban Garces Arias, Hannah Blocher, Julian Rodemann, Matthias Assenmacher, Christoph Jansen


Abstract
Assessing the quality of LLM-generated text remains a fundamental challenge in natural language processing. Current evaluation approaches often rely on isolated metrics or simplistic aggregations that fail to capture the nuanced trade-offs between coherence, diversity, fluency, and other relevant indicators of text quality. In this work, we adapt a recently proposed framework for statistical inference based on Generalized Stochastic Dominance (GSD) that addresses three critical limitations in existing benchmarking methodologies: the inadequacy of single-metric evaluation, the incompatibility between cardinal automatic metrics and ordinal human judgments, and the lack of inferential statistical guarantees. The GSD-front approach enables simultaneous evaluation across multiple quality dimensions while respecting their different measurement scales, building upon partial orders of decoding strategies, thus avoiding arbitrary weighting of the involved metrics. By applying this framework to evaluate common decoding strategies against human-generated text, we demonstrate its ability to identify statistically significant performance differences while accounting for potential deviations from the i.i.d. assumption of the sampling design.
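To give a rough intuition for the dominance-based comparison described in the abstract, the following Python sketch computes a plain componentwise (Pareto-style) front over a few decoding strategies scored on several quality metrics. This is a simplified illustration only, not the paper's GSD-front: the actual framework additionally respects mixed ordinal/cardinal measurement scales and provides inferential statistical guarantees. All strategy names and metric values below are hypothetical.

```python
# Simplified illustration only (not the authors' GSD implementation):
# a componentwise "dominance front" over decoding strategies scored on
# several text-quality metrics. The GSD-front of the paper additionally
# handles mixed ordinal/cardinal scales and statistical testing.
from typing import Dict, List

# Hypothetical scores; higher is assumed better for every metric.
scores: Dict[str, Dict[str, float]] = {
    "greedy":             {"coherence": 0.61, "diversity": 0.22, "fluency": 0.72},
    "top_p_0.95":         {"coherence": 0.58, "diversity": 0.47, "fluency": 0.71},
    "contrastive_search": {"coherence": 0.63, "diversity": 0.45, "fluency": 0.73},
}

def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """a dominates b: at least as good on every metric, strictly better on at least one."""
    return all(a[m] >= b[m] for m in a) and any(a[m] > b[m] for m in a)

def front(strategies: Dict[str, Dict[str, float]]) -> List[str]:
    """Strategies that no other strategy dominates (the non-dominated front)."""
    return [s for s, v in strategies.items()
            if not any(dominates(w, v) for t, w in strategies.items() if t != s)]

print(front(scores))  # -> ['top_p_0.95', 'contrastive_search']
```

In this toy example, greedy decoding is dominated by contrastive search on all three metrics and therefore drops out of the front; the two remaining strategies are incomparable, which is exactly the kind of partial-order outcome the multicriteria framework is designed to report rather than resolve by arbitrary weighting.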
Anthology ID: 2025.inlg-main.20
Volume: Proceedings of the 18th International Natural Language Generation Conference
Month: October
Year: 2025
Address: Hanoi, Vietnam
Editors: Lucie Flek, Shashi Narayan, Lê Hồng Phương, Jiahuan Pei
Venue: INLG
SIG: SIGGEN
Publisher: Association for Computational Linguistics
Pages: 338–351
URL: https://preview.aclanthology.org/ingest-luhme/2025.inlg-main.20/
Cite (ACL): Esteban Garces Arias, Hannah Blocher, Julian Rodemann, Matthias Assenmacher, and Christoph Jansen. 2025. Statistical Multicriteria Evaluation of LLM-Generated Text. In Proceedings of the 18th International Natural Language Generation Conference, pages 338–351, Hanoi, Vietnam. Association for Computational Linguistics.
Cite (Informal): Statistical Multicriteria Evaluation of LLM-Generated Text (Garces Arias et al., INLG 2025)
PDF: https://preview.aclanthology.org/ingest-luhme/2025.inlg-main.20.pdf