Never Truly Out of Fashion: A Retrospective Look at Evaluation in NLG

Patrícia Schmidtová; Saad Mahamood; Ondřej Dušek

Abstract

Human evaluation (HE) remains the gold standard for assessing natural language generation (NLG) systems, yet automatic metrics are cheaper and faster, creating mounting pressure to skip it. We ask how evaluation practices have changed as NLG research has scaled. We analyse 24,291 papers from the ACL Anthology (1952–2025) using regular-expression-based keyword analysis. Before 1990, the majority of NLG papers reported no evaluation at all; today, evaluation is near-universal and HE has held broadly stable over the past decade rather than collapsing. However, large language model (LLM) judges (LLM-as-a-judge) have emerged rapidly since 2023. While they currently serve predominantly as a complement to human evaluation rather than a full substitute for it, a substantial share of papers already use LLM judges without any human validation. Faithfulness has become the fastest-rising evaluation criterion since 2020, coming back into fashion after almost 15 years of decline and tracking the prominence of hallucination research. Meanwhile, criteria such as grammaticality and fluency are receding, suggesting these qualities may increasingly be taken for granted as model outputs improve. Our findings provide a longitudinal baseline for tracking where the field stands.

Anthology ID: 2026.retroeval-main.8
Volume: Proceedings of the 1st Symposium on Natural Language Generation Evaluations
Month: June
Year: 2026
Address: Aberdeen, United Kingdom
Editors: Saad Mahamood, David M. Howcroft, Kees van Deemter, Simone Balloccu, Adarsa Sivaprasad, Barkavi Sundararajan, Alberto Bugarín Diz, Jose María Alonso-Moral
Venue: RetroEval
Publisher: Association for Computational Linguistics
Pages: 63–72
URL: https://preview.aclanthology.org/ingest-retroeval/2026.retroeval-main.8/
Cite (ACL): Patrícia Schmidtová, Saad Mahamood, and Ondřej Dušek. 2026. Never Truly Out of Fashion: A Retrospective Look at Evaluation in NLG. In Proceedings of the 1st Symposium on Natural Language Generation Evaluations, pages 63–72, Aberdeen, United Kingdom. Association for Computational Linguistics.
Cite (Informal): Never Truly Out of Fashion: A Retrospective Look at Evaluation in NLG (Schmidtová et al., RetroEval 2026)
PDF: https://preview.aclanthology.org/ingest-retroeval/2026.retroeval-main.8.pdf
