Position: What Are We Measuring? Rethinking Evaluation in Natural Language Generation

Wajdi Zaghouani

Abstract

The field of natural language generation has accumulated a rich ecosystem of automatic evaluation metrics, yet it lacks a coherent theory of what those metrics are actually measuring. Drawing on measurement theory from the quantitative social sciences, this paper argues that current NLG evaluation practices suffer from a fundamental construct validity problem: metrics are treated as proxies for output quality without explicit specification of the underlying constructs they are meant to operationalize. We examine four dominant evaluation paradigms (reference-based metrics, embedding-based metrics, LLM-as-judge, and human evaluation) and demonstrate that each conflates construct definition with operationalization. Building on a long psychometric tradition reaching back to Cronbach and Meehl (1955) and on recent NLP work that has begun to apply this tradition to bias measurement, dialogue evaluation, and benchmark design, we propose that the field adopt a measurement modeling perspective for NLG evaluation. We borrow the concepts of construct validity, reliability, and consequential validity as a foundation for more principled evaluation, and we outline a preliminary taxonomy of NLG quality constructs as a starting point for this work.

Anthology ID: 2026.gem-main.79
Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 1021–1028
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.79/
Cite (ACL): Wajdi Zaghouani. 2026. Position: What Are We Measuring? Rethinking Evaluation in Natural Language Generation. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 1021–1028, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Position: What Are We Measuring? Rethinking Evaluation in Natural Language Generation (Zaghouani, GEM 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.79.pdf
