@inproceedings{cachola-etal-2025-evaluating,
title = "Evaluating the Evaluators: Are readability metrics good measures of readability?",
author = "Cachola, Isabel and
Khashabi, Daniel and
Dredze, Mark",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1225/",
pages = "24022--24038",
ISBN = "979-8-89176-332-6",
abstract = "Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. In this paper, we conduct a thorough survey of PLS literature, and identify that the current standard practice for readability evaluation is to use traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL). However, despite proven utility in other fields, these metrics have not been compared to human readability judgments in PLS. We evaluate 8 readability metrics and show that most correlate poorly with human judgments, including the most popular metric, FKGL. We then show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments. Extending our analysis to PLS datasets, which contain summaries aimed at non-expert audiences, we find that LMs better capture deeper measures of readability, such as required background knowledge, and lead to different conclusions than the traditional metrics. Based on these findings, we offer recommendations for best practices in the evaluation of plain language summaries."
}