Standardizing the Measurement of Text Diversity: A Tool and Comparative Analysis

Chantal Shaib; Venkata S Govindarajan; Joe Barrow; Jiuding Sun; Alexa Siu; Byron C. Wallace; Ani Nenkova

Abstract

The diversity across outputs generated by LLMs shapes how their quality and utility are perceived. High lexical diversity is often desirable, but there is no standard method to measure this property. Templated answer structures and “canned” responses across different documents are readily noticeable, but difficult to visualize across large corpora. This work aims to standardize the measurement of text diversity. Specifically, we empirically investigate the convergent validity of existing scores across English texts, and release diversity, an open-source Python package (https://pypi.org/project/diversity/, https://github.com/cshaib/diversity) for measuring and extracting repetition in text. We also build a platform (https://ai-templates.app) based on diversity for users to interactively explore repetition in text. We find that fast compression algorithms capture information similar to what is measured by slow-to-compute n-gram overlap homogeneity scores. Further, a combination of three measures (compression ratios, self-repetition of long n-grams, and Self-BLEU) is sufficient to report, as these measures have low mutual correlation.
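
To make two of the measures named in the abstract concrete, here is a minimal Python sketch using only the standard library. It does not use the diversity package and does not reproduce the paper's exact definitions: the function names, the choice of gzip, the whitespace tokenization, and the "fraction of documents sharing an n-gram" formulation of self-repetition are all assumptions made for illustration.

```python
import gzip
from collections import Counter


def compression_ratio(texts):
    """Size of the concatenated texts divided by their gzip-compressed size.

    Higher values suggest more redundancy (less diversity) across the texts.
    Assumption: gzip stands in for whichever compressor a real tool uses.
    """
    data = "\n".join(texts).encode("utf-8")
    return len(data) / len(gzip.compress(data))


def self_repetition(texts, n=4):
    """Fraction of documents that share at least one word n-gram with another document.

    Simplified, hypothetical variant: whitespace tokens, lowercased,
    and each n-gram counted at most once per document.
    """
    if not texts:
        return 0.0
    doc_ngrams = []
    for text in texts:
        tokens = text.lower().split()
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_ngrams.append(grams)
    # Number of documents in which each n-gram occurs.
    doc_counts = Counter(gram for grams in doc_ngrams for gram in grams)
    repeated = sum(
        any(doc_counts[gram] > 1 for gram in grams) for grams in doc_ngrams
    )
    return repeated / len(texts)
```

On a corpus made of many copies of one sentence, compression_ratio grows well above 1, while a single short sentence compresses poorly and yields a ratio below 1 because of gzip's fixed overhead. Scores are therefore only comparable across corpora of similar size.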

Anthology ID: 2025.ijcnlp-demo.5
Volume: Proceedings of The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations
Month: December
Year: 2025
Address: Mumbai, India
Editors: Xuebo Liu, Ayu Purwarianti
Venue: IJCNLP
Publisher: Association for Computational Linguistics
Pages: 36–46
URL: https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-demo.5/
Cite (ACL): Chantal Shaib, Venkata S Govindarajan, Joe Barrow, Jiuding Sun, Alexa Siu, Byron C Wallace, and Ani Nenkova. 2025. Standardizing the Measurement of Text Diversity: A Tool and Comparative Analysis. In Proceedings of The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations, pages 36–46, Mumbai, India. Association for Computational Linguistics.
Cite (Informal): Standardizing the Measurement of Text Diversity: A Tool and Comparative Analysis (Shaib et al., IJCNLP 2025)
PDF: https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-demo.5.pdf
