@inproceedings{mikhaylovskiy-2025-zipfs,
    title = "{Z}ipf{'}s and {H}eaps{'} Laws for Tokens and {LLM}-generated Texts",
author = "Mikhaylovskiy, Nikolay",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.837/",
doi = "10.18653/v1/2025.findings-emnlp.837",
pages = "15469--15481",
ISBN = "979-8-89176-335-7",
    abstract = "The frequency distribution of words in human-written texts roughly follows a simple mathematical form known as Zipf{'}s law. Somewhat less well known is the related Heaps{'} law, which describes a sublinear power-law growth of vocabulary size with document size. We study the applicability of Zipf{'}s and Heaps{'} laws to texts generated by Large Language Models (LLMs). We empirically show that Heaps{'} and Zipf{'}s laws only hold for LLM-generated texts in a narrow, model-dependent temperature range. These temperatures have an optimal value close to $t=1$ for all the base models except the large Llama models, are higher for instruction-finetuned models, and do not depend on the model size or prompting. This independently confirms the recent discovery of sampling-temperature-dependent phase transitions in LLM-generated texts."
}
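% For reference, the two laws discussed in the abstract are usually written in the
% standard textbook forms below. The notation is illustrative only and is not taken
% from the paper itself:
%
%   Zipf's law:   f(r) \propto r^{-\alpha},  \alpha \approx 1
%                 (f(r) = frequency of the r-th most frequent token)
%   Heaps' law:   V(n) \propto n^{\beta},    0 < \beta < 1
%                 (V(n) = vocabulary size after n tokens)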