Repeated Sequences Reveal Gaps between Large Language Models and Natural Language

Kumiko Tanaka-Ishii

Abstract

Evaluating whether large language models (LLMs) capture the structure of natural language beyond local fluency remains an open challenge. Existing evaluation methods, largely based on task performance or short-context behavior, provide limited insight into the long-range statistical organization of generated text. We propose a complementary evaluation framework based on repeated subsequences. By analyzing their distribution across scales and relating it to higher-order Rényi entropies, we probe how texts reuse previously established structure under finite-length conditions. Experiments on human-written texts and length-matched GPT-generated texts show that, while power-law models can describe restricted ranges of block length, the observed entropy growth is often equally or better characterized by logarithmic–power forms. Across datasets, natural language exhibits stable entropy-growth patterns over accessible ranges, with consistent average behavior despite variability across individual texts. In contrast, GPT-generated texts show systematic and statistically significant shifts in estimated exponents with model size. These results demonstrate that repeated-subsequence entropy provides a quantitative structural diagnostic that reveals systematic differences in long-range organization, distinguishing natural language from state-of-the-art LLM outputs beyond surface-level fluency.
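
The abstract does not state which estimator the paper uses, so the following is only a minimal sketch of the general idea, assuming a plug-in estimate of the order-2 Rényi entropy over length-n token blocks. The function names `repeated_blocks` and `renyi2_block_entropy` are hypothetical and chosen for illustration. The link to repetition is that the sum of squared block probabilities is the collision probability: the chance that two randomly chosen positions hold the same block, which grows when a text reuses subsequences.

```python
from collections import Counter
from math import log


def block_counts(tokens, n):
    """Count every length-n block (contiguous subsequence) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def repeated_blocks(tokens, n):
    """Return the length-n blocks that occur at least twice, with their counts."""
    return {block: c for block, c in block_counts(tokens, n).items() if c > 1}


def renyi2_block_entropy(tokens, n):
    """Plug-in order-2 Renyi entropy (in nats) of length-n blocks.

    Assumption: empirical block frequencies stand in for probabilities,
    so H_2 = -log(sum of squared relative frequencies).
    """
    counts = block_counts(tokens, n)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text is shorter than the block length")
    collision = sum((c / total) ** 2 for c in counts.values())
    return -log(collision)


if __name__ == "__main__":
    text = "a b a b a b".split()
    for n in (1, 2, 3):
        print(n, repeated_blocks(text, n), renyi2_block_entropy(text, n))
```

Evaluating `renyi2_block_entropy` over a range of block lengths n gives an entropy-growth curve, to which a power-law or logarithmic form could then be fitted. The plug-in estimate is biased for short texts, which is one reason finite-length conditions matter.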

Anthology ID: 2026.acl-long.379
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 8367–8382
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.379/
Cite (ACL): Kumiko Tanaka-Ishii. 2026. Repeated Sequences Reveal Gaps between Large Language Models and Natural Language. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8367–8382, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Repeated Sequences Reveal Gaps between Large Language Models and Natural Language (Tanaka-Ishii, ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.379.pdf
Checklist: 2026.acl-long.379.checklist.pdf
