@inproceedings{ju-etal-2025-domain,
    title     = {Domain Regeneration: How well do {LLM}s match syntactic properties of text domains?},
    author    = {Ju, Da and
                 Blix, Hagen and
                 Williams, Adina},
    editor    = {Che, Wanxiang and
                 Nabende, Joyce and
                 Shutova, Ekaterina and
                 Pilehvar, Mohammad Taher},
    booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
    month     = jul,
    year      = {2025},
    address   = {Vienna, Austria},
    publisher = {Association for Computational Linguistics},
    url       = {https://aclanthology.org/2025.findings-acl.120/},
    pages     = {2367--2388},
    isbn      = {979-8-89176-256-5},
    abstract  = {Recent improvement in large language model performance have, in all likelihood, been accompanied by improvement in how well they can approximate the distribution of their training data. In this work, we explore the following question: which properties of text domains do LLMs faithfully approximate, and how well do they do so? Applying observational approaches familiar from corpus linguistics, we prompt a commonly used, opensource LLM to regenerate text from two domains of permissively licensed English text which are often contained in LLM training data{---}Wikipedia and news text. This regeneration paradigm allows us to investigate whether LLMs can faithfully match the original human text domains in a fairly semantically-controlled setting. We investigate varying levels of syntactic abstraction, from more simple properties like sentence length, and article readability, to more complex and higher order properties such as dependency tag distribution, parse depth, and parse complexity. We find that the majority of the regenerated distributions show a shifted mean, a lower standard deviation, and a reduction of the long tail, as compared to the human originals.}
}
Markdown (Informal)
[Domain Regeneration: How well do LLMs match syntactic properties of text domains?](https://aclanthology.org/2025.findings-acl.120/) (Ju et al., Findings 2025)
ACL