Abstract
How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate n-grams from their training data, evaluating both (i) the probability LMs assign to complete training n-grams and (ii) n-novelty, the proportion of n-grams generated by an LM that did not appear in the training data (for arbitrarily large n). To enable arbitrary-length n-gram search over a corpus in constant time w.r.t. corpus size, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for n > 4, LM-generated text is less novel than human-written text, though it is more novel for smaller n. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete n-grams with lower loss if they are more frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release Rusty-DAWG to facilitate further pretraining data research.
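The abstract defines n-novelty as the proportion of n-grams in LM-generated text that never occur in the training data. The sketch below is a minimal illustration of that metric, not the paper's implementation: it uses a plain Python set of training n-grams where the paper instead builds a Rusty-DAWG index so that membership queries stay constant-time with respect to corpus size; the token sequences and variable names are purely hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Iterable[Tuple[str, ...]]:
    """Yield all contiguous n-grams of a token sequence."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i : i + n])


def n_novelty(generated: Sequence[str],
              training_ngrams: Set[Tuple[str, ...]],
              n: int) -> float:
    """Fraction of generated n-grams that do not appear in the training data."""
    gen = list(ngrams(generated, n))
    if not gen:
        return 0.0
    novel = sum(1 for g in gen if g not in training_ngrams)
    return novel / len(gen)


# Toy usage with a hypothetical training corpus and generation.
train_tokens = "the cat sat on the mat".split()
gen_tokens = "the cat sat on the rug".split()
n = 3
# A hash set stands in for the DAWG index used in the paper.
train_index = set(ngrams(train_tokens, n))
print(n_novelty(gen_tokens, train_index, n))  # 0.25: only "on the rug" is novel
```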
- Anthology ID:
- 2024.emnlp-main.800
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 14459–14473
- URL:
- https://preview.aclanthology.org/add_missing_videos/2024.emnlp-main.800/
- DOI:
- 10.18653/v1/2024.emnlp-main.800
- Cite (ACL):
- William Merrill, Noah A. Smith, and Yanai Elazar. 2024. Evaluating n-Gram Novelty of Language Models Using Rusty-DAWG. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14459–14473, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- Evaluating n-Gram Novelty of Language Models Using Rusty-DAWG (Merrill et al., EMNLP 2024)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/2024.emnlp-main.800.pdf