Memorization or Lucky Guesses: Detecting Short Sequences from Copyrighted Dutch News in LLM Output
Joris Veerbeek, Kas Berendsen, Alessandra Polimeno, Antal van den Bosch
Abstract
Demonstrating that large language models have memorized copyrighted material is more feasible for high-volume publishers than for smaller outlets whose content appears less frequently online. This study explores how even short, repeated sequences, rather than full articles, can serve as evidence of memorization. Focusing on Dutch news sources included in the mC4 dataset, we test whether GPT-4 and mT5 reproduce excerpts from thousands of articles, including standardized editorial boilerplate. By comparing results to a post-training baseline and modeling memorization as a survival process, we find that repeated, publication-specific phrases are significantly more likely to be completed verbatim. The approach provides a means to detect empirical evidence of memorization in cases where full reproduction is unlikely.
- Anthology ID:
- 2026.lrec-main.473
- Volume:
- Proceedings of the Fifteenth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2026
- Address:
- Palma de Mallorca, Spain
- Editors:
- Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
- Venue:
- LREC
- Publisher:
- ELRA Language Resource Association
- Pages:
- 5960–5969
- URL:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.473/
- Cite (ACL):
- Joris Veerbeek, Kas Berendsen, Alessandra Polimeno, and Antal van den Bosch. 2026. Memorization or Lucky Guesses: Detecting Short Sequences from Copyrighted Dutch News in LLM Output. International Conference on Language Resources and Evaluation, main:5960–5969.
- Cite (Informal):
- Memorization or Lucky Guesses: Detecting Short Sequences from Copyrighted Dutch News in LLM Output (Veerbeek et al., LREC 2026)
- PDF:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.473.pdf
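
The abstract describes modeling memorization as a survival process. One possible reading, offered here only as an illustrative sketch and not as the authors' procedure, is that a verbatim completion "survives" token by token until the first mismatch with the source text. The function names, the pre-tokenized inputs, and the toy tokens below are all hypothetical, and the sketch ignores censoring and any baseline comparison.

```python
def match_length(reference, generated):
    """Count the leading tokens of `generated` that match `reference` verbatim."""
    n = 0
    for ref_tok, gen_tok in zip(reference, generated):
        if ref_tok != gen_tok:
            break  # the verbatim run ends at the first mismatch
        n += 1
    return n


def survival_curve(match_lengths, horizon):
    """Empirical survival: share of completions still verbatim after k tokens,
    for k = 1..horizon. No censoring is modeled (a simplifying assumption)."""
    if not match_lengths:
        raise ValueError("need at least one completion")
    total = len(match_lengths)
    return [
        sum(1 for m in match_lengths if m >= k) / total
        for k in range(1, horizon + 1)
    ]


# Toy, made-up token sequences (not data from the paper).
reference = ["a", "b", "c", "d"]
completions = [
    ["a", "b", "c", "d"],  # full verbatim completion
    ["a", "b", "x", "d"],  # diverges at the third token
    ["z", "b", "c", "d"],  # diverges immediately
]
lengths = [match_length(reference, c) for c in completions]
curve = survival_curve(lengths, horizon=4)
```

Under this reading, a curve that decays more slowly for repeated, publication-specific phrases than for a baseline would be consistent with the kind of evidence the abstract describes.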