Memorization or Lucky Guesses: Detecting Short Sequences from Copyrighted Dutch News in LLM Output

Joris Veerbeek; Kas Berendsen; Alessandra Polimeno; Antal van den Bosch

Abstract

Demonstrating that large language models have memorized copyrighted material is more feasible for high-volume publishers than for smaller outlets whose content appears less frequently online. This study explores how even short, repeated sequences, rather than full articles, can serve as evidence of memorization. Focusing on Dutch news sources included in the mC4 dataset, we test whether GPT-4 and mT5 reproduce excerpts from thousands of articles, including standardized editorial boilerplate. By comparing results to a post-training baseline and modeling memorization as a survival process, we find that repeated, publication-specific phrases are significantly more likely to be completed verbatim. The approach provides a means to detect empirical evidence of memorization in cases where full reproduction is unlikely.

Anthology ID: 2026.lrec-main.473
Volume: Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month: May
Year: 2026
Address: Palma de Mallorca, Spain
Editors: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue: LREC
Publisher: ELRA Language Resource Association
Pages: 5960–5969
URL: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.473/
Cite (ACL): Joris Veerbeek, Kas Berendsen, Alessandra Polimeno, and Antal van den Bosch. 2026. Memorization or Lucky Guesses: Detecting Short Sequences from Copyrighted Dutch News in LLM Output. International Conference on Language Resources and Evaluation, main:5960–5969.
Cite (Informal): Memorization or Lucky Guesses: Detecting Short Sequences from Copyrighted Dutch News in LLM Output (Veerbeek et al., LREC 2026)
PDF: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.473.pdf
