Alessandra Polimeno

2026

Memorization or Lucky Guesses: Detecting Short Sequences from Copyrighted Dutch News in LLM Output
Joris Veerbeek | Kas Berendsen | Alessandra Polimeno | Antal van den Bosch
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Demonstrating that large language models have memorized copyrighted material is more feasible for high-volume publishers than for smaller outlets whose content appears less frequently online. This study explores how even short, repeated sequences, rather than full articles, can serve as evidence of memorization. Focusing on Dutch news sources included in the mC4 dataset, we test whether GPT-4 and mT5 reproduce excerpts from thousands of articles, including standardized editorial boilerplate. By comparing results to a post-training baseline and modeling memorization as a survival process, we find that repeated, publication-specific phrases are significantly more likely to be completed verbatim. The approach provides a means to detect empirical evidence of memorization in cases where full reproduction is unlikely.

2024

Topic-specific social science theory in stance detection: a proposal and interdisciplinary pilot study on sustainability initiatives
Myrthe Reuver | Alessandra Polimeno | Antske Fokkens | Ana Isabel Lopes
Proceedings of the 4th Workshop on Computational Linguistics for the Political and Social Sciences: Long and short papers

Topic-specificity is often seen as a limitation of stance detection models and datasets, especially for analyzing political and societal debates. However, stances contain topic-specific aspects that are crucial for an in-depth understanding of these debates. Our interdisciplinary approach identifies social science theories on specific debate topics as an opportunity for further defining stance detection research and analyzing online debate. This paper explores sustainability as a debate topic, and connects stance to the sustainability-related Value-Belief-Norm (VBN) theory. VBN theory states that arguments in favor of or against sustainability initiatives contain two dimensions: feeling the power to change the issue with the initiative, and thinking about whether or not the initiative tackles an urgent threat to the environment. In a pilot study with our Reddit European Sustainability Initiatives corpus, we develop an annotation procedure for these complex concepts. We then compare the annotation proficiency of crowd-workers with that of Natural Language Processing (NLP) experts. Both crowd-workers and NLP experts find the tasks difficult, but experts reach more agreement on some difficult examples. This pilot study shows that complex theories about debate topics are feasible and worthwhile as annotation tasks for stance detection.

Co-authors

Antal van den Bosch 1
