Jakub Strebeyko
2026
Śmigiel Dataset: Laying Foundations for Investigating Machine-Generated Text Detection in Polish
Jakub Strebeyko | Alina Wróblewska | Piotr Przybyła
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present Śmigiel, the first open dataset for training and evaluating machine-generated text (MGT) detectors in Polish. The dataset includes a collection of human-written text fragments from six domains, which are used to prompt text generation by eight language models capable of producing credible Polish text. In addition to the raw corpus of over 462K generated texts, we also release a cleaned source- and domain-balanced dataset suitable for training and evaluating MGT detectors. Finally, we conduct preliminary experiments with text classifiers, showing that task difficulty depends on the text domain, the generating language model, and the availability of similar data in training. The results indicate that MGT detection in Polish can be approached with general-purpose classifiers that generalize well to new LLMs but struggle to adapt to genres not represented in the training data.
2025
PolEval 2025 Task 1 Śmigiel: Spotting Machine-Generated Text from LLMs for Polish
Piotr Przybyła | Jakub Strebeyko | Alina Wróblewska
Proceedings of the PolEval 2025 Workshop
This paper introduces the first shared task on machine-generated text (MGT) detection for Polish, organised as part of the PolEval 2025 evaluation campaign. The task evaluates participating systems under three scenarios (unsupervised, constrained, and open) designed to reflect different levels of access to training data. In total, seven systems were submitted. The results indicate that MGT detection for Polish is feasible, with the best-performing constrained systems achieving over 90% accuracy on the main evaluation set. However, performance drops when models are tested on unseen domains or generator models, revealing substantial limitations in generalisation. In the most challenging settings, unsupervised approaches perform better, despite achieving overall lower performance. This shared task establishes a new benchmark for MGT detection in Polish. The publicly released Śmigiel dataset is intended to support future research on robust and generalisable MGT detection methods.