Karolina Seweryn


2025

PL-Guard: Benchmarking Language Model Safety for Polish
Aleksandra Krasnodebska | Karolina Seweryn | Szymon Łukasik | Wojciech Kusa
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)

We present a benchmark dataset for evaluating language model safety in Polish, addressing the underrepresentation of medium-resource languages in existing safety assessments. Our dataset includes both original and adversarially perturbed examples. We fine-tune and evaluate multiple models—LlamaGuard-3-8B, a HerBERT-based classifier, and PLLuM—and find that the HerBERT-based model outperforms others, especially under adversarial conditions.

PLLuM-Align: Polish Preference Dataset for Large Language Model Alignment
Karolina Seweryn | Anna Kołos | Agnieszka Karlińska | Katarzyna Lorenc | Katarzyna Dziewulska | Maciej Chrabaszcz | Aleksandra Krasnodebska | Paula Betscher | Zofia Cieślińska | Katarzyna Kowol | Julia Moska | Dawid Motyka | Paweł Walkowiak | Bartosz Żuk | Arkadiusz Janz
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Alignment is the critical process of minimizing harmful outputs by teaching large language models (LLMs) to prefer safe, helpful and appropriate responses. While the majority of alignment research and datasets remain overwhelmingly English-centric, ensuring safety across diverse linguistic and cultural contexts requires localized resources. In this paper, we introduce PLLuM-Align, the first Polish preference dataset, created entirely through human annotation to reflect Polish language and cultural nuances. The dataset includes response rating, ranking, and multi-turn dialog data. Designed to reflect the linguistic subtleties and cultural norms of Polish, this resource lays the groundwork for more aligned Polish LLMs and contributes to the broader goal of multilingual alignment in underrepresented languages.