Karolina Seweryn
2025
PL-Guard: Benchmarking Language Model Safety for Polish
Aleksandra Krasnodebska | Karolina Seweryn | Szymon Łukasik | Wojciech Kusa
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)
We present a benchmark dataset for evaluating language model safety in Polish, addressing the underrepresentation of medium-resource languages in existing safety assessments. Our dataset includes both original and adversarially perturbed examples. We fine-tune and evaluate multiple models (LlamaGuard-3-8B, a HerBERT-based classifier, and PLLuM) and find that the HerBERT-based model outperforms the others, especially under adversarial conditions.