PL-Guard: Benchmarking Language Model Safety for Polish

Aleksandra Krasnodebska, Karolina Seweryn, Szymon Łukasik, Wojciech Kusa


Abstract
We present a benchmark dataset for evaluating language model safety in Polish, addressing the underrepresentation of medium-resource languages in existing safety assessments. Our dataset includes both original and adversarially perturbed examples. We fine-tune and evaluate multiple models—LlamaGuard-3-8B, a HerBERT-based classifier, and PLLuM—and find that the HerBERT-based model outperforms others, especially under adversarial conditions.
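The abstract mentions fine-tuning a HerBERT-based classifier for safety detection. The sketch below illustrates, under stated assumptions, how such a classifier could be fine-tuned with Hugging Face Transformers; the checkpoint name ("allegro/herbert-base-cased"), the binary safe/unsafe label set, the hyperparameters, and the toy examples are illustrative assumptions, not the authors' actual setup or data.

```python
# Minimal sketch of fine-tuning a HerBERT-based safety classifier with
# Hugging Face Transformers. This is not the paper's code: checkpoint name,
# label scheme, hyperparameters, and data are assumptions for illustration.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "allegro/herbert-base-cased"  # public HerBERT checkpoint (assumed)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,  # assumed binary safe/unsafe scheme
    id2label={0: "safe", 1: "unsafe"},
    label2id={"safe": 0, "unsafe": 1},
)

# Toy placeholder examples; the real benchmark contains original and
# adversarially perturbed Polish texts with safety annotations.
train_data = Dataset.from_dict({
    "text": ["Jak ugotować pierogi?", "Jak zbudować niebezpieczne urządzenie?"],
    "label": [0, 1],
})

def tokenize(batch):
    # Tokenize raw Polish text into fixed-length input IDs for the classifier.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="herbert-safety",
        per_device_train_batch_size=16,
        num_train_epochs=3,
        learning_rate=2e-5,
    ),
    train_dataset=train_data,
)
trainer.train()
```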
Anthology ID: 2025.bsnlp-1.4
Volume: Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Jakub Piskorski, Pavel Přibáň, Preslav Nakov, Roman Yangarber, Michal Marcinczuk
Venues: BSNLP | WS
Publisher: Association for Computational Linguistics
Pages: 25–37
URL: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.bsnlp-1.4/
Cite (ACL): Aleksandra Krasnodebska, Karolina Seweryn, Szymon Łukasik, and Wojciech Kusa. 2025. PL-Guard: Benchmarking Language Model Safety for Polish. In Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025), pages 25–37, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): PL-Guard: Benchmarking Language Model Safety for Polish (Krasnodebska et al., BSNLP 2025)
PDF: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.bsnlp-1.4.pdf