PL-Guard: Benchmarking Language Model Safety for Polish

Aleksandra Krasnodebska, Karolina Seweryn, Szymon Łukasik, Wojciech Kusa


Abstract
We present a benchmark dataset for evaluating language model safety in Polish, addressing the underrepresentation of medium-resource languages in existing safety assessments. Our dataset includes both original and adversarially perturbed examples. We fine-tune and evaluate multiple models—LlamaGuard-3-8B, a HerBERT-based classifier, and PLLuM—and find that the HerBERT-based model outperforms others, especially under adversarial conditions.
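The abstract mentions fine-tuning a HerBERT-based classifier for safety detection. The sketch below illustrates, under stated assumptions, how such a classifier could be fine-tuned with Hugging Face Transformers; the checkpoint name ("allegro/herbert-base-cased"), the binary safe/unsafe label set, the hyperparameters, and the toy examples are illustrative assumptions, not the authors' actual setup or data.

```python
# Minimal sketch of fine-tuning a HerBERT-based safety classifier with
# Hugging Face Transformers. This is not the paper's code: checkpoint name,
# label scheme, hyperparameters, and data are assumptions for illustration.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "allegro/herbert-base-cased"  # public HerBERT checkpoint (assumed)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,  # assumed binary safe/unsafe scheme
    id2label={0: "safe", 1: "unsafe"},
    label2id={"safe": 0, "unsafe": 1},
)

# Toy placeholder examples; the real benchmark contains original and
# adversarially perturbed Polish texts with safety annotations.
train_data = Dataset.from_dict({
    "text": ["Jak ugotować pierogi?", "Jak zbudować niebezpieczne urządzenie?"],
    "label": [0, 1],
})

def tokenize(batch):
    # Tokenize raw Polish text into fixed-length input IDs for the classifier.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="herbert-safety",
        per_device_train_batch_size=16,
        num_train_epochs=3,
        learning_rate=2e-5,
    ),
    train_dataset=train_data,
)
trainer.train()
```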
Anthology ID: 2025.bsnlp-1.4
Volume: Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Jakub Piskorski, Pavel Přibáň, Preslav Nakov, Roman Yangarber, Michal Marcinczuk
Venues: BSNLP | WS
Publisher: Association for Computational Linguistics
Pages: 25–37
URL: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.bsnlp-1.4/
Cite (ACL): Aleksandra Krasnodebska, Karolina Seweryn, Szymon Łukasik, and Wojciech Kusa. 2025. PL-Guard: Benchmarking Language Model Safety for Polish. In Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025), pages 25–37, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): PL-Guard: Benchmarking Language Model Safety for Polish (Krasnodebska et al., BSNLP 2025)
PDF: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.bsnlp-1.4.pdf