Aleksandra Krasnodębska
Also published as: Aleksandra Krasnodebska
2025
PL-Guard: Benchmarking Language Model Safety for Polish
Aleksandra Krasnodebska | Karolina Seweryn | Szymon Łukasik | Wojciech Kusa
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)
We present a benchmark dataset for evaluating language model safety in Polish, addressing the underrepresentation of medium-resource languages in existing safety assessments. Our dataset includes both original and adversarially perturbed examples. We fine-tune and evaluate multiple models (LlamaGuard-3-8B, a HerBERT-based classifier, and PLLuM) and find that the HerBERT-based model outperforms the others, especially under adversarial conditions.
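For context, below is a minimal sketch of how a HerBERT-based safety classifier of this kind might be fine-tuned with Hugging Face Transformers. The checkpoint "allegro/herbert-base-cased", the binary safe/unsafe label scheme, and the toy training examples are illustrative assumptions, not the paper's actual data or pipeline.

# Sketch: fine-tuning a HerBERT-based binary safety classifier.
# Assumptions (not from the paper): the public "allegro/herbert-base-cased"
# checkpoint, a 0 = safe / 1 = unsafe label scheme, and placeholder examples.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allegro/herbert-base-cased",
    num_labels=2,  # assumed binary safe/unsafe labels
)

# Toy stand-ins; the benchmark itself contains original and
# adversarially perturbed Polish prompts.
train_data = Dataset.from_dict({
    "text": ["Przykładowy bezpieczny tekst.", "Przykładowy szkodliwy tekst."],
    "label": [0, 1],
})

def tokenize(batch):
    # Tokenize Polish text to fixed-length inputs for the classifier.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="herbert-safety",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_data,
)
trainer.train()

The same Trainer setup would accept an evaluation split of adversarially perturbed examples, which is where the abstract reports the HerBERT-based model holding up best.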
Rainbow-Teaming for the Polish Language: A Reproducibility Study
Aleksandra Krasnodębska | Maciej Chrabaszcz | Wojciech Kusa
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
The development of multilingual large language models (LLMs) presents challenges in evaluating their safety across all supported languages. Enhancing safety in one language (e.g., English) may inadvertently introduce vulnerabilities in others. To address this issue, we implement a methodology for the automatic creation of red-teaming datasets for safety evaluation in the Polish language. Our approach generates both harmful and non-harmful prompts by sampling different risk categories and attack styles. We test several open-source models, including those trained on Polish data, and evaluate them using metrics such as Attack Success Rate (ASR) and False Reject Rate (FRR). The results reveal clear gaps in safety performance between models and underscore the need for more thorough cross-lingual safety testing.
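As a gloss on the two metrics named above (the standard definitions as commonly used in red-teaming work, not necessarily the paper's exact formulation):

\[
\mathrm{ASR} = \frac{\#\{\text{harmful prompts that elicit an unsafe response}\}}{\#\{\text{harmful prompts}\}},
\qquad
\mathrm{FRR} = \frac{\#\{\text{benign prompts the model refuses}\}}{\#\{\text{benign prompts}\}}.
\]

A lower ASR indicates a safer model, while a lower FRR indicates fewer over-refusals on benign prompts; a good safety-tuned model should minimize both.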