Marianne Schaaphok

2026

CrowS-Pairs-NL: A Benchmark to Evaluate Dutch Stereotype Bias in LLMs
Jens van der Weide | Dong Nguyen | Marianne Schaaphok | Roos M. Bakker
Proceedings of the 1st Workshop on Stereotypes Across Cultures in Language Technologies (StereACuLT 2026)

Bias benchmarks for LLMs largely focus on English, overlooking language- and culture-specific stereotypes. We introduce CrowS-Pairs-NL, a Dutch stereotype benchmark built by filtering, translating, and adapting the English CrowS-Pairs dataset to address known conceptual pitfalls, and extending it with newly crowdsourced Dutch sentence pairs. We evaluate six multilingual and Dutch-trained models using both a pseudo-log-likelihood metric adapted for autoregressive models and a prompt-based metric with three template variants. Models explicitly trained on Dutch data consistently exhibit higher stereotyping scores, suggesting that language-specific fine-tuning introduces language-specific bias. The two metrics broadly agree on model rankings but differ in sensitivity, with the prompt metric showing a narrower range of scores. Our benchmark and findings underscore the need for culturally grounded bias evaluation beyond English.
