Isabell Stinessen Haugen

2026

Potentially idiomatic expressions (PIEs) carry meanings inherently tied to the everyday experience of a given language community. As such, they constitute an interesting challenge for assessing the linguistic (and to some extent cultural) capabilities of NLP systems. In this paper, we present XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions. The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects. This parallel dataset allows evaluation of language model performance for a given PIE in different languages, and of whether idiomatic understanding in one language can be transferred to another. Moreover, the dataset supports the study of PIEs across textual and visual modalities, to measure to what extent PIE understanding in one modality (text or image) transfers to, or implies, understanding in the other. The data was created by language experts, with both textual and visual components crafted under multilingual guidelines. Each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including semantically related and random distractors. The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.


TryggLLM: A Benchmark for Evaluating LLM Safety in Norwegian
Samia Touileb | Truls Pedersen | Isabell Stinessen Haugen
Proceedings of the Fifteenth Language Resources and Evaluation Conference

We introduce TryggLLM, the first safety benchmark dataset for Norwegian. The dataset is intended for benchmarking different types of safety issues that can occur when using Norwegian generative language models. We have manually translated two English benchmark datasets, while modifying the content to be aligned with the Norwegian context. The benchmark dataset is composed of two sub-parts: i) prompts annotated by four native speakers, in both written variants of Norwegian, Bokmål (BM) and Nynorsk (NN), such that each native speaker wrote in their preferred variant (two BM and two NN); ii) prompts and target responses, each of which has a BM and an NN version. We provide detailed descriptions of the data creation process. We also present a thorough manual evaluation of existing open Norwegian LLMs benchmarked with TryggLLM. Our results show that between 18% and 48% of the generated responses are unsafe, across all tested models.
