Nathalie Carmen Hau Norman

2026

DaLA: Danish Linguistic Acceptability Evaluation Guided by Real World Errors
Gianluca Barmina | Nathalie Carmen Hau Norman | Peter Schneider-Kamp | Lukas Galke Poech
Proceedings of the Fifteenth Language Resources and Evaluation Conference

We present an enhanced benchmark for evaluating linguistic acceptability in Danish. We first analyze the most common errors found in written Danish. Based on this analysis, we introduce a set of fourteen corruption functions that generate incorrect sentences by systematically introducing errors into existing correct Danish sentences. To ensure the accuracy of these corruptions, we assess their validity using both manual and automatic methods. The results are then used as a benchmark for evaluating Large Language Models on a linguistic acceptability judgement task. Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art. By incorporating a greater variety of corruption types, our benchmark provides a more rigorous assessment of linguistic acceptability and increases task difficulty, as evidenced by the lower performance of LLMs on our benchmark compared to existing ones. Our results also suggest that our benchmark has higher discriminatory power, which makes it possible to better distinguish well-performing models from low-performing ones.
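As a rough illustration of what a corruption function of this kind might look like, the Python sketch below swaps one word for a commonly confused counterpart. The function name `corrupt_confusion_pair` and the contents of `CONFUSION_PAIRS` are assumptions made for this example; they are not taken from the paper, which does not list its fourteen functions in the abstract.

```python
import re
from typing import Optional

# Hypothetical confusion pairs chosen for illustration only;
# the paper's actual corruption inventory is not given in the abstract.
CONFUSION_PAIRS = {
    "nogen": "nogle",
    "nogle": "nogen",
    "ligger": "lægger",
    "lægger": "ligger",
}

# Match any listed word as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CONFUSION_PAIRS)) + r")\b",
    flags=re.IGNORECASE,
)


def corrupt_confusion_pair(sentence: str) -> Optional[str]:
    """Swap the first word that has a confusion partner.

    Returns None when the sentence contains no such word, so the caller
    can skip sentences that this corruption does not apply to.
    """

    def swap(match: "re.Match[str]") -> str:
        word = match.group(0)
        replacement = CONFUSION_PAIRS[word.lower()]
        # Preserve sentence-initial capitalisation.
        return replacement.capitalize() if word[0].isupper() else replacement

    corrupted, n_swaps = _PATTERN.subn(swap, sentence, count=1)
    return corrupted if n_swaps else None
```

Returning `None` for inapplicable sentences is one possible design choice: it keeps each correct sentence paired with at most one corrupted variant per function and avoids emitting unchanged sentences labelled as incorrect.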

DAMETA: An LLM Benchmark for Danish Metaphor Interpretation with Systematically Varied Distractors
Nina Skovgaard Schneidermann | Sanni Nimb | Nathalie Carmen Hau Norman | Sussi Olsen | Bolette Pedersen
Proceedings of the Fifteenth Language Resources and Evaluation Conference

We present DAMETA, the first evaluation benchmark for Danish metaphor interpretation in language models, derived from the following sources: an annotated corpus (the Dafig Corpus), the Danish dictionary (DDO) and culture reviews in Danish newspapers. Each of the 900 data instances contains a sentence with a metaphorical target word and four human-created paraphrase options: one correct interpretation and three systematic errors or distractors: i) a false literal paraphrase (typically concrete), ii) a false figurative paraphrase (typically abstract), and iii) a false contradictory paraphrase. The benchmark is tested on seven language models, and 5% of the data is further tested on humans for comparison. Results show, among other things, that when informed in the prompt that the target word is a metaphor, the models tend to be most distracted by the false figurative paraphrase; in contrast, when uninformed about the metaphorical setting, the models are more distracted by the false literal paraphrase. The dataset goes beyond the standard by incorporating descriptive metadata regarding metaphor conventionality on a three-level scale (lexicalised, implicit, and ad-hoc), alongside a range of dictionary-derived source domains (military, gastronomy, health, meteorology, etc.). These metadata enable deeper analysis and potentially innovative insights into model performance regarding creativity, language change, and culture-sensitivity.
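The instance layout described in this abstract could be represented roughly as follows. This is a minimal Python sketch: the class name `MetaphorItem`, its field names, the label strings, and the helper `distractor_profile` are all assumptions made for illustration and do not reflect the dataset's actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical labels for the four paraphrase options named in the abstract.
LABELS = ("correct", "false_literal", "false_figurative", "false_contradictory")


@dataclass
class MetaphorItem:
    """One benchmark instance (field names are illustrative assumptions)."""

    sentence: str             # sentence containing the metaphorical target word
    target_word: str          # the metaphorical target word
    options: Dict[str, str]   # label -> paraphrase text, one entry per label
    conventionality: str      # "lexicalised", "implicit", or "ad-hoc"
    source_domain: str        # e.g. "military", "gastronomy"

    def __post_init__(self) -> None:
        # Every instance must carry exactly the four expected options.
        if set(self.options) != set(LABELS):
            raise ValueError(f"options must have exactly the keys {LABELS}")


def distractor_profile(chosen_labels: Iterable[str]) -> Dict[str, float]:
    """Return the share of model choices that fell on each option label."""
    counts = Counter(chosen_labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = sum(counts.values())
    return {label: (counts[label] / total if total else 0.0) for label in LABELS}
```

Comparing the output of `distractor_profile` between a prompt that states the target word is a metaphor and one that does not would show which distractor type attracts the model in each setting, which is the kind of comparison the abstract reports.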

Potentially idiomatic expressions (PIEs) carry meanings inherently tied to the everyday experience of a given language community. As such, they constitute an interesting challenge for assessing the linguistic (and to some extent cultural) capabilities of NLP systems. In this paper, we present XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions. The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects. This parallel dataset allows evaluation of language model performance for a given PIE in different languages and of whether idiomatic understanding in one language can be transferred to another. Moreover, the dataset supports the study of PIEs across textual and visual modalities, to measure to what extent PIE understanding in one modality transfers to or implies understanding in another modality (text vs. image). The data was created by language experts, with both textual and visual components crafted under multilingual guidelines, and each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including semantically related and random distractors. The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.
