Doğukan Arslan

2026

Idiomatic expressions present a unique chal-lenge in NLP, as their meanings are often notdirectly inferable from their constituent words.Despite recent advancements in large languagemodels, idiomaticity remains a significant ob-stacle to robust semantic representation. Wepresent datasets and task results for MWE-2026 Shared Task 2: Advancing MultimodalIdiomaticity Representation 2 (AdMIRe 2),which challenges the community to assess andimprove models’ ability to interpret idiomaticexpressions in multimodal contexts across mul-tiple languages. Participants competed in animage ranking task in which, for each item,systems receive a context sentence containinga potentially idiomatic expression (PIE) andfive candidate images. Participating systemsare required to predict the sentence type (i.e.,idiomatic vs. literal) for the given context andrank the images by how well they depict the in-tended meaning in that context. Among the par-ticipating systems the most effective methodsinclude pipelines utilizing closed-source com-mercial models such as Gemini 2.5 and GPT-5, and employing chain-of-thought reasoningstrategies. Methods to mitigate language mod-els’ bias towards literal interpretations and en-sembles to smooth out variance were common.

pdf bib abs

Potentially idiomatic expressions (PIEs) carry meanings inherently tied to the everyday experience of a given language community. As such, they constitute an interesting challenge for assessing the linguistic (and to some extent cultural) capabilities of NLP systems. In this paper, we present XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions. The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects. This parallel dataset allows evaluation of language model performance for a given PIE in different languages and whether idiomatic understanding in one language can be transferred to another. Moreover, the dataset supports the study of PIEs across textual and visual modalities, to measure to what extent PIE understanding in one modality transfers or implies in understanding in another modality (text vs. image). The data was created by language experts, with both textual and visual components crafted under multilingual guidelines, and each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including semantically related and random distractors. The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.

2025

pdf bib abs

Using LLMs to Advance Idiom Corpus Construction
Doğukan Arslan | Hüseyin Anıl Çakmak | Gülşen Eryiğit | Joakim Nivre
Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025)

Idiom corpora typically include both idiomatic and literal examples of potentially idiomatic expressions, but creating such corpora traditionally requires substantial expert effort and cost. In this article, we explore the use of large language models (LLMs) to generate synthetic idiom corpora as a more time- and cost-efficient alternative. We evaluate the effectiveness of synthetic data in training task-specific models and testing GPT-4 in few-shot prompting setting using synthetic data for idiomaticity detection. Our findings reveal that although models trained on synthetic data perform worse than those trained on human-generated data, synthetic data generation offers considerable advantages in terms of cost and time. Specifically, task-specific idiomaticity detection models trained on synthetic data outperform the general-purpose LLM that generated the data when evaluated in a zero-shot setting, achieving an average improvement of 11 percentage points across four languages. Moreover, synthetic data enhances the LLM’s performance, enabling it to match the task-specific models trained with synthetic data when few-shot prompting is applied.

Doğukan Arslan

2026

2025

Co-authors

Venues