Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)

Atul Kr. Ojha, Verginica Barbu Mititelu, Mathieu Constant, Ivelina Stoyanova, A. Seza Doğruöz, Alexandre Rademaker (Editors)


Anthology ID:
2026.mwe-1
Month:
March
Year:
2026
Address:
Rabat, Morocco
Venues:
MWE | WS
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/ingest-eacl/2026.mwe-1/
ISBN:
979-8-89176-363-0
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.mwe-1.pdf

Noun compounds are generally considered an open challenge for NLP systems, given the difficulty of interpreting the implicit semantic relation between modifier and head, although the advent of Large Language Models (LLMs) has recently led to remarkable performance leaps. However, most evaluations have been carried out on English benchmarks. In our work, we test LLMs on compound semantics understanding in Chinese, adopting two different evaluation scenarios: an extrinsic evaluation in a Natural Language Inference task, and an intrinsic evaluation in which models are directly asked to predict the semantic relation linking the two constituents. Our results show that the bigger and more recent LLMs are able to surpass supervised baselines in the inference task, especially when tested under the few-shot setting. In the more challenging task of selecting the correct interpretation of the compounds out of a fine-grained typology of semantic relations between head and modifier, the best Chinese LLM (Qwen-plus) manages to select the correct option in about one third of the cases.
Figures of Speech (FOS) consist of multi-word phrases that are deeply intertwined with culture. While Neural Machine Translation (NMT) performs relatively well with the figurative expressions of high-resource languages, it often faces challenges when dealing with low-resource languages like Sinhala due to limited available data. To address this limitation, we introduce a corpus of 2,344 Sinhala figures of speech with cultural and cross-lingual annotations. We examine this dataset to classify the cultural origins of the figures of speech and to identify their cross-lingual equivalents. Additionally, we have developed a binary classifier to differentiate between two types of FOS in the dataset, achieving an accuracy rate of approximately 92%. We also evaluate the performance of existing LLMs on this dataset. Our findings reveal significant shortcomings in the current capabilities of LLMs, as these models often struggle to accurately convey idiomatic meanings. By making this dataset publicly available, we offer a crucial benchmark for future research in low-resource NLP and culturally aware machine translation.
We present the annotation of Swedish multiword expressions under the PARSEME annotation scheme, including a new release and a historical overview of previous releases. We provide an overview of the evolution of the Swedish datasets and of inter-annotator agreement. We discuss general guidelines and the development of Swedish-specific guidelines for particle verbs and multiword tokens, as well as additional challenges for the Swedish annotation. We also conduct an initial comparison of Swedish and other Germanic languages, identifying aspects where the PARSEME guidelines require revision to ensure better consistency across languages.
This paper presents the development of a corpus of annotated multiword expressions (MWEs) for Ukrainian. The resource covers four major categories of MWEs: verbal, nominal, adjectival/adverbial, and functional. We describe the methodology used for data selection, the annotation scheme, and the procedures employed during annotation. In addition, the paper discusses some specific types of MWE constructions, illustrating their usage with numerous examples and addressing complex and borderline cases. The resulting corpus is an important resource for linguistic studies and NLP tasks involving MWEs, and is publicly accessible at https://gitlab.com/parseme/sharedtask-data/-/tree/master/2.0?ref_type=heads.
This study investigates whether eye-tracking measures predict if a word is the final token of a multi-word expression (MWE), focusing on two understudied MWE types: fixed expressions (e.g., due to) and phrasal verbs (e.g., turn out). Using mixed-effects logistic regression, we compared tokens in MWE contexts with the same tokens in non-MWE contexts. Results reveal a clear difference in processing. For fixed expressions, reading-time measures significantly predict MWEhood. In contrast, phrasal verbs show no consistent predictive effects. Additionally, we compared the reading-time models to models that included GPT-2 surprisal as a predictor. While surprisal does predict MWEhood, it fails to capture the distinction between types. These findings highlight the need to consider MWE typology in models of formulaic language processing.
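The surprisal predictor mentioned in the abstract above is simply the negative log probability a language model assigns to a token given its left context; a minimal sketch of the definition (the probabilities here are toy values, not actual GPT-2 outputs):

```python
import math

def surprisal(p_token):
    """Surprisal in bits: -log2 of the model's probability for a token
    given its left context. Unexpected tokens score high."""
    return -math.log2(p_token)

# A token the model expects (p = 0.5) carries 1 bit of surprisal;
# a surprising one (p = 0.015625 = 2**-6) carries 6 bits.
print(surprisal(0.5), surprisal(0.015625))
```

In a study like the one above, per-token surprisal values would be entered as a fixed effect in the regression alongside the reading-time measures.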
In recent years, language models, both encoder-only and generative, have been applied to a variety of downstream NLP tasks, including sequence labeling tasks like automatic multi-word expression identification (MWEI). Multiple studies show that, in general, fine-tuned encoder-only models like BERT tend to outperform pretrained generative LLMs on downstream tasks (Arzideh et al., 2025; Ochoa et al., 2025; Bucher and Martini, 2024; Sebok et al., 2025). However, such comparisons are sparse for MWEI, in particular for French, in part due to the lack of comprehensive gold-standard datasets. In this study, we address this research gap by comparing CamemBERT with gpt-oss and Qwen3 for MWEI, using the French subcorpus of the newly released PARSEME dataset. CamemBERT outperforms both LLMs by large margins in precision, recall, and F1. We complement this numerical evaluation with a qualitative analysis of prediction errors.
IT systems generate log messages containing important information about the system’s health. To gather information about system entities, we extract technical terms and proper nouns as multi-word expressions (MWEs) from a wide range of log messages from 16 different real systems. We apply Gries’ information-theoretic approach, which iteratively calculates the best MWE candidates using an eight-dimensional ranking method. These candidates are evaluated in an annotation study, achieving a precision of 66%. This value is significantly higher than evaluations on general-purpose texts, demonstrating the higher occurrence of compound technical terms and proper nouns in log messages. The MWEs found can be used to reduce the number of nodes in a system behavior graph while increasing the information density of the nodes.
This paper presents an enhanced version of the Romanian corpus previously annotated only for verbal multiword expressions. The new release extends the annotation to multiword expressions of other parts of speech, following version 2.0 of the PARSEME guidelines. The corpus has been expanded; its new part was automatically morpho-syntactically annotated based on the Universal Dependencies framework, followed by extensive semi-automatic annotation of multiword expressions across all morphological categories. The paper also reports quantitative data on the updated corpus and discusses the distribution and characteristics of Romanian multiword expressions. We also highlight language-specific annotation challenges and issues arising from the PARSEME 2.0 guidelines.
This paper explores the behavior of neural machine translation models on two newly introduced datasets containing noun-adjective MWEs with different degrees of semantic ambiguity and compositionality. We compare general-domain machine translation systems with fine-tuned models exposed to small subsets of the target MWEs. By assessing the effects of the learning steps and corpus size, we found that carefully designed fine-tuning may improve MWE handling while mitigating catastrophic forgetting. However, our error analysis reveals that models still struggle in several scenarios, particularly when translating MWEs with idiomatic meanings. Both the datasets and the experiments focus on translation involving Galician, English, and Spanish.
Multiword Expressions (MWEs) are pervasive in scientific writing, and in specialized domains they include both multiword terminology (e.g., noun compounds) and recurrent academic phrasing. This study profiles MWEs in a large corpus of bioinformatics research articles segmented by IMRaD sections. Building on recent multi-method approaches to scientific MWE identification, we extract MWEs using complementary automated strategies (semantic matching, dependency parsing, controlled vocabularies, and academic formula lists) and compare the resulting inventories by size, form, and IMRaD section distribution. We further quantify cross-document dispersion using document frequency and Gries’ DP to distinguish widely reused expressions from items concentrated in a small subset of articles. Results show that bioinformatics MWEs are predominantly short and nominal, but that extraction methods differ in the extent to which they recover discourse and reporting phraseology. Dispersion is strongly long-tailed across sections with most MWEs being document-specific, while a smaller recurrent core aligns with section function and is enriched for conventional templates and standardized multiword terms. Overall, the findings argue for combining complementary identification methods with dispersion profiling to characterize domain "multiwordness" in a principled and section-sensitive way.
Multiword expressions are an important area of study in linguistics and natural language processing, as they represent combinations of words that function as a single unit and display properties that cannot be predicted fully from their individual components. This paper describes annotated corpora of about 3000 multiword expressions across syntactic categories in Marathi. This is the first exhaustive resource for Marathi that includes both verbal and non-verbal multiwords. In order to develop the guidelines for annotation, we have used the existing literature on the identification and classification of these expressions. Following the PARSEME 2.0 guidelines, we discuss the categories of multiwords and their behaviour in the corpus. Throughout the annotation process, we encounter variability in compositionality and syntactic realization and discuss our design decisions during annotation. Such a dataset will further our understanding of how grammatical structure can be integrated with lexically stored multiword units in Marathi.
Despite recent significant advances, idioms, like other forms of figurative language, present a challenge to natural language processing (NLP). Benchmark corpora are essential for improving current models on understanding idioms. However, such corpora are only available for a limited set of languages. In this paper, we introduce our ongoing work on a benchmark corpus of Turkish idioms. Our corpus is structured for testing both idiom recognition and idiom understanding. The corpus currently consists of 200 instances with sentences including idiomatic use, their literal paraphrases, similar sentences with no entailment, and non-idiomatic use of the idiomatic expressions when possible. We describe the methodology used to create the corpus, as well as initial experiments with a selection of LLMs.
Multiword expressions (MWEs) are good examples of a phenomenon where identification systems struggle with generalisation: MWEs present in the test set but absent in the training set are rarely identified. This raises the question of the diversity of the test set, relative to that of the training set, and how this impacts performance. We set out to measure how much the diversity of a training corpus increases when adding individual MWEs from the test corpus, and how this increase impacts MWE identification performance. We measure diversity across a three-dimension framework and find mostly consistent negative correlations with performance in 14 languages and 8 systems.
Multiword expressions (MWEs) have been widely studied in cross-lingual annotation frameworks such as PARSEME. However, Korean MWEs remain underrepresented in these efforts. In particular, Korean multiword adpositions lack systematic analysis, annotated resources, and integration into existing frameworks. In this paper, we present a study of Korean functional multiword expressions: postpositional verb-based constructions (PVCs). Using data from Korean Wikipedia, we survey and analyze several PVC expressions and contrast them with non-MWEs of similar structure. Building on this analysis, we propose annotation guidelines designed to support future work in Korean multiword adpositions and facilitate alignment with cross-lingual frameworks.
Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings. We introduce PolyFrame, our system for the MWE-2026 AdMIRe 2 shared task on multimodal idiom disambiguation, featuring a unified pipeline for both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). All model variants retain frozen CLIP-style vision–language encoders and the multilingual BGE M3 encoder, training only lightweight modules: a logistic regression and LLM-based sentence-type predictor, idiom synonym substitution, distractor-aware scoring, and Borda rank fusion. Starting from a CLIP baseline (26.7% Top-1 on English dev, 6.7% on English test), adding idiom-aware paraphrasing and explicit sentence-type classification increased performance to 60.0% Top-1 on English, and 60.0% Top-1 (0.822 NDCG@5) in zero-shot transfer to Portuguese. On the multilingual blind test, our systems achieved average Top-1/NDCG scores of 0.35/0.73 for Subtask A and 0.32/0.71 for Subtask B across 15 languages. Ablation results highlight idiom-aware rewriting as the main contributor to performance, while sentence-type prediction and multimodal fusion enhance robustness. These findings suggest that effective idiom disambiguation is feasible without fine-tuning large multimodal encoders.
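The Borda rank fusion step named in the abstract above can be sketched with the textbook scoring rule (a generic illustration, not the authors' implementation):

```python
def borda_fuse(rankings):
    """Combine several rankings of the same candidates.

    Each ranking is a list of candidate ids, best first. A candidate
    in position p of an n-item ranking earns n - 1 - p points; the
    fused ranking sorts candidates by total points, highest first.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, cand in enumerate(ranking):
            scores[cand] = scores.get(cand, 0) + (n - 1 - pos)
    return sorted(scores, key=lambda c: -scores[c])

# Three modules rank five images; image "c" wins on aggregate points.
print(borda_fuse([
    ["c", "a", "b", "d", "e"],
    ["a", "c", "b", "e", "d"],
    ["c", "b", "a", "d", "e"],
]))
```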
This paper describes the system submitted for the MWE 2026 Shared Task (AdMIRe 2.0 Subtask A). The submission focused on a text-centric approach, reframing the idiom-image alignment task as a sentence-pair classification problem using mBERT (Multilingual BERT). The submitted system relied on full fine-tuning using only the English training data, achieving a Top-1 Accuracy of approximately 0.30 on the blind test set. Following the evaluation phase, significant limitations were identified in the cross-lingual generalization of the base model. In a post-evaluation study, the backbone was upgraded to XLM-RoBERTa-Large-XNLI, incorporating Low-Rank Adaptation (LoRA) and utilizing the full multilingual dataset with hard negative mining. These improvements boosted the accuracy to 0.41, demonstrating the necessity of NLI-specific pre-training and parameter-efficient tuning for MWE-aware multimodal tasks.
This paper presents the system developed by team alexandru412 for the AdMIRe 2.0 Shared Task. We participated in the Text-Only track, ranking images based on idiomatic usage without accessing pixel data. Our approach combines a strict list-wise ranking strategy with systematic test-time augmentation. We fine-tuned a Large Language Model (LLM) on English and Portuguese data and relied on zero-shot transfer for other languages. Our system achieved 3rd place in the Text-Only track.
This paper describes a multilingual system for automatic multiword expression identification for PARSEME 2.0 Subtask 1. We formulate MWE identification as a token-level sequence labeling problem using a BIO tagging scheme and fine-tune XLM-RoBERTa-base on PARSEME 2.0. We mainly investigate cross-lingual interactions on language pairs, and test hypotheses whether using a given language pair for training improves MWE detection performance on both or one of the languages. Then, we apply selected successful language pairs on PARSEME 2.0 MWE Identification task. Experiments are conducted independently for a subset of the languages given in PARSEME 2.0, for a total of 8 languages. Our approach achieves strong token-based and span-based F1 scores across diverse languages, and we observe that training with even distant language pairs may result in improvement on at least one of the languages. We publish our code at https://github.com/ahmeterdem1/parseme-blg505
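The BIO tagging scheme referred to above maps MWE span annotations onto per-token labels; a minimal sketch (the single B-MWE/I-MWE label pair and the span format are simplifying assumptions, not the PARSEME .cupt format):

```python
def spans_to_bio(n_tokens, spans):
    """Map MWE spans (lists of 0-based token indices) to BIO tags.

    The first token of each span is tagged B-MWE, the rest I-MWE;
    tokens outside any span are O. Overlapping spans are not handled
    in this simplified version, but discontinuous spans are.
    """
    tags = ["O"] * n_tokens
    for span in spans:
        first, *rest = sorted(span)
        tags[first] = "B-MWE"
        for i in rest:
            tags[i] = "I-MWE"
    return tags

# "He gave up the fight": the phrasal verb spans tokens 1-2.
print(spans_to_bio(5, [[1, 2]]))
```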
We address AdMIRe 2.0, a static image ranking task where a sentence containing a potentially idiomatic expression is paired with five image–caption candidates, and the goal is to rank the candidates by semantic compatibility with the intended idiomatic or literal meaning. We propose IMMCAN, which keeps XLM-R and Jina-CLIP-v2 frozen and learns a lightweight two-stage cross-attention fusion, caption–image grounding followed by idiom-to-multimodal conditioning, to predict a compatibility score per candidate. We also evaluate caption-only augmentation via back-translation and synonym substitution, and compare regression and rank-class formulations. On AdMIRe 1.0, text-only achieves higher test top-image accuracy than VLM-grounded modeling. In contrast, on AdMIRe 2.0 zero-shot, adding visual patch grounding improves both accuracy and NDCG, indicating better cross-lingual ranking transfer.
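NDCG, reported in several of the systems above, is the standard graded-relevance ranking metric; a minimal sketch of NDCG@k (the shared task's exact relevance grading is an assumption here):

```python
import math

def ndcg_at_k(relevances, k=5):
    """relevances: graded relevance of each candidate, listed in the
    order the system ranked them (index 0 = top-ranked). DCG discounts
    gains logarithmically by rank; NDCG normalizes by the ideal DCG."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; swapping the top two items lowers it.
print(ndcg_at_k([4, 3, 2, 1, 0]))
print(ndcg_at_k([3, 4, 2, 1, 0]))
```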
Multi-Word Expressions (MWEs) pose a significant challenge for natural language processing systems due to their idiosyncratic semantic and syntactic properties. This paper describes our system for the PARSEME 2.0 Shared Task on automatic identification of verbal MWEs across 17 typologically diverse languages. Our approach combines multilingual BERT with explicit Part-of-Speech (POS) feature injection through a dual-head architecture that jointly performs BIO-based identification and category classification. We further investigate extensions, including Conditional Random Field (CRF) decoding for structured prediction, focal loss for addressing class imbalance, and model ensembling for improving discontinuous MWE detection. Our official submission achieves a global MWE-based F1 score of 48.39%, securing second place in the shared task. Ablation studies reveal a strong synergy between POS features and CRF decoding, with the combined approach yielding the best single-model performance. Furthermore, ensembling models trained with different objectives improves both overall F1 score and discontinuous MWE scores, demonstrating the importance of training diversity for capturing non-adjacent syntactic patterns.
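The focal loss used above for class imbalance follows the standard Lin et al. formulation; a minimal numeric sketch of the per-example loss (the gamma value is illustrative):

```python
import math

def focal_loss(p_true, gamma=2.0):
    """p_true: model probability assigned to the gold class.
    The (1 - p)**gamma factor down-weights easy examples, so confident
    correct predictions contribute little to the total loss."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

# A confident correct prediction (p = 0.9) is penalized far less
# than an uncertain one (p = 0.5).
print(focal_loss(0.9) < focal_loss(0.5))
```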
Idiomatic expressions pose a fundamental challenge for multimodal understanding due to their non-compositional semantics, while pretrained vision–language models tend to over-rely on literal visual alignments. We address this issue in the context of the AdMIRe 2.0 multimodal idiomatic image ranking task by introducing CARIM (Category-Aware Reasoning for Idiomatic Multimodality), an inference-time framework that injects structured semantic reasoning without end-to-end retraining. Experiments on the official Codabench leaderboard demonstrate that CARIM achieves competitive Top-1 Accuracy and nDCG across multiple languages. Additional post-competition evaluation on the released test annotations further shows that CARIM maintains robust multilingual performance, highlighting the effectiveness of inference-time category-aware reasoning for multimodal idiomatic grounding.
Multi-word expressions (MWEs) remain a challenge for NLP systems due to their syntactic variability and non-compositional semantics, which is why the problem was proposed as a shared task within the UniDive organization. With the increasing popularity of large language models (LLMs), it is important to continue researching alternative solutions. One classical approach to identifying MWEs is calculating pointwise mutual information (PMI), but this is a purely statistical approach that cannot unveil the links between words in natural text. To address this issue, we propose a simple syntax-aware PMI method that leverages Universal Dependency (UD) trees (Nivre et al., 2016) to model co-occurrence between syntactically related words. By computing PMI over dependency-linked word pairs and aggregating these scores, we aim to improve on surface-based methods. Contrary to expectations, our experiments show that the classical statistical approach achieves better results in partially identifying MWEs. Still, this approach aims to strike a balance between lightweight computation, as opposed to LLMs, and precision of results.
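The core idea described above, PMI computed over dependency-linked word pairs rather than surface co-occurrence windows, can be sketched as follows (a toy illustration; the paper's exact aggregation over scores is not reproduced):

```python
import math
from collections import Counter

def dependency_pmi(pairs):
    """pairs: (head_lemma, dependent_lemma) tuples, one per dependency
    edge in the corpus. Returns PMI per pair type:
    PMI(h, d) = log2( p(h, d) / (p(h) * p(d)) )."""
    pair_counts = Counter(pairs)
    head_counts = Counter(h for h, _ in pairs)
    dep_counts = Counter(d for _, d in pairs)
    n = len(pairs)
    return {
        (h, d): math.log2((c / n) / ((head_counts[h] / n) * (dep_counts[d] / n)))
        for (h, d), c in pair_counts.items()
    }

# "take" and "place" co-occur consistently in these toy edges, so the
# candidate MWE "take place" gets a higher PMI than "take book".
edges = [("take", "place")] * 3 + [("take", "book"), ("read", "book")]
pmi = dependency_pmi(edges)
print(pmi[("take", "place")] > pmi[("take", "book")])
```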
Multiword expressions (MWEs), particularly idioms, pose persistent challenges for vision-language systems due to their non-compositional semantics and culturally grounded meanings. This paper presents GLIMMER, a three-stage hybrid ranking system that evaluates how well images express the intended meaning of MWEs across 15 languages. Our approach uses LLM-generated semantic glosses as multilingual meaning anchors, combined with dual-path embedding scoring (textual captions and visual features), and LLM-based semantic verification. Evaluated on the ADMIRE shared task benchmark, GLIMMER achieves competitive performance across diverse languages without relying on parallel training data or language-specific resources. The results show that using glosses to anchor meaning helps match idioms with images across languages and modalities, and that combining retrieval with reasoning is more robust than using embeddings alone.
We present IPN, our system for Subtask 1 of the PARSEME 2.0 Shared Task, which targets the identification of MWEs in 17 languages. Overall, IPN outperformed a much larger-parameter baseline model, yet a performance gap to the top-performing systems remains. To better understand these results, we investigate Qwen3-32B’s suitability for mono-, cross- and multilingual MWE identification. We also explore whether this model benefits from prepending automatically generated thinking data to the gold label during instruction-tuning. We find that target language data is vital for instruction-tuning. Prepending generated thinking data to a subset of the training data slightly improves performance for two out of three languages, but more detailed evaluation is required.
This paper describes the system submitted by Semantic Stars Team for Subtask 2 of the PARSEME 2.0 shared task (Paraphrasing Multiword Expressions). Our approach addresses the challenge of paraphrasing sentences containing MWEs such that the MWE is removed while the original meaning and grammatical structure are preserved. The paper describes multiple distinct approaches powered by open-weight Large Language Models (LLMs), each employing a combination of different techniques such as prompting, multi-agent pipelines and classical NLP methods. Four distinct methods are tested on the French test data, plus a fifth one combining the results of the first four. We tested with several different open-weight LLMs including Llama3.1:8b, Qwen3:8b and gpt-oss-120b and were able to achieve significant improvements over the baseline, securing the first place on the shared task leaderboard.
This paper describes MorphoFiltered-Gemini, a multilingual system submitted to the PARSEME 2.0 shared task on multiword expression (MWE) identification. The system relies on Google Gemini 2.0 Flash-Lite to generate MWE predictions using zero-shot and selectively applied few-shot prompting, without fine-tuning or language-specific resources. To reduce the tendency of large language models to over-generate MWEs, we introduce a lightweight morphological post-filter that removes unlikely constructions while preserving high-precision patterns. Rather than optimizing peak performance for individual languages, our approach prioritizes precision and cross-lingual robustness. As a result, the system exhibits stable behavior across 17 typologically diverse languages and achieves the highest Shannon evenness score among all submitted systems. The experimental results highlight a clear trade-off between recall-oriented LLM prompting strategies and precision-oriented filtering, and show that simple linguistic constraints can effectively improve the stability of LLM-based multilingual MWE identification systems.
This paper presents our methods for the AdMIRe 2.0 shared task, which addresses multilingual and multimodal idiom understanding. Our submission focuses on the text-only track. Specifically, we employ an ensemble of three large language models (LLMs) to directly perform the presented image ranking task. Each model independently produces a ranking of the candidate images, and we aggregate their outputs using a hard voting strategy to determine the final prediction. This ensemble learning framework leverages the complementary strengths of different LLMs, improving robustness and reducing the variance of individual model predictions.
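The hard-voting aggregation described above can be sketched position by position (one plausible reading of the setup, not the authors' code):

```python
from collections import Counter

def hard_vote(rankings):
    """rankings: one list of candidate ids per model, best first.
    Fills each output position with the majority candidate among the
    models' choices for that position, skipping already-placed ids."""
    n = len(rankings[0])
    result = []
    for pos in range(n):
        votes = Counter(r[pos] for r in rankings if r[pos] not in result)
        if not votes:
            # All candidates voted at this position are already placed;
            # fall back to the most-voted remaining candidate overall.
            votes = Counter(c for r in rankings for c in r if c not in result)
        result.append(votes.most_common(1)[0][0])
    return result

# Two of three models agree on the full ordering, so it wins.
print(hard_vote([
    ["b", "a", "c"],
    ["b", "a", "c"],
    ["a", "b", "c"],
]))
```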
This paper describes MISP (Multilingual Idiomatic Sentence Paraphrasing), a system submitted to the PARSEME 2.0 Multilingual Shared Task on Identification and Paraphrasing of Multiword Expressions (MWEs). We participated in Subtask 2 on MWE paraphrasing and developed our system based on Qwen3-4B-Instruct fine-tuned on synthetic Portuguese MWE paraphrases. We applied MISP not only to Portuguese, but also to French and Romanian, aiming to leverage cross-lingual transfer within related languages, with ours being the only submission for Portuguese. Our results indicate that MISP struggles to generate paraphrases that both rephrase and preserve the original meaning of the MWE. Additionally, instruction fine-tuning does not appear to improve performance. Overall, our findings highlight the challenges of paraphrasing MWEs, particularly in a cross-lingual setting.
This paper presents our system for the MWE-2026 ADMiRe 2.0 shared task, which aimed to advance multimodal idiomatic understanding across 15 languages. We address the task of selecting, from a set of five images, the one that best represents either the literal or idiomatic meaning of a given compound in context. Our approach follows a multi-step pipeline: a large language model (LLM) first determines whether the compound is used literally or idiomatically and generates auxiliary text, consisting of an idiomatic meaning explanation and a visual description of the literal meaning. An ensemble of three CLIP models then identifies the two images most semantically similar to the appropriate generated text via a voting mechanism. Finally, the LLM selects the best image from these two candidates.
This paper presents our system for AdMIRe 2 (Advancing Multimodal Idiomaticity Representation), a shared task on multilingual multimodal idiom understanding. The task focuses on ranking images according to how well they depict the literal or idiomatic usage of potentially idiomatic expressions (PIEs) in context, across 15 languages and two tracks: a text-only track, and a multimodal track that uses both images and captions. To tackle both tracks, we propose a hybrid zero-shot pipeline built on large vision–language models (LVLMs). Our system employs a chain-of-thought prompting scheme that first classifies each PIE usage as literal or idiomatic and then ranks candidate images by their alignment with the inferred meaning. A primary–fallback routing mechanism increases robustness to safety-filter refusals, while lightweight post-processing recovers consistent rankings from imperfect model outputs. Without any task-specific fine-tuning, our approach achieves 55.9% Top-1 Accuracy in the text-only track and 60.1% in the multimodal (text+image) track, ranking first overall on the official leaderboard. These results suggest that carefully designed zero-shot LVLM pipelines can provide strong baselines for multilingual multimodal idiomaticity benchmarks.
This paper presents our approach to the PARSEME 2.0 Shared Task on Romanian, covering both Identification (Subtask 1) and Paraphrasing (Subtask 2). While Large Language Models (LLMs) excel at semantic generation, we hypothesize that they lack the structural precision required for MWE identification, leading to "boundary hallucinations" that compromise downstream simplification. Our Rank 1 results on Romanian confirm this: a specialized encoder (RoBERT) using standard sequence labeling outperforms both few-shot LLMs and complex structural parsers (MTLB-STRUCT). This justifies our proposed pipeline: using encoders as precise “pointers” to guide the generative power of LLMs.
We describe a zero-shot system for AdMIRe 2.0, a shared task on multimodal understanding of potentially idiomatic expressions (PIEs). Given a context sentence with a PIE and five candidate images, the system predicts whether the usage is literal or idiomatic and ranks images by how well they match the intended meaning. We use closed-source large multimodal models and compare prompting pipelines from direct one-step ranking to modular multi-step pipelines that separate sense prediction, PIE-focused image semantics, and final ranking. All steps produce constrained JSON outputs to enable deterministic parsing and composition. In the official AdMIRe 2.0 evaluation on CodaBench, our best pipeline achieves an average Top-1 accuracy of 0.52 and an average nDCG score of 0.70 across the 12 languages we submitted. We obtain the best score among submitted systems in 10 of these languages.
Multiword expressions (MWEs) have been a major challenge in NLP for decades, and research on MWEs was driven notably by shared tasks, including those organized by the PARSEME community. We report the organisation and the results of edition 2.0 of the PARSEME shared task. For the first time, all syntactic categories are covered: verbal, nominal, adjectival, adverbial and functional. We rely on edition 2.0 of the PARSEME corpus, annotated for all these categories in 17 languages. We create a new dataset with paraphrases of sentences containing idioms in 14 languages, and define a new subtask dedicated to MWE paraphrasing. We extend our evaluation protocol by measuring both performance and diversity of systems, and by including manual evaluation for paraphrasing. 10 systems, including the baseline, participated in the MWE identification subtask and 5 in the paraphrasing subtask. Results are promising, but known MWE identification challenges remain unsolved. Performance correlates positively with diversity in MWE identification, and negatively in MWE paraphrasing.
Idiomatic expressions present a unique challenge in NLP, as their meanings are often not directly inferable from their constituent words. Despite recent advancements in large language models, idiomaticity remains a significant obstacle to robust semantic representation. We present datasets and task results for MWE-2026 Shared Task 2: Advancing Multimodal Idiomaticity Representation 2 (AdMIRe 2), which challenges the community to assess and improve models’ ability to interpret idiomatic expressions in multimodal contexts across multiple languages. Participants competed in an image ranking task in which, for each item, systems receive a context sentence containing a potentially idiomatic expression (PIE) and five candidate images. Participating systems are required to predict the sentence type (i.e., idiomatic vs. literal) for the given context and rank the images by how well they depict the intended meaning in that context. Among the participating systems, the most effective methods include pipelines utilizing closed-source commercial models such as Gemini 2.5 and GPT-5, and employing chain-of-thought reasoning strategies. Methods to mitigate language models’ bias towards literal interpretations and ensembles to smooth out variance were common.