Maggie Mi
2026
Stands to Reason: Investigating the Effect of Reasoning on Idiomaticity Detection
Dylan Phelps | Rodrigo Wilkens | Edward Gow-Smith | Thomas M. R. Pickard | Maggie Mi | Marco Idiart | Aline Villavicencio
Proceedings of the Fifteenth Language Resources and Evaluation Conference
The recent trend towards utilisation of reasoning models has improved the performance of Large Language Models (LLMs) across many tasks which involve logical steps. One linguistic task that could benefit from this framing is idiomaticity detection, as a potentially idiomatic expression must first be understood in relation to the context before it can be disambiguated. In this paper, we explore how reasoning capabilities in LLMs affect idiomaticity detection performance and examine the effect of model size. We evaluate, as open source representative models, the suite of DeepSeek-R1 distillation models ranging from 1.5B to 70B parameters across four idiomaticity detection datasets. We find the effect of reasoning to be smaller and more varied than expected. For smaller models, producing chain-of-thought (CoT) reasoning increases performance from Math-tuned intermediate models, but not to the levels of the base models, whereas larger models (14B, 32B, and 70B) show modest improvements. Our in-depth analyses reveal that larger models demonstrate good understanding of idiomaticity, successfully producing accurate definitions of expressions, while smaller models often fail to output the actual meaning. For this reason, we also experiment with providing definitions in the prompts of smaller models, which we show can improve performance in some cases.
2025
Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context
Maggie Mi | Aline Villavicencio | Nafise Sadat Moosavi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Human processing of idioms heavily depends on interpreting the surrounding context in which they appear. While large language models (LLMs) have achieved impressive performance on idiomaticity detection benchmarks, this success may be driven by reasoning shortcuts present in existing datasets. To address this, we introduce a novel, controlled contrastive dataset (DICE) specifically designed to assess whether LLMs can effectively leverage context to disambiguate idiomatic meanings. Furthermore, we investigate the influence of collocational frequency and sentence probability—proxies for human processing known to affect idiom resolution—on model performance. Our results show that LLMs frequently fail to resolve idiomaticity when it depends on contextual understanding, performing better on sentences deemed more likely by the model. Additionally, idiom frequency influences performance but does not guarantee accurate interpretation. Our findings emphasize the limitations of current models in grasping contextual meaning and highlight the need for more context-sensitive evaluation.
SemEval-2025 Task 1: AdMIRe - Advancing Multimodal Idiomaticity Representation
Thomas Pickard | Aline Villavicencio | Maggie Mi | Wei He | Dylan Phelps | Marco Idiart
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
Idiomatic expressions present a unique challenge in NLP, as their meanings are often not directly inferable from their constituent words. Despite recent advancements in Large Language Models (LLMs), idiomaticity remains a significant obstacle to robust semantic representation. We present datasets and tasks for SemEval-2025 Task 1: AdMIRe (Advancing Multimodal Idiomaticity Representation), which challenges the community to assess and improve models’ ability to interpret idiomatic expressions in multimodal contexts and in multiple languages. Participants competed in two subtasks: ranking images based on their alignment with idiomatic or literal meanings, and predicting the next image in a sequence. The most effective methods achieved human-level performance by leveraging pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over the weaknesses in these models’ representations of idiomaticity.
From Input Perception to Predictive Insight: Modeling Model Blind Spots Before They Become Errors
Maggie Mi | Aline Villavicencio | Nafise Sadat Moosavi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Language models often struggle with idiomatic, figurative, or context-sensitive inputs, not because they produce flawed outputs, but because they misinterpret the input from the outset. We propose an input-only method for anticipating such failures using token-level likelihood features inspired by surprisal and the Uniform Information Density hypothesis. These features capture localized uncertainty in input comprehension and outperform standard baselines across five linguistically challenging datasets. We show that span-localized features improve error detection for larger models, while smaller models benefit from global patterns. Our method requires no access to outputs or hidden activations, offering a lightweight and generalizable approach to pre-generation error prediction.
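As a rough illustration of what token-level likelihood features of this kind might look like, the sketch below computes surprisal-based statistics from per-token probabilities. It is a minimal sketch and not the paper's feature set: the function name `surprisal_features`, the feature names, and the choice of surprisal variance as a proxy for Uniform Information Density are all assumptions made for illustration, and the token probabilities are assumed to come from some language model that is not shown here.

```python
import math
from statistics import mean, pvariance


def surprisal_features(token_probs, span=None):
    """Compute illustrative input-only features from token probabilities.

    token_probs: per-token probabilities assigned by a language model
                 (assumed to be given; obtaining them is out of scope here).
    span:        optional (start, end) token indices, end exclusive, marking
                 a region of interest such as a potentially idiomatic expression.
    """
    # Surprisal of a token is the negative log probability, here in bits.
    surprisals = [-math.log2(p) for p in token_probs]

    features = {
        # Global pattern: average surprisal over the whole input.
        "mean_surprisal": mean(surprisals),
        # UID-inspired proxy (an assumption): lower variance means
        # information is spread more evenly across tokens.
        "uid_variance": pvariance(surprisals),
        # Localized uncertainty: the single most surprising token.
        "max_surprisal": max(surprisals),
    }

    if span is not None:
        start, end = span
        # Span-localized feature: average surprisal inside the span only.
        features["span_mean_surprisal"] = mean(surprisals[start:end])

    return features


# Hypothetical probabilities for a four-token input; tokens 1-2 form the span.
example = surprisal_features([0.5, 0.25, 0.125, 0.5], span=(1, 3))
```

Features like these could then be passed to any standard classifier to predict whether a model is likely to err on the input; that downstream step is left out of the sketch.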
2024
Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection
Dylan Phelps | Thomas Pickard | Maggie Mi | Edward Gow-Smith | Aline Villavicencio
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
Despite the recent ubiquity of large language models and their high zero-shot prompted performance across a wide range of tasks, it is still not known how well they perform on tasks which require processing of potentially idiomatic language. In particular, how well do such models perform in comparison to encoder-only models fine-tuned specifically for idiomaticity tasks? In this work, we attempt to answer this question by looking at the performance of a range of LLMs (both local and software-as-a-service models) on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE. Overall, we find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales (e.g. for GPT-4). Nevertheless, we do see consistent performance improvements across model scale. Additionally, we investigate prompting approaches to improve performance, and discuss the practicalities of using LLMs for these tasks.
ShefCDTeam at SemEval-2024 Task 4: A Text-to-Text Model for Multi-Label Classification
Meredith Gibbons | Maggie Mi | Xingyi Song | Aline Villavicencio
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
This paper presents our findings for SemEval-2024 Task 4. We submit only to subtask 1, applying the text-to-text framework using a FLAN-T5 model with a combination of parameter-efficient fine-tuning methods: low-rank adaptation and prompt tuning. Overall, we find that the system performs well in English, but performance is limited in Bulgarian, North Macedonian and Arabic. Our analysis raises interesting questions about the effects of label order and label names when applying the text-to-text framework.