Ofri Hefetz


2025

We investigate the identification of idiomatic expressions—a semantically non-compositional subclass of multiword expressions (MWEs)—in running text using large language models (LLMs) without any fine-tuning. Instead, we adopt a prompt-based approach and evaluate a range of prompting strategies, including zero-shot, few-shot, and chain-of-thought variants, across multiple languages, datasets, and model types. Our experiments show that, with well-crafted prompts, LLMs can perform competitively with supervised models trained on annotated data. These findings highlight the potential of prompt-based LLMs as a flexible and effective alternative for idiomatic expression identification.
We investigate cross-lingual fine-tuning for idiomatic expression identification, addressing the limited availability of annotated data in many languages. We evaluate encoder and generative decoder models to examine their ability to generalize idiom identification across languages. Additionally, we conduct an explainability study using linear probing and LogitLens to analyze how idiomatic meaning is represented across model layers. Results show consistent cross-lingual transfer, with English emerging as a strong source language. All code and models are released to support future research.