MultiCoPIE: A Multilingual Corpus of Potentially Idiomatic Expressions for Cross-lingual PIE Disambiguation
Uliana Sentsova, Debora Ciminari, Josef Van Genabith, Cristina España-Bonet
Abstract
Language models are able to handle compositionality and, to some extent, non-compositional phenomena such as semantic idiosyncrasy, a feature most prominent in the case of idioms. This work introduces the MultiCoPIE corpus, which includes potentially idiomatic expressions (PIEs) in Catalan, Italian, and Russian, extending the language coverage of PIE corpus data. The new corpus provides additional linguistic features of idioms, such as their semantic compositionality, the part of speech of the idiom head, and their corresponding idiomatic expressions in English. With this new resource at hand, we first fine-tune an XLM-RoBERTa model to classify figurative and literal usage of potentially idiomatic expressions in English. We then study cross-lingual transfer to the languages represented in the MultiCoPIE corpus, evaluating the model’s ability to generalize an idiom-related task to languages not seen during fine-tuning. We show the effect of ‘cross-lingual lexical overlap’: the performance of the model, fine-tuned on English idiomatic expressions and tested on the MultiCoPIE languages, increases significantly when classifying ‘shared idioms’, that is, idiomatic expressions that have direct counterparts in English with similar form and meaning. While this observation raises questions about the generalizability of cross-lingual learning, the results from experiments on PIEs provide strong evidence of effective cross-lingual transfer, even when accounting for idioms similar across languages.
- Anthology ID:
- 2025.mwe-1.8
- Volume:
- Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025)
- Month:
- May
- Year:
- 2025
- Address:
- Albuquerque, New Mexico, U.S.A.
- Editors:
- Atul Kr. Ojha, Voula Giouli, Verginica Barbu Mititelu, Mathieu Constant, Gražina Korvel, A. Seza Doğruöz, Alexandre Rademaker
- Venues:
- MWE | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 67–81
- URL:
- https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.mwe-1.8/
- Cite (ACL):
- Uliana Sentsova, Debora Ciminari, Josef Van Genabith, and Cristina España-Bonet. 2025. MultiCoPIE: A Multilingual Corpus of Potentially Idiomatic Expressions for Cross-lingual PIE Disambiguation. In Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025), pages 67–81, Albuquerque, New Mexico, U.S.A. Association for Computational Linguistics.
- Cite (Informal):
- MultiCoPIE: A Multilingual Corpus of Potentially Idiomatic Expressions for Cross-lingual PIE Disambiguation (Sentsova et al., MWE 2025)
- PDF:
- https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.mwe-1.8.pdf