An Idiom Benchmark for Turkish

Ebru Çavuşoğlu, Cagri Coltekin


Abstract
Despite significant recent advances, idioms, like other forms of figurative language, remain a challenge for natural language processing (NLP). Benchmark corpora are essential for improving how well current models understand idioms. However, such corpora are available only for a limited set of languages. In this paper, we introduce our ongoing work on a benchmark corpus of Turkish idioms. Our corpus is structured for testing both idiom recognition and idiom understanding. The corpus currently consists of 200 instances, with sentences containing idiomatic use, their literal paraphrases, similar sentences with no entailment, and, when possible, non-idiomatic uses of the idiomatic expressions. We describe the methodology used to create the corpus, as well as initial experiments with a selection of LLMs.
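As an illustration of the instance structure described in the abstract, the following is a minimal sketch of how one record might be represented and turned into test items. The class name, field names, and helper functions are hypothetical and not taken from the paper, the mapping to entailment labels is an assumption, and the strings are placeholders rather than corpus data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class IdiomInstance:
    """One hypothetical corpus record; all field names are illustrative assumptions."""
    idiom: str
    idiomatic_sentence: str        # sentence using the expression idiomatically
    literal_paraphrase: str        # literal restatement of the idiomatic sentence
    non_entailed_sentence: str     # similar sentence that is not entailed
    non_idiomatic_sentence: Optional[str] = None  # literal use, only when possible


def understanding_pairs(inst: IdiomInstance) -> List[Tuple[str, str, bool]]:
    """Assumed setup: (premise, hypothesis, is_entailed) pairs for idiom understanding."""
    return [
        (inst.idiomatic_sentence, inst.literal_paraphrase, True),
        (inst.idiomatic_sentence, inst.non_entailed_sentence, False),
    ]


def recognition_items(inst: IdiomInstance) -> List[Tuple[str, bool]]:
    """Assumed setup: (sentence, is_idiomatic) items for idiom recognition."""
    items = [(inst.idiomatic_sentence, True)]
    if inst.non_idiomatic_sentence is not None:
        items.append((inst.non_idiomatic_sentence, False))
    return items


# Placeholder record; the strings stand in for real Turkish sentences.
example = IdiomInstance(
    idiom="<idiom>",
    idiomatic_sentence="<idiomatic sentence>",
    literal_paraphrase="<literal paraphrase>",
    non_entailed_sentence="<similar sentence, no entailment>",
)
```

Keeping the non-idiomatic sentence optional mirrors the "when possible" qualification in the abstract: a record without one still yields understanding pairs but contributes only a positive item to recognition.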
Anthology ID:
2026.mwe-1.12
Volume:
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Atul Kr. Ojha, Verginica Barbu Mititelu, Mathieu Constant, Ivelina Stoyanova, A. Seza Doğruöz, Alexandre Rademaker
Venues:
MWE | WS
Publisher:
Association for Computational Linguistics
Pages:
103–109
URL:
https://preview.aclanthology.org/ingest-eacl/2026.mwe-1.12/
Cite (ACL):
Ebru Çavuşoğlu and Cagri Coltekin. 2026. An Idiom Benchmark for Turkish. In Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026), pages 103–109, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
An Idiom Benchmark for Turkish (Çavuşoğlu & Coltekin, MWE 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.mwe-1.12.pdf