Using LLMs to Advance Idiom Corpus Construction

Doğukan Arslan; Hüseyin Anıl Çakmak; Gülşen Eryiğit; Joakim Nivre

Abstract

Idiom corpora typically include both idiomatic and literal examples of potentially idiomatic expressions, but creating such corpora traditionally requires substantial expert effort and cost. In this article, we explore the use of large language models (LLMs) to generate synthetic idiom corpora as a more time- and cost-efficient alternative. We evaluate the effectiveness of synthetic data for idiomaticity detection in two ways: training task-specific models, and testing GPT-4 in a few-shot prompting setting that uses synthetic data. Our findings reveal that although models trained on synthetic data perform worse than those trained on human-generated data, synthetic data generation offers considerable advantages in terms of cost and time. Specifically, task-specific idiomaticity detection models trained on synthetic data outperform the general-purpose LLM that generated the data when that LLM is evaluated in a zero-shot setting, achieving an average improvement of 11 percentage points across four languages. Moreover, synthetic data enhances the LLM’s performance: with few-shot prompting, it matches the task-specific models trained on synthetic data.

Anthology ID: 2025.mwe-1.4
Volume: Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025)
Month: May
Year: 2025
Address: Albuquerque, New Mexico, U.S.A.
Editors: Atul Kr. Ojha, Voula Giouli, Verginica Barbu Mititelu, Mathieu Constant, Gražina Korvel, A. Seza Doğruöz, Alexandre Rademaker
Venues: MWE | WS
Publisher: Association for Computational Linguistics
Pages: 21–31
URL: https://preview.aclanthology.org/fix-sig-urls/2025.mwe-1.4/
Cite (ACL): Doğukan Arslan, Hüseyin Anıl Çakmak, Gulsen Eryigit, and Joakim Nivre. 2025. Using LLMs to Advance Idiom Corpus Construction. In Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025), pages 21–31, Albuquerque, New Mexico, U.S.A. Association for Computational Linguistics.
Cite (Informal): Using LLMs to Advance Idiom Corpus Construction (Arslan et al., MWE 2025)
PDF: https://preview.aclanthology.org/fix-sig-urls/2025.mwe-1.4.pdf
