2025
Using LLMs to Advance Idiom Corpus Construction
Doğukan Arslan | Hüseyin Anıl Çakmak | Gulsen Eryigit | Joakim Nivre
Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025)
Idiom corpora typically include both idiomatic and literal examples of potentially idiomatic expressions, but creating such corpora traditionally requires substantial expert effort and cost. In this article, we explore the use of large language models (LLMs) to generate synthetic idiom corpora as a more time- and cost-efficient alternative. We evaluate the effectiveness of synthetic data both for training task-specific idiomaticity detection models and for few-shot prompting of GPT-4 on idiomaticity detection. Our findings reveal that although models trained on synthetic data perform worse than those trained on human-generated data, synthetic data generation offers considerable advantages in cost and time. Specifically, task-specific idiomaticity detection models trained on synthetic data outperform the general-purpose LLM that generated the data when the LLM is evaluated in a zero-shot setting, achieving an average improvement of 11 percentage points across four languages. Moreover, synthetic data enhances the LLM’s performance, enabling it to match the task-specific models trained with synthetic data when few-shot prompting is applied.
Findings of the UniDive 2025 shared task on multilingual Morpho-Syntactic Parsing
Omer Goldman | Leonie Weissweiler | Kutay Acar | Diego Alves | Anna Baczkowska | Gulsen Eryigit | Lenka Krippnerová | Adriana Pagano | Tanja Samardžić | Luigi Talamo | Alina Wróblewska | Daniel Zeman | Joakim Nivre | Reut Tsarfaty
Proceedings of The UniDive 2025 Shared Task on Multilingual Morpho-Syntactic Parsing
This paper details the findings of the 2025 UniDive shared task on multilingual morphosyntactic parsing. It introduces a new representation in which morphology and syntax are modelled jointly to form dependency trees of contentful elements, each characterized by features determined by grammatical words and morphemes. This schema bypasses the theoretical debate over the definition of “words” and encourages the development of parsers for typologically diverse languages. The data for the task, spanning 9 languages, was annotated based on existing Universal Dependencies (UD) treebanks that were adapted to the new format. We accompany the data with a new metric, MSLAS, which combines syntactic LAS with F1 over grammatical features. The task received two submissions, which together with three baselines give a detailed view of the ability of multi-task encoder models to cope with the task at hand. The best performing system, UM, achieved 78.7 MSLAS macro-averaged over all languages, improving by 31.4 points over the few-shot prompting baseline.
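The abstract describes MSLAS as combining syntactic LAS with F1 over grammatical features, without giving the exact formula. As a minimal sketch of the two ingredients, the following computes LAS over (head, label) pairs and micro-F1 over per-node feature sets, joined by a hypothetical harmonic-mean combination; the function names and the combination itself are illustrative assumptions, not the shared task's official definition:

```python
def las(gold, pred):
    # Labeled Attachment Score: fraction of nodes whose
    # (head, dependency label) pair matches the gold annotation.
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)

def feature_f1(gold_feats, pred_feats):
    # Micro-averaged F1 over sets of grammatical features per node.
    tp = sum(len(g & p) for g, p in zip(gold_feats, pred_feats))
    fp = sum(len(p - g) for g, p in zip(gold_feats, pred_feats))
    fn = sum(len(g - p) for g, p in zip(gold_feats, pred_feats))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def mslas_like(gold, pred, gold_feats, pred_feats):
    # Hypothetical combination: harmonic mean of LAS and feature F1.
    # The actual MSLAS definition may combine them differently.
    a, f = las(gold, pred), feature_f1(gold_feats, pred_feats)
    return 2 * a * f / (a + f) if a + f else 0.0
```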
Typology-aware Multilingual Morphosyntactic Parsing with Functional Node Filtering
Kutay Acar | Gulsen Eryigit
Proceedings of The UniDive 2025 Shared Task on Multilingual Morpho-Syntactic Parsing
This paper presents a system for the UniDive Morphosyntactic Parsing (MSP) Shared Task, where it ranked second overall among participating teams. The task introduces a morphosyntactic representation that jointly models syntactic dependencies and morphological features by treating content-bearing elements as graph nodes and encoding functional elements as feature annotations; this poses challenges for conventional parsers and calls for more flexible, linguistically informed approaches. The proposed system combines a typology-aware multitask parser with a multilingual content/function classifier to handle structural variation across languages. The architecture uses adapter modules and language embeddings to encode typological information. Evaluations across 9 typologically varied languages confirm that the system can accurately replicate both universal and language-specific morphosyntactic patterns.