Laura Castro
2025
Gathering Compositionality Ratings of Ambiguous Noun-Adjective Multiword Expressions in Galician
Laura Castro | Marcos Garcia
Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025)
Multiword expressions pose numerous challenges to most NLP tasks, and so do their compositionality and semantic ambiguity. The need for resources that make it possible to explore such phenomena is rather pressing, even more so in the case of low-resource languages. In this paper, we present a dataset of noun-adjective compounds in Galician with compositionality scores at token level. These MWEs are ambiguous due to being potentially idiomatic expressions, as well as due to the ambiguity and productivity of their constituents. The dataset comprises 240 MWEs that amount to 322 senses, which are contextualized in two sets of sentences, one manually created and one extracted from corpora, totaling 1,858 examples. For this dataset, we gathered human judgments on compositionality levels for compounds, heads, and modifiers. Furthermore, we obtained frequency, ambiguity, and productivity data for compounds and their constituents, and we explored potential correlations between mean compositionality scores and these three properties in terms of compounds, heads, and modifiers. This valuable resource helps evaluate language models on (non-)compositionality and ambiguity, key challenges in NLP, and is especially relevant for Galician, a low-resource variety lacking annotated datasets for such linguistic phenomena.
2024
Increasing manually annotated resources for Galician: the Parallel Universal Dependencies Treebank
Xulia Sánchez-Rodríguez | Albina Sarymsakova | Laura Castro | Marcos Garcia
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1