Laura Castro
2026
Incorporating Multiword Expressions in Galician Neural Machine Translation: Compositionality, Efficiency, and Performance
Daniel Solla | Paula Pinto-Ferro | Laura Castro | Pablo Gamallo | Marcos Garcia
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)
This paper explores the behavior of neural machine translation models on two newly introduced datasets containing noun-adjective MWEs with different degrees of semantic ambiguity and compositionality. We compare general-domain machine translation systems with fine-tuned models exposed to small subsets of the target MWEs. By assessing the effects of the learning steps and corpus size, we found that carefully designed fine-tuning may improve MWE handling while mitigating catastrophic forgetting. However, our error analysis reveals that models still struggle in several scenarios, particularly when translating MWEs with idiomatic meanings. Both the datasets and the experiments focus on translation involving Galician, English, and Spanish.
2025
Gathering Compositionality Ratings of Ambiguous Noun-Adjective Multiword Expressions in Galician
Laura Castro | Marcos Garcia
Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025)
Multiword expressions pose numerous challenges to most NLP tasks, and so do their compositionality and semantic ambiguity. The need for resources that make it possible to explore such phenomena is rather pressing, even more so in the case of low-resource languages. In this paper, we present a dataset of noun-adjective compounds in Galician with compositionality scores at token level. These MWEs are ambiguous due to being potentially idiomatic expressions, as well as due to the ambiguity and productivity of their constituents. The dataset comprises 240 MWEs that amount to 322 senses, which are contextualized in two sets of sentences, one manually created and one extracted from corpora, totaling 1,858 examples. For this dataset, we gathered human judgments on compositionality levels for compounds, heads, and modifiers. Furthermore, we obtained frequency, ambiguity, and productivity data for compounds and their constituents, and we explored potential correlations between mean compositionality scores and these three properties in terms of compounds, heads, and modifiers. This valuable resource helps evaluate language models on (non-)compositionality and ambiguity, key challenges in NLP, and is especially relevant for Galician, a low-resource variety lacking annotated datasets for such linguistic phenomena.