Alba Táboas García
2025
Assessing the Agreement Competence of Large Language Models
Alba Táboas García | Leo Wanner
Proceedings of the Eighth International Conference on Dependency Linguistics (Depling, SyntaxFest 2025)
While the competence of LLMs to cope with agreement constraints has been widely tested in English, only a very limited number of works deal with morphologically rich(er) languages. In this work, we experiment with 25 mono- and multilingual LLMs, applying them to a collection of more than 5,000 test examples that cover the main agreement phenomena in three Romance languages (Italian, Portuguese, and Spanish) and one Slavic language (Russian). We identify which agreement phenomena are most difficult for which models and challenge some common assumptions about what makes a good model. The test suites into which the test examples are organized are openly available and can be easily adapted to other agreement phenomena and other languages for further research.
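Agreement test suites of this kind are typically organized as minimal pairs: a grammatical sentence and a minimally different ungrammatical counterpart, with a model counted as correct when it scores the grammatical variant higher. A minimal sketch of that evaluation loop, assuming the pair/scorer structure below (the names `MinimalPair`, `prefers_grammatical`, and `accuracy` are illustrative, not from the paper):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MinimalPair:
    """One agreement test item: a grammatical sentence and its
    minimally different ungrammatical counterpart."""
    grammatical: str
    ungrammatical: str
    phenomenon: str  # e.g. "subject-verb", "noun-adjective"

def prefers_grammatical(pair: MinimalPair,
                        score: Callable[[str], float]) -> bool:
    """A model 'passes' an item if it assigns a higher score
    (e.g. total log-probability) to the grammatical variant."""
    return score(pair.grammatical) > score(pair.ungrammatical)

def accuracy(pairs: List[MinimalPair],
             score: Callable[[str], float]) -> float:
    """Fraction of items on which the model prefers the grammatical form."""
    return sum(prefers_grammatical(p, score) for p in pairs) / len(pairs)
```

In practice `score` would wrap an LLM's sentence log-probability; here any `str -> float` callable works, so the same harness can compare mono- and multilingual models or be extended with new phenomena by adding pairs.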
Exploring morphology-aware tokenization: A case study on Spanish language modeling
Alba Táboas García | Piotr Przybyła | Leo Wanner
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
This paper investigates to what extent the integration of morphological information can improve subword tokenization and thus also language modeling performance. We focus on Spanish, a language with fusional morphology, where subword segmentation can benefit from linguistic structure. Instead of relying on purely data-driven strategies like Byte Pair Encoding (BPE), we explore a linguistically grounded approach: training a tokenizer on morphologically segmented data. To do so, we develop a semi-supervised segmentation model for Spanish, building gold-standard datasets to guide and evaluate it. We then use this tokenizer to pre-train a masked language model and assess its performance on several downstream tasks. Our results show improvements over a baseline with a standard tokenizer, supporting our hypothesis that morphology-aware tokenization offers a viable and principled alternative for improving language modeling.
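The core idea of training a tokenizer on morphologically segmented data can be illustrated with a toy pre-segmentation step. This is not the paper's semi-supervised segmentation model: the lexicon entries and the `presegment` helper below are hypothetical, standing in for gold-standard morph boundaries that a standard subword trainer (e.g. BPE) would then see as unit boundaries:

```python
# Hypothetical gold segmentations; the paper learns these with a
# semi-supervised model rather than a fixed lookup table.
MORPH_LEXICON = {
    "gatos": ["gat", "o", "s"],
    "blancas": ["blanc", "a", "s"],
}

def presegment(text: str) -> str:
    """Rewrite each known word as its morphs, joined with a boundary
    marker ('@@'), so a data-driven subword trainer run on the output
    tends to learn morph-shaped units. Unknown words pass through."""
    out = []
    for word in text.split():
        morphs = MORPH_LEXICON.get(word.lower(), [word])
        out.append("@@ ".join(morphs))
    return " ".join(out)

print(presegment("gatos blancas"))  # gat@@ o@@ s blanc@@ a@@ s
```

Training the tokenizer on such pre-segmented text, then pre-training a masked language model with it, is the pipeline the abstract describes; the fusional morphology of Spanish (stem + gender + number in one word form) is what makes these boundaries informative.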
2021
Assessing the Syntactic Capabilities of Transformer-based Multilingual Language Models
Laura Pérez-Mayos | Alba Táboas García | Simon Mille | Leo Wanner
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021