Beyond the Data: The Impact of Annotation Inconsistencies in UD Treebanks on Typological Universals and Complexity Assessment

Antoni Brosa Rodríguez, M. Dolores Jiménez López


Abstract
This study explores the impact of annotation inconsistencies in Universal Dependencies (UD) treebanks on typological research in computational linguistics. UD provides a standardized framework for cross-linguistic annotation, facilitating large-scale empirical studies on linguistic diversity and universals. However, despite rigorous guidelines, annotation inconsistencies persist across treebanks. The objective of this paper is to assess how these inconsistencies affect typological universals, linguistic descriptions, and complexity metrics. We analyze systematic annotation errors in multiple UD treebanks, focusing on morphological features. Case studies on Spanish and Dutch demonstrate how differing annotation decisions within the same language create contradictory typological profiles. We classify the errors into two main categories: overgeneration errors (features incorrectly annotated, since do not actually exist in a language) and data omission errors (inconsistent or incomplete annotation of features that do exist). Our results show that these inconsistencies significantly distort typological analyses, leading to false generalizations and miscalculations of linguistic complexity. We propose methodological safeguards for typological research using UD data. Our findings highlight the need for methodological improvements to ensure more reliable cross-linguistic generalizations in computational typology.
Anthology ID:
2025.sigtyp-1.5
Volume:
Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Month:
August
Year:
2025
Address:
Vinenna. Austria
Editors:
Michael Hahn, Priya Rani, Ritesh Kumar, Andreas Shcherbakov, Alexey Sorokin, Oleg Serikov, Ryan Cotterell, Ekaterina Vylomova
Venues:
SIGTYP | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
43–51
Language:
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.sigtyp-1.5/
DOI:
Bibkey:
Cite (ACL):
Antoni Brosa Rodríguez and M. Dolores Jiménez López. 2025. Beyond the Data: The Impact of Annotation Inconsistencies in UD Treebanks on Typological Universals and Complexity Assessment. In Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pages 43–51, Vinenna. Austria. Association for Computational Linguistics.
Cite (Informal):
Beyond the Data: The Impact of Annotation Inconsistencies in UD Treebanks on Typological Universals and Complexity Assessment (Rodríguez & López, SIGTYP 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.sigtyp-1.5.pdf