Variety delights (sometimes) - Annotation differences in morphologically annotated corpora

Andrea Dömötör, Balázs Indig, Dávid Márk Nemeskey


Abstract
The goal of annotation standards is to ensure consistency across different corpora and languages. But do they succeed? In our paper we experiment with morphologically annotated Hungarian corpora of different sizes (ELTE DH gold standard corpus, NYTK-NerKor, and Szeged Treebank) to assess their compatibility as a merged training corpus for morphological analysis and disambiguation. Our results show that combining any two corpora not only failed to improve the results of the trained tagger but even degraded them due to inconsistent annotations. Further analysis of the annotation differences among the corpora revealed inconsistencies stemming from several sources: differing theoretical approaches, lack of consensus, and tagset conversion issues.
Anthology ID:
2025.law-1.22
Volume:
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Siyao Peng, Ines Rehbein
Venues:
LAW | WS
Publisher:
Association for Computational Linguistics
Pages:
270–278
URL:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.law-1.22/
DOI:
10.18653/v1/2025.law-1.22
Cite (ACL):
Andrea Dömötör, Balázs Indig, and Dávid Márk Nemeskey. 2025. Variety delights (sometimes) - Annotation differences in morphologically annotated corpora. In Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025), pages 270–278, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Variety delights (sometimes) - Annotation differences in morphologically annotated corpora (Dömötör et al., LAW 2025)
PDF:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.law-1.22.pdf