Abstract
Treebanks annotated with Universal Dependencies (UD) are currently available for over 100 languages and are widely utilized by the community. However, their inherent characteristics are hard to measure and are only partially reflected in parser evaluations via accuracy metrics like LAS. In this study, we analyze a large subset of the UD treebanks using three recently proposed accuracy-free dataset analysis methods: dataset cartography, 𝒱-information, and minimum description length. Each method provides insights about UD treebanks that would remain undetected if only LAS was considered. Specifically, we identify a number of treebanks that, despite yielding high LAS, contain very little information that is usable by a parser to surpass what can be achieved by simple heuristics. Furthermore, we make note of several treebanks that score consistently low across numerous metrics, indicating a high degree of noise or annotation inconsistency present therein.

- Anthology ID: 2023.eacl-main.76
- Volume: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
- Month: May
- Year: 2023
- Address: Dubrovnik, Croatia
- Venue: EACL
- Publisher: Association for Computational Linguistics
- Pages: 1076–1089
- URL: https://aclanthology.org/2023.eacl-main.76
- Cite (ACL): Artur Kulmizev and Joakim Nivre. 2023. Investigating UD Treebanks via Dataset Difficulty Measures. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1076–1089, Dubrovnik, Croatia. Association for Computational Linguistics.
- Cite (Informal): Investigating UD Treebanks via Dataset Difficulty Measures (Kulmizev & Nivre, EACL 2023)
- PDF: https://preview.aclanthology.org/paclic-22-ingestion/2023.eacl-main.76.pdf