Dependencies over Times and Tools (DoTT)
Andy Luecking, Giuseppe Abrami, Leon Hammerla, Marc Rahn, Daniel Baumartz, Steffen Eger, Alexander Mehler
Abstract
Purpose: Based on the examples of English and German, we investigate to what extent parsers trained on modern variants of these languages can be transferred to older language levels without loss. Methods: We developed a treebank called DoTT (https://github.com/texttechnologylab/DoTT) which covers, roughly, the time period from 1800 until today, in conjunction with the further development of the annotation tool DependencyAnnotator. DoTT consists of a collection of diachronic corpora enriched with dependency annotations using 3 parsers, 6 pre-trained language models, 5 newly trained models for German, and two tag sets (TIGER and Universal Dependencies). To assess how the different parsers perform on texts from different time periods, we created a gold standard sample as a benchmark. Results: We found that the parsers/models perform quite well on modern texts (document-level LAS ranging from 82.89 to 88.54) and slightly worse on older texts, as expected (average document-level LAS 84.60 vs. 86.14), but not significantly. For German texts, the (German) TIGER scheme achieved slightly better results than UD. Conclusion: Overall, this result speaks for the transferability of parsers to past language levels, at least dating back until around 1800. This very transferability, it is however argued, means that studies of language change in the field of dependency syntax can draw on dependency distance but miss out on some grammatical phenomena.- Anthology ID:
- 2024.lrec-main.415
- Volume:
- Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
- Venues:
- LREC | COLING
- SIG:
- Publisher:
- ELRA and ICCL
- Note:
- Pages:
- 4641–4653
- Language:
- URL:
- https://preview.aclanthology.org/icon-24-ingestion/2024.lrec-main.415/
- DOI:
- Cite (ACL):
- Andy Luecking, Giuseppe Abrami, Leon Hammerla, Marc Rahn, Daniel Baumartz, Steffen Eger, and Alexander Mehler. 2024. Dependencies over Times and Tools (DoTT). In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4641–4653, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- Dependencies over Times and Tools (DoTT) (Luecking et al., LREC-COLING 2024)
- PDF:
- https://preview.aclanthology.org/icon-24-ingestion/2024.lrec-main.415.pdf