Dependencies over Times and Tools (DoTT)

Andy Luecking, Giuseppe Abrami, Leon Hammerla, Marc Rahn, Daniel Baumartz, Steffen Eger, Alexander Mehler


Abstract
Purpose: Using English and German as examples, we investigate to what extent parsers trained on modern variants of these languages can be transferred to older language stages without loss. Methods: We developed a treebank called DoTT (https://github.com/texttechnologylab/DoTT), which covers roughly the period from 1800 until today, in conjunction with the further development of the annotation tool DependencyAnnotator. DoTT consists of a collection of diachronic corpora enriched with dependency annotations using three parsers, six pre-trained language models, five newly trained models for German, and two tag sets (TIGER and Universal Dependencies). To assess how the different parsers perform on texts from different time periods, we created a gold standard sample as a benchmark. Results: We found that the parsers/models perform quite well on modern texts (document-level LAS ranging from 82.89 to 88.54) and, as expected, slightly but not significantly worse on older texts (average document-level LAS 84.60 vs. 86.14). For German texts, the (German) TIGER scheme achieved slightly better results than UD. Conclusion: Overall, this result speaks for the transferability of parsers to past language stages, at least as far back as around 1800. It is argued, however, that this very transferability means that studies of language change in the field of dependency syntax can draw on dependency distance but miss out on some grammatical phenomena.
Anthology ID:
2024.lrec-main.415
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
4641–4653
URL:
https://preview.aclanthology.org/icon-24-ingestion/2024.lrec-main.415/
Cite (ACL):
Andy Luecking, Giuseppe Abrami, Leon Hammerla, Marc Rahn, Daniel Baumartz, Steffen Eger, and Alexander Mehler. 2024. Dependencies over Times and Tools (DoTT). In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4641–4653, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Dependencies over Times and Tools (DoTT) (Luecking et al., LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2024.lrec-main.415.pdf