Data-driven Choices in Neural Part-of-Speech Tagging for Latin

Geoff Bacon


Abstract
Textual data in ancient and historical languages such as Latin is increasingly available in machine-readable form, yet computational tools to analyze and process this data are still lacking. We describe our system for part-of-speech tagging in Latin, an entry in the EvaLatin 2020 shared task. Based on a detailed analysis of the training data, we make targeted preprocessing decisions and design our model. We leverage existing large unlabelled resources to pre-train representations at both the grapheme and word level, which serve as the inputs to our LSTM-based models. We perform an extensive cross-validated hyperparameter search, achieving an accuracy score of up to 93 on in-domain texts. We publicly release all our code and trained models in the hope that our system will be of use to social scientists and digital humanists alike. The insights we draw from our initial analysis can also inform future NLP work modeling syntactic information in Latin.
Anthology ID:
2020.lt4hala-1.17
Volume:
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LT4HALA
Publisher:
European Language Resources Association (ELRA)
Pages:
111–113
Language:
English
URL:
https://aclanthology.org/2020.lt4hala-1.17
Cite (ACL):
Geoff Bacon. 2020. Data-driven Choices in Neural Part-of-Speech Tagging for Latin. In Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages, pages 111–113, Marseille, France. European Language Resources Association (ELRA).
Cite (Informal):
Data-driven Choices in Neural Part-of-Speech Tagging for Latin (Bacon, LT4HALA 2020)
PDF:
https://preview.aclanthology.org/author-url/2020.lt4hala-1.17.pdf