Transformer-based Part-of-Speech Tagging and Lemmatization for Latin

Krzysztof Wróbel, Krzysztof Nowak


Abstract
This paper presents our submission to the EvaLatin 2022 shared task. Our system placed first in lemmatization, part-of-speech tagging, and morphological tagging in both the closed and open modalities. The results for the cross-genre and cross-time sub-tasks show that the system handles the diachronic and diastratic variation of Latin. The architecture employs state-of-the-art transformer models: XLM-RoBERTa large for part-of-speech and morphological tagging, and ByT5 small for lemmatization. The paper also discusses part-of-speech and lemmatization errors in detail, showing how system performance may be improved for Classical, Medieval, and Neo-Latin texts.
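To make the lemmatization setup more concrete, the sketch below illustrates the byte-level vocabulary idea behind ByT5: text is represented as UTF-8 bytes, and each byte value is shifted by 3 to leave room for the reserved ids (pad, end-of-sequence, unknown). The abstract does not say how inputs and targets were formatted for the model, so the pairing of an inflected form with its lemma here is an assumption for illustration, and the helper names `encode_bytes` and `decode_bytes` are hypothetical. No model is loaded; this only shows the representation.

```python
from __future__ import annotations

# Reserved ids in the ByT5 vocabulary: 0 = pad, 1 = end-of-sequence, 2 = unknown.
# Byte values are therefore shifted by 3.
SPECIAL_OFFSET = 3
EOS_ID = 1


def encode_bytes(text: str) -> list[int]:
    """Map a string to ByT5-style ids: one id per UTF-8 byte, then EOS."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes, skipping anything outside the 256 byte ids."""
    raw = bytes(
        i - SPECIAL_OFFSET
        for i in ids
        if SPECIAL_OFFSET <= i < SPECIAL_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")


# Assumed framing (not taken from the paper): inflected form in, lemma out.
source_ids = encode_bytes("amavit")
target_ids = encode_bytes("amo")
```

Because the vocabulary is just bytes, no Latin-specific tokenizer is needed, and rare or orthographically variant word forms never fall outside the vocabulary. That property is one plausible reason a byte-level model suits lemmatization across Classical, Medieval, and Neo-Latin spelling conventions.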
Anthology ID:
2022.lt4hala-1.31
Volume:
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Rachele Sprugnoli, Marco Passarotti
Venue:
LT4HALA
Publisher:
European Language Resources Association
Pages:
193–197
URL:
https://aclanthology.org/2022.lt4hala-1.31
Cite (ACL):
Krzysztof Wróbel and Krzysztof Nowak. 2022. Transformer-based Part-of-Speech Tagging and Lemmatization for Latin. In Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, pages 193–197, Marseille, France. European Language Resources Association.
Cite (Informal):
Transformer-based Part-of-Speech Tagging and Lemmatization for Latin (Wróbel & Nowak, LT4HALA 2022)
PDF:
https://preview.aclanthology.org/improve-issue-templates/2022.lt4hala-1.31.pdf