Comparison of Current Approaches to Lemmatization: A Case Study in Estonian

Aleksei Dorkin, Kairit Sirts


Abstract
This study evaluates three different lemmatization approaches to Estonian: Generative character-level models, Pattern-based word-level classification models, and rule-based morphological analysis. According to our experiments, a significantly smaller Generative model consistently outperforms the Pattern-based classification model based on EstBERT. Additionally, we observe a relatively small overlap in errors made by all three models, indicating that an ensemble of different approaches could lead to improvements.
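The link between error overlap and ensembling can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names (`error_set`, `overlap_report`), the system labels, and the six-token toy data are all hypothetical and chosen only to show the arithmetic. Tokens that every system gets wrong bound what any ensemble could recover, so a small shared error set leaves room for improvement.

```python
def error_set(gold, predicted):
    """Return the indices of tokens whose predicted lemma differs from the gold lemma."""
    return {i for i, (g, p) in enumerate(zip(gold, predicted)) if g != p}


def overlap_report(gold, predictions):
    """Compare the error sets of several lemmatizers on the same tokens.

    Returns per-system error sets, the errors shared by all systems, the
    tokens missed by at least one system, and the oracle accuracy (a token
    counts as correct if at least one system lemmatizes it correctly).
    """
    errors = {name: error_set(gold, preds) for name, preds in predictions.items()}
    shared = set.intersection(*errors.values())
    any_wrong = set.union(*errors.values())
    oracle_accuracy = 1 - len(shared) / len(gold)
    return errors, shared, any_wrong, oracle_accuracy


# Toy data, invented for illustration only (not results from the paper).
gold = ["maja", "minema", "laps", "olema", "hea", "raamat"]
predictions = {
    "generative": ["maja", "minema", "lapsi", "olema", "hea", "raamat"],
    "pattern": ["maja", "läheb", "lapsi", "olema", "hea", "raamat"],
    "rule_based": ["maja", "minema", "lapsi", "olema", "hea", "raamatu"],
}

errors, shared, any_wrong, oracle_accuracy = overlap_report(gold, predictions)
print("errors per system:", errors)
print("shared by all systems:", shared)
print("missed by at least one system:", any_wrong)
print("oracle accuracy:", round(oracle_accuracy, 3))
```

In this toy setup the systems disagree on most of their mistakes, so an oracle that always picks a correct system would score higher than any single system. A real ensemble would fall somewhere between the best single system and that oracle bound.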
Anthology ID:
2023.nodalida-1.28
Volume:
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month:
May
Year:
2023
Address:
Tórshavn, Faroe Islands
Editors:
Tanel Alumäe, Mark Fishel
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
280–285
URL:
https://aclanthology.org/2023.nodalida-1.28
Cite (ACL):
Aleksei Dorkin and Kairit Sirts. 2023. Comparison of Current Approaches to Lemmatization: A Case Study in Estonian. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 280–285, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal):
Comparison of Current Approaches to Lemmatization: A Case Study in Estonian (Dorkin & Sirts, NoDaLiDa 2023)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2023.nodalida-1.28.pdf