GliLem: Leveraging GliNER for Contextualized Lemmatization in Estonian

Aleksei Dorkin, Kairit Sirts


Abstract
We present GliLem—a novel hybrid lemmatization system for Estonian that enhances the highly accurate rule-based morphological analyzer Vabamorf with an external disambiguation module based on GliNER—an open vocabulary NER model that is able to match text spans with text labels in natural language. We leverage the flexibility of a pre-trained GliNER model to improve the lemmatization accuracy of Vabamorf by 10% compared to its original disambiguation module and achieve an improvement over the token classification-based baseline. To measure the impact of improvements in lemmatization accuracy on the information retrieval downstream task, we first created an information retrieval dataset for Estonian by automatically translating the DBpedia-Entity dataset from English. We benchmark several token normalization approaches, including lemmatization, on the created dataset using the BM25 algorithm. We observe a substantial improvement in IR metrics when using lemmatization over simplistic stemming. The benefits of improving lemma disambiguation accuracy manifest in small but consistent improvement in the IR recall measure, especially in the setting of high k.
Anthology ID:
2025.nodalida-1.10
Volume:
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Month:
march
Year:
2025
Address:
Tallinn, Estonia
Editors:
Richard Johansson, Sara Stymne
Venue:
NoDaLiDa
SIG:
Publisher:
University of Tartu Library
Note:
Pages:
86–97
Language:
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.nodalida-1.10/
DOI:
Bibkey:
Cite (ACL):
Aleksei Dorkin and Kairit Sirts. 2025. GliLem: Leveraging GliNER for Contextualized Lemmatization in Estonian. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), pages 86–97, Tallinn, Estonia. University of Tartu Library.
Cite (Informal):
GliLem: Leveraging GliNER for Contextualized Lemmatization in Estonian (Dorkin & Sirts, NoDaLiDa 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.nodalida-1.10.pdf