Estonian Named Entity Recognition: New Datasets and Models

Kairit Sirts


Abstract
This paper presents the annotation process of two Estonian named entity recognition (NER) datasets, involving the creation of annotation guidelines for labeling eleven different types of entities. In addition to the commonly annotated entities such as person names, organization names, and locations, the annotation scheme encompasses geopolitical entities, product names, titles/roles, events, dates, times, monetary values, and percents. The annotation was performed on two datasets, one involving reannotating an existing NER dataset primarily composed of news texts and the other incorporating new texts from news and social media domains. Transformer-based models were trained on these annotated datasets to establish baseline predictive performance. Our findings indicate that the best results were achieved by training a single model on the combined dataset, suggesting that the domain differences between the datasets are relatively small.
Anthology ID:
2023.nodalida-1.76
Volume:
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month:
May
Year:
2023
Address:
Tórshavn, Faroe Islands
Editors:
Tanel Alumäe, Mark Fishel
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
752–761
URL:
https://aclanthology.org/2023.nodalida-1.76
Cite (ACL):
Kairit Sirts. 2023. Estonian Named Entity Recognition: New Datasets and Models. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 752–761, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal):
Estonian Named Entity Recognition: New Datasets and Models (Sirts, NoDaLiDa 2023)
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/2023.nodalida-1.76.pdf