Abstract
This paper presents NER-UK 2.0, a corpus of texts in the Ukrainian language manually annotated for the named entity recognition task. The corpus contains 560 texts of multiple genres, boasting 21,993 entities in total. The annotation scheme covers 13 entity types, namely location, person name, organization, artifact, document, job title, date, time, period, money, percentage, quantity, and miscellaneous. Such a rich set of entities makes the corpus valuable for training named-entity recognition models in various domains, including news, social media posts, legal documents, and procurement contracts. The paper presents an updated baseline solution for named entity recognition in Ukrainian with 0.89 F1. The corpus is the largest of its kind for the Ukrainian language and is available for download.
- Anthology ID:
- 2024.unlp-1.4
- Volume:
- Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Mariana Romanyshyn, Nataliia Romanyshyn, Andrii Hlybovets, Oleksii Ignatenko
- Venue:
- UNLP
- Publisher:
- ELRA and ICCL
- Pages:
- 23–29
- URL:
- https://aclanthology.org/2024.unlp-1.4
- Cite (ACL):
- Dmytro Chaplynskyi and Mariana Romanyshyn. 2024. Introducing NER-UK 2.0: A Rich Corpus of Named Entities for Ukrainian. In Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024, pages 23–29, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- Introducing NER-UK 2.0: A Rich Corpus of Named Entities for Ukrainian (Chaplynskyi & Romanyshyn, UNLP 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.unlp-1.4.pdf