Transfer-based Enrichment of a Hungarian Named Entity Dataset

Attila Novák, Borbála Novák


Abstract
In this paper, we present a major update to the first Hungarian named entity dataset, the Szeged NER corpus. We used zero-shot cross-lingual transfer to initialize the enrichment of the entity types annotated in the corpus, using three neural NER models: two trained on the English OntoNotes corpus and one on the Czech Named Entity Corpus, fine-tuned from multilingual neural language models. The output of the models was automatically merged with the original NER annotation, then corrected both automatically and manually, and further enriched with additional annotation, such as qualifiers for various entity types. We evaluate the zero-shot performance of the two OntoNotes-based models, as well as a new transformer-based NER model trained on the training part of the final corpus. We release the corpus and the trained model.
Anthology ID:
2021.ranlp-1.119
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Month:
September
Year:
2021
Address:
Held Online
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
1060–1067
URL:
https://aclanthology.org/2021.ranlp-1.119
Cite (ACL):
Attila Novák and Borbála Novák. 2021. Transfer-based Enrichment of a Hungarian Named Entity Dataset. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 1060–1067, Held Online. INCOMA Ltd.
Cite (Informal):
Transfer-based Enrichment of a Hungarian Named Entity Dataset (Novák & Novák, RANLP 2021)
PDF:
https://preview.aclanthology.org/landing_page/2021.ranlp-1.119.pdf