Abstract
In this paper, we present a major update to the first Hungarian named entity dataset, the Szeged NER corpus. We used zero-shot cross-lingual transfer to initialize the enrichment of entity types annotated in the corpus using three neural NER models: two of them based on the English OntoNotes corpus and one based on the Czech Named Entity Corpus finetuned from multilingual neural language models. The output of the models was automatically merged with the original NER annotation, and automatically and manually corrected and further enriched with additional annotation, like qualifiers for various entity types. We present the evaluation of the zero-shot performance of the two OntoNotes-based models and a transformer-based new NER model trained on the training part of the final corpus. We release the corpus and the trained model.
- Anthology ID:
- 2021.ranlp-1.119
- Volume:
- Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
- Month:
- September
- Year:
- 2021
- Address:
- Held Online
- Editors:
- Ruslan Mitkov, Galia Angelova
- Venue:
- RANLP
- Publisher:
- INCOMA Ltd.
- Pages:
- 1060–1067
- URL:
- https://aclanthology.org/2021.ranlp-1.119
- Cite (ACL):
- Attila Novák and Borbála Novák. 2021. Transfer-based Enrichment of a Hungarian Named Entity Dataset. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 1060–1067, Held Online. INCOMA Ltd.
- Cite (Informal):
- Transfer-based Enrichment of a Hungarian Named Entity Dataset (Novák & Novák, RANLP 2021)
- PDF:
- https://preview.aclanthology.org/landing_page/2021.ranlp-1.119.pdf