NerKor+Cars-OntoNotes++

Attila Novák, Borbála Novák


Abstract
In this paper, we present an upgraded version of the Hungarian NYTK-NerKor named entity corpus, which contains about twice as many annotated spans and 7 times as many distinct entity types as the original version. We used an extended version of the OntoNotes 5 annotation scheme that includes time and numerical expressions. NerKor is the newest and largest NER corpus for Hungarian and covers diverse domains. To preannotate the corpus, we applied cross-lingual transfer of NER models that were trained for other languages and are based on multilingual contextual language models. We then corrected the annotation semi-automatically and manually. Zero-shot preannotation was very effective, with an F1 score of about 0.82 for the best model. We also added a 12,000-token subcorpus on cars and other motor vehicles. We trained and released a transformer-based NER tagger for Hungarian using the annotation in the new corpus version; it performs similarly to an identical model trained on the original version of the corpus.
Anthology ID:
2022.lrec-1.203
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1907–1916
URL:
https://aclanthology.org/2022.lrec-1.203
Cite (ACL):
Attila Novák and Borbála Novák. 2022. NerKor+Cars-OntoNotes++. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1907–1916, Marseille, France. European Language Resources Association.
Cite (Informal):
NerKor+Cars-OntoNotes++ (Novák & Novák, LREC 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-1/2022.lrec-1.203.pdf
Code
 ppke-nlpg/nytk-nerkor-cars-ontonotespp
Data
DaN+, GENIA, NNE