MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)

Simone Tedeschi; Roberto Navigli

doi:10.18653/v1/2022.findings-naacl.60

Abstract

Named Entity Recognition (NER) is the task of identifying named entities in texts and classifying them through specific semantic categories, a process which is crucial for a wide range of NLP applications. Current datasets for NER focus mainly on coarse-grained entity types, tend to consider a single textual genre and to cover a narrow set of languages, thus limiting the general applicability of NER systems. In this work, we design a new methodology for automatically producing NER annotations, and address the aforementioned limitations by introducing a novel dataset that covers 10 languages, 15 NER categories and 2 textual genres. We also introduce a manually-annotated test set, and extensively evaluate the quality of our novel dataset on both this new test set and standard benchmarks for NER. In addition, in our dataset, we include: i) disambiguation information to enable the development of multilingual entity linking systems, and ii) image URLs to encourage the creation of multimodal systems. We release our dataset at https://github.com/Babelscape/multinerd.
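
To make the task definition above concrete, the following is a minimal sketch of how token-level NER tags can be grouped into classified entity mentions. It assumes a generic BIO tagging scheme; the function name `bio_to_spans`, the example sentence, and the `PER`/`LOC` labels are illustrative assumptions and are not taken from the dataset's documentation.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, category) pairs.

    Hypothetical helper for illustration only; it assumes tags of the
    form "B-X", "I-X" or "O".
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent "I-" tag: close any open entity
            # and treat this token as outside.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    # Close an entity that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Toy sentence with assumed labels (not drawn from the dataset).
tokens = ["Marie", "Curie", "was", "born", "in", "Warsaw", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
```

A fine-grained dataset changes only the label inventory in such a scheme; the grouping logic stays the same.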

Anthology ID: 2022.findings-naacl.60
Volume: Findings of the Association for Computational Linguistics: NAACL 2022
Month: July
Year: 2022
Address: Seattle, United States
Editors: Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 801–812
URL: https://preview.aclanthology.org/add-emnlp-2024-awards/2022.findings-naacl.60/
DOI: 10.18653/v1/2022.findings-naacl.60
Cite (ACL): Simone Tedeschi and Roberto Navigli. 2022. MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation). In Findings of the Association for Computational Linguistics: NAACL 2022, pages 801–812, Seattle, United States. Association for Computational Linguistics.
Cite (Informal): MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation) (Tedeschi & Navigli, Findings 2022)
PDF: https://preview.aclanthology.org/add-emnlp-2024-awards/2022.findings-naacl.60.pdf
Video: https://preview.aclanthology.org/add-emnlp-2024-awards/2022.findings-naacl.60.mp4
Code: babelscape/multinerd
Data: CoNLL 2002, WikiANN, WikiNEuRal
