Abstract
As the number of digitized archival documents increases very rapidly, named entity recognition (NER) in historical documents has become very important for information extraction and data mining. For this task an annotated corpus is needed, which has up to now been missing for Czech. In this paper we present a new annotated data collection for historical NER, composed of Czech historical newspapers. This corpus is freely available for research purposes. For this corpus, we have defined relevant domain-specific named entity types and created an annotation manual for corpus labelling. We further conducted some experiments on this corpus using recurrent neural networks. We experimented with randomly initialized embeddings and static and dynamic fastText word embeddings. We achieved 0.73 F1 score with a bidirectional LSTM model using static fastText embeddings.
- Anthology ID:
- 2020.lrec-1.549
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 4458–4465
- Language:
- English
- URL:
- https://aclanthology.org/2020.lrec-1.549
- Cite (ACL):
- Helena Hubková, Pavel Kral, and Eva Pettersson. 2020. Czech Historical Named Entity Corpus v 1.0. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4458–4465, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Czech Historical Named Entity Corpus v 1.0 (Hubková et al., LREC 2020)
- PDF:
- https://preview.aclanthology.org/nodalida-main-page/2020.lrec-1.549.pdf