Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic

Johan Krause, Igor Shapiro, Tarek Saier, Michael Färber


Abstract
Applications based on scholarly data are of ever-increasing importance. This results in disadvantages for areas where high-quality data and compatible systems are not available, such as non-English publications. To advance the mitigation of this imbalance, we use Cyrillic script publications from the CORE collection to create a high-quality data set for metadata extraction. We utilize our data for training and evaluating sequence labeling models to extract title and author information. Retraining GROBID on our data, we observe significant improvements in terms of precision and recall and achieve even better results with a self-developed model. We make our data set covering over 15,000 publications as well as our source code freely available.
Anthology ID:
2021.sdp-1.8
Volume:
Proceedings of the Second Workshop on Scholarly Document Processing
Month:
June
Year:
2021
Address:
Online
Editors:
Iz Beltagy, Arman Cohan, Guy Feigenblat, Dayne Freitag, Tirthankar Ghosal, Keith Hall, Drahomira Herrmannova, Petr Knoth, Kyle Lo, Philipp Mayr, Robert M. Patton, Michal Shmueli-Scheuer, Anita de Waard, Kuansan Wang, Lucy Lu Wang
Venue:
sdp
Publisher:
Association for Computational Linguistics
Pages:
66–72
URL:
https://aclanthology.org/2021.sdp-1.8
DOI:
10.18653/v1/2021.sdp-1.8
Cite (ACL):
Johan Krause, Igor Shapiro, Tarek Saier, and Michael Färber. 2021. Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic. In Proceedings of the Second Workshop on Scholarly Document Processing, pages 66–72, Online. Association for Computational Linguistics.
Cite (Informal):
Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic (Krause et al., sdp 2021)
PDF:
https://preview.aclanthology.org/nschneid-patch-1/2021.sdp-1.8.pdf
Code:
illdepence/sdp2021