A Method to Generate a Machine-Labeled Data for Biomedical Named Entity Recognition with Various Sub-Domains

Juae Kim, Sunjae Kwon, Youngjoong Ko, Jungyun Seo


Abstract
Biomedical Named Entity (NE) recognition is a core technique for various works in the biomedical domain. In previous studies, using machine learning algorithm shows better performance than dictionary-based and rule-based approaches because there are too many terminological variations of biomedical NEs and new biomedical NEs are constantly generated. To achieve the high performance with a machine-learning algorithm, good-quality corpora are required. However, it is difficult to obtain the good-quality corpora because an-notating a biomedical corpus for ma-chine-learning is extremely time-consuming and costly. In addition, most previous corpora are insufficient for high-level tasks because they cannot cover various domains. Therefore, we propose a method for generating a large amount of machine-labeled data that covers various domains. To generate a large amount of machine-labeled data, firstly we generate an initial machine-labeled data by using a chunker and MetaMap. The chunker is developed to extract only biomedical NEs with manually annotated data. MetaMap is used to annotate the category of bio-medical NE. Then we apply the self-training approach to bootstrap the performance of initial machine-labeled data. In our experiments, the biomedical NE recognition system that is trained with our proposed machine-labeled data achieves much high performance. As a result, our system outperforms biomedical NE recognition system that using MetaMap only with 26.03%p improvements on F1-score.
Anthology ID:
W17-5807
Volume:
Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017)
Month:
November
Year:
2017
Address:
Taipei, Taiwan
Venue:
WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
47–51
Language:
URL:
https://aclanthology.org/W17-5807
DOI:
Bibkey:
Cite (ACL):
Juae Kim, Sunjae Kwon, Youngjoong Ko, and Jungyun Seo. 2017. A Method to Generate a Machine-Labeled Data for Biomedical Named Entity Recognition with Various Sub-Domains. In Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017), pages 47–51, Taipei, Taiwan. Association for Computational Linguistics.
Cite (Informal):
A Method to Generate a Machine-Labeled Data for Biomedical Named Entity Recognition with Various Sub-Domains (Kim et al., 2017)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingestion-script-update/W17-5807.pdf