Abstract
Biomedical named entity (NE) recognition is a core technique for many tasks in the biomedical domain. In previous studies, machine learning algorithms have outperformed dictionary-based and rule-based approaches because biomedical NEs have many terminological variations and new biomedical NEs are constantly being created. Achieving high performance with a machine learning algorithm requires good-quality corpora. However, such corpora are difficult to obtain because annotating a biomedical corpus for machine learning is extremely time-consuming and costly. In addition, most previous corpora are insufficient for high-level tasks because they do not cover various domains. Therefore, we propose a method for generating a large amount of machine-labeled data that covers various domains. We first generate initial machine-labeled data using a chunker and MetaMap. The chunker, developed with manually annotated data, extracts only biomedical NEs, and MetaMap annotates the category of each biomedical NE. We then apply a self-training approach to improve the quality of the initial machine-labeled data. In our experiments, a biomedical NE recognition system trained on the proposed machine-labeled data achieves high performance: it outperforms a biomedical NE recognition system that uses MetaMap only by 26.03 percentage points in F1-score.
- Anthology ID:
- W17-5807
- Volume:
- Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017)
- Month:
- November
- Year:
- 2017
- Address:
- Taipei, Taiwan
- Venue:
- WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 47–51
- URL:
- https://aclanthology.org/W17-5807
- Cite (ACL):
- Juae Kim, Sunjae Kwon, Youngjoong Ko, and Jungyun Seo. 2017. A Method to Generate a Machine-Labeled Data for Biomedical Named Entity Recognition with Various Sub-Domains. In Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017), pages 47–51, Taipei, Taiwan. Association for Computational Linguistics.
- Cite (Informal):
- A Method to Generate a Machine-Labeled Data for Biomedical Named Entity Recognition with Various Sub-Domains (Kim et al., 2017)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/W17-5807.pdf