Abstract
This study evaluates the three most popular word segmentation tools on a large Traditional Chinese corpus in terms of efficiency, resource consumption, and cost. Specifically, we compare the performance of Jieba, CKIP, and MONPA on word segmentation, part-of-speech tagging, and named entity recognition through extensive experiments. Experimental results show that MONPA, using a GPU for batch segmentation, can greatly reduce the processing time of massive datasets. In addition, its features, such as word segmentation, part-of-speech tagging, and named entity recognition, are beneficial to downstream applications.
- Anthology ID:
- 2022.rocling-1.24
- Volume:
- Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)
- Month:
- November
- Year:
- 2022
- Address:
- Taipei, Taiwan
- Venue:
- ROCLING
- Publisher:
- The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)
- Pages:
- 193–199
- Language:
- Chinese
- URL:
- https://aclanthology.org/2022.rocling-1.24
- Cite (ACL):
- Wen-Chao Yeh, Yu-Lun Hsieh, Yung-Chun Chang, and Wen-Lian Hsu. 2022. Multifaceted Assessments of Traditional Chinese Word Segmentation Tool on Large Corpora. In Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022), pages 193–199, Taipei, Taiwan. The Association for Computational Linguistics and Chinese Language Processing (ACLCLP).
- Cite (Informal):
- Multifaceted Assessments of Traditional Chinese Word Segmentation Tool on Large Corpora (Yeh et al., ROCLING 2022)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/2022.rocling-1.24.pdf