CWSeg: An Efficient and General Approach to Chinese Word Segmentation

Dedong Li, Rui Zhao, Fei Tan


Abstract
In this work, we report our efforts to advance Chinese Word Segmentation for rapid deployment across different applications. Segmentation methods based on pre-trained language models (PLMs) have achieved state-of-the-art (SOTA) performance, but this paradigm also poses deployment challenges. These include balancing performance against cost, resolving segmentation ambiguity caused by domain diversity and vague word boundaries, and supporting multi-grained segmentation. In this context, we propose a simple yet effective approach, CWSeg, which augments PLM-based schemes with cohort training and versatile decoding strategies. Extensive experiments on benchmark datasets demonstrate the efficiency and generalization of our approach. The corresponding segmentation system has also been implemented for practical use, and a demo has been recorded.
Anthology ID:
2023.acl-industry.1
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Sunayana Sitaram, Beata Beigman Klebanov, Jason D Williams
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1–10
URL:
https://aclanthology.org/2023.acl-industry.1
DOI:
10.18653/v1/2023.acl-industry.1
Cite (ACL):
Dedong Li, Rui Zhao, and Fei Tan. 2023. CWSeg: An Efficient and General Approach to Chinese Word Segmentation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 1–10, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
CWSeg: An Efficient and General Approach to Chinese Word Segmentation (Li et al., ACL 2023)
PDF:
https://preview.aclanthology.org/naacl24-info/2023.acl-industry.1.pdf