Context-Aware Word Segmentation for Chinese Real-World Discourse

Kaiyu Huang, Junpeng Liu, Jingxiang Cao, Degen Huang


Abstract
Previous neural approaches have made significant progress on Chinese word segmentation (CWS) as a sentence-level task, but they suffer from limitations in real-world scenarios. In this paper, we address this issue with a context-aware method and optimize the solution at the document level. We propose a three-step strategy to improve performance on discourse CWS. First, the method uses an auxiliary segmenter to remedy the limitations of the pre-segmenter. Then, a context-aware algorithm computes the confidence of each split. Finally, the maximum-probability path is reconstructed using this algorithm. In addition, to evaluate performance on discourse, we build a new benchmark consisting of recent news and Chinese medical articles. Extensive experiments on this benchmark show that the proposed method achieves competitive performance in a document-level, real-world CWS scenario.
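The abstract does not spell out how the maximum probability path is computed. As a rough illustration only, the sketch below shows one common way to pick the highest-scoring segmentation from per-word confidence scores, using dynamic programming over candidate words. The function name `best_segmentation`, the `confidence` table, and the `max_len` and `unk` parameters are all hypothetical and are not taken from the paper.

```python
import math

def best_segmentation(text, confidence, max_len=4, unk=1e-6):
    """Return the segmentation of `text` whose words have the highest
    product of confidences (computed as a sum of logs).

    `confidence` maps candidate words to scores in (0, 1]. This table is an
    assumption for illustration, not the paper's confidence model.
    """
    n = len(text)
    # best[i] = (best log-score for text[:i], start index of the last word)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            # Unknown single characters get a small fallback score so that
            # at least one path always exists; unknown longer spans are skipped.
            p = confidence.get(word, unk if len(word) == 1 else 0.0)
            if p <= 0.0:
                continue
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    return words[::-1]

# Hypothetical confidences for an ambiguous string.
scores = {"研究": 0.9, "研究生": 0.4, "生命": 0.9, "命": 0.1, "起源": 0.9}
print(best_segmentation("研究生命起源", scores))
```

In a document-level setting, one could imagine the confidence table being built from evidence gathered across the whole document rather than a single sentence, but how the paper actually does this is not stated in the abstract.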
Anthology ID:
2020.iwdp-1.5
Volume:
Proceedings of the Second International Workshop of Discourse Processing
Month:
December
Year:
2020
Address:
Suzhou, China
Venue:
iwdp
Publisher:
Association for Computational Linguistics
Pages:
22–28
URL:
https://aclanthology.org/2020.iwdp-1.5
Cite (ACL):
Kaiyu Huang, Junpeng Liu, Jingxiang Cao, and Degen Huang. 2020. Context-Aware Word Segmentation for Chinese Real-World Discourse. In Proceedings of the Second International Workshop of Discourse Processing, pages 22–28, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Context-Aware Word Segmentation for Chinese Real-World Discourse (Huang et al., iwdp 2020)
PDF:
https://preview.aclanthology.org/paclic-22-ingestion/2020.iwdp-1.5.pdf
Dataset:
 2020.iwdp-1.5.Dataset.rar