Abstract
Previous neural approaches have achieved significant progress on Chinese word segmentation (CWS) as a sentence-level task, but they suffer from limitations in real-world scenarios. In this paper, we address this issue with a context-aware method and optimize the solution at the document level. We propose a three-step strategy to improve performance on discourse-level CWS. First, the method uses an auxiliary segmenter to remedy the limitations of the pre-segmenter. Then, a context-aware algorithm computes the confidence of each split. Finally, the maximum-probability path is reconstructed via this algorithm. In addition, to evaluate performance on discourse, we build a new benchmark consisting of the latest news and Chinese medical articles. Extensive experiments on this benchmark show that our proposed method achieves competitive performance in a document-level, real-world CWS scenario.
- Anthology ID:
- 2020.iwdp-1.5
- Volume:
- Proceedings of the Second International Workshop of Discourse Processing
- Month:
- December
- Year:
- 2020
- Address:
- Suzhou, China
- Venue:
- iwdp
- Publisher:
- Association for Computational Linguistics
- Pages:
- 22–28
- URL:
- https://aclanthology.org/2020.iwdp-1.5
- Cite (ACL):
- Kaiyu Huang, Junpeng Liu, Jingxiang Cao, and Degen Huang. 2020. Context-Aware Word Segmentation for Chinese Real-World Discourse. In Proceedings of the Second International Workshop of Discourse Processing, pages 22–28, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Context-Aware Word Segmentation for Chinese Real-World Discourse (Huang et al., iwdp 2020)
- PDF:
- https://preview.aclanthology.org/paclic-22-ingestion/2020.iwdp-1.5.pdf
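
The abstract's last two steps (scoring each split, then reconstructing the maximum-probability path) can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the abstract does not say how confidence is computed, so the sketch assumes it is the relative frequency of each candidate word across all segmentations of the document produced by the pre-segmenter and the auxiliary segmenter. The names `split_confidence`, `best_path`, `max_len`, and `floor` are hypothetical.

```python
import math
from collections import Counter


def split_confidence(segmentations):
    """Assumed confidence: relative frequency of each candidate word
    across all segmented sentences of one document."""
    counts = Counter(word for sent in segmentations for word in sent)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def best_path(sentence, confidence, max_len=4, floor=1e-8):
    """Return the segmentation of `sentence` with the highest product of
    word confidences, found by dynamic programming over end positions."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)  # best log-score of any path ending at i
    back = [0] * (n + 1)          # start index of the last word on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            if word in confidence:
                p = confidence[word]
            elif end - start == 1:
                p = floor  # unseen single characters stay reachable
            else:
                continue   # unseen multi-character spans are not candidates
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the sentence.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(sentence[start:end])
        end = start
    return words[::-1]


# Toy document: two competing segmentations of one sentence plus context.
document = [
    ["研究", "生命", "起源"],   # hypothetical pre-segmenter output
    ["研究生", "命", "起源"],   # hypothetical auxiliary-segmenter output
    ["生命", "科学"],
    ["研究", "方法"],
]
confidence = split_confidence(document)
print(best_path("研究生命起源", confidence))  # ['研究', '生命', '起源']
```

In this toy document, "研究" and "生命" occur in other sentences, so the path through them outscores the one through "研究生" and "命". That is the sense in which document-level context can override a sentence-level decision.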