Abstract
Traditionally, word segmentation (WS) adopts the single-grained formalism, where a sentence corresponds to a single word sequence. However, Sproat et al. (1997) show that the inter-native-speaker consistency ratio over Chinese word boundaries is only 76%, indicating single-grained WS (SWS) imposes unnecessary challenges on both manual annotation and statistical modeling. Moreover, WS results of different granularities can be complementary and beneficial for high-level applications. This work proposes and addresses multi-grained WS (MWS). We build a large-scale pseudo MWS dataset for model training and tuning by leveraging the annotation heterogeneity of three SWS datasets. Then we manually annotate 1,500 test sentences with true MWS annotations. Finally, we propose three benchmark approaches by casting MWS as constituent parsing and sequence labeling. Experiments and analysis lead to many interesting findings.- Anthology ID:
- D17-1072
- Volume:
- Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
- Month:
- September
- Year:
- 2017
- Address:
- Copenhagen, Denmark
- Venue:
- EMNLP
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 692–703
- Language:
- URL:
- https://aclanthology.org/D17-1072
- DOI:
- 10.18653/v1/D17-1072
- Cite (ACL):
- Chen Gong, Zhenghua Li, Min Zhang, and Xinzhou Jiang. 2017. Multi-Grained Chinese Word Segmentation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 692–703, Copenhagen, Denmark. Association for Computational Linguistics.
- Cite (Informal):
- Multi-Grained Chinese Word Segmentation (Gong et al., EMNLP 2017)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/D17-1072.pdf