Abstract
Word segmentation is a fundamental step in understanding the Chinese language. Previous neural approaches to unsupervised Chinese Word Segmentation (CWS) exploit only shallow semantic information, which can miss important context. Large-scale pre-trained language models (PLMs) have achieved great success in many areas because of their ability to capture deep contextual semantic relations. In this paper, we propose to take advantage of the deep semantic information embedded in a PLM (e.g., BERT) in a self-training manner, which iteratively probes and transforms the semantic information in the PLM into explicit word segmentation ability. Extensive experimental results show that our proposed approach achieves state-of-the-art F1 scores on two CWS benchmark datasets.
- Anthology ID:
- 2022.findings-acl.310
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2022
- Month:
- May
- Year:
- 2022
- Address:
- Dublin, Ireland
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3935–3940
- URL:
- https://aclanthology.org/2022.findings-acl.310
- DOI:
- 10.18653/v1/2022.findings-acl.310
- Cite (ACL):
- Wei Li, Yuhan Song, Qi Su, and Yanqiu Shao. 2022. Unsupervised Chinese Word Segmentation with BERT Oriented Probing and Transformation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3935–3940, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal):
- Unsupervised Chinese Word Segmentation with BERT Oriented Probing and Transformation (Li et al., Findings 2022)
- PDF:
- https://preview.aclanthology.org/paclic-22-ingestion/2022.findings-acl.310.pdf
- Code:
- liweitj47/bert_unsupervised_word_segmentation
- Data:
- 10,000 People - Human Pose Recognition Data
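
The abstract reports results as F1 scores on CWS benchmarks. As a rough illustration only, the sketch below shows how word-level precision, recall, and F1 are commonly computed for word segmentation: each word is mapped to its character span, and a predicted word counts as correct only if both of its boundaries match a gold word. The function names `to_spans` and `segmentation_f1` are hypothetical, and this is an assumption about the usual metric, not the authors' evaluation script.

```python
def to_spans(words):
    """Map a list of words to a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold_words, pred_words):
    """Return (precision, recall, f1) for one segmented sentence.

    Assumes both segmentations cover the same underlying character string.
    """
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)  # words whose two boundaries both match
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["我", "喜欢", "北京"]
    pred = ["我", "喜", "欢", "北京"]
    print(segmentation_f1(gold, pred))
```

In this example two of the four predicted words match the three gold words, so precision is 2/4 and recall is 2/3. Benchmark scores are typically aggregated over all sentences in the test set rather than averaged per sentence; that aggregation is omitted here for brevity.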