Unsupervised Chinese Word Segmentation with BERT Oriented Probing and Transformation

Wei Li, Yuhan Song, Qi Su, Yanqiu Shao


Abstract
Word segmentation is a fundamental step in understanding the Chinese language. Previous neural approaches to unsupervised Chinese Word Segmentation (CWS) exploit only shallow semantic information, which can miss important context. Large-scale pre-trained language models (PLMs) have achieved great success in many areas because of their ability to capture deep contextual semantic relations. In this paper, we propose to take advantage of the deep semantic information embedded in a PLM (e.g., BERT) in a self-training manner, which iteratively probes and transforms the semantic information in the PLM into explicit word segmentation ability. Extensive experimental results show that our proposed approach achieves state-of-the-art F1 scores on two CWS benchmark datasets.
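The abstract reports F1 scores but does not say how they are computed. As a minimal sketch, assuming the common word-level convention in which a predicted word counts as correct only when both of its boundaries match the gold segmentation, the metric could be computed as below. The function names `to_spans` and `segmentation_f1` and the example sentence are illustrative assumptions, not taken from the paper or its released code.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def segmentation_f1(gold_words, pred_words):
    """Word-level F1: a predicted word is correct only if its span is in the gold set.

    Assumption: this is the usual CWS scoring convention, not a detail
    confirmed by the abstract.
    """
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction merges the last two gold words.
gold = ["我", "喜欢", "自然", "语言"]
pred = ["我", "喜欢", "自然语言"]
score = segmentation_f1(gold, pred)
print(round(score, 4))
```

In this example two of the three predicted words match gold spans, so precision is 2/3 and recall is 2/4, which gives an F1 of 4/7.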
Anthology ID:
2022.findings-acl.310
Volume:
Findings of the Association for Computational Linguistics: ACL 2022
Month:
May
Year:
2022
Address:
Dublin, Ireland
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3935–3940
URL:
https://aclanthology.org/2022.findings-acl.310
DOI:
10.18653/v1/2022.findings-acl.310
Cite (ACL):
Wei Li, Yuhan Song, Qi Su, and Yanqiu Shao. 2022. Unsupervised Chinese Word Segmentation with BERT Oriented Probing and Transformation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3935–3940, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Chinese Word Segmentation with BERT Oriented Probing and Transformation (Li et al., Findings 2022)
PDF:
https://preview.aclanthology.org/paclic-22-ingestion/2022.findings-acl.310.pdf
Software:
 2022.findings-acl.310.software.zip
Code
 liweitj47/bert_unsupervised_word_segmentation