Chinese Sequence Labeling with Semi-Supervised Boundary-Aware Language Model Pre-training
Longhui Zhang, Dingkun Long, Meishan Zhang, Yanzhao Zhang, Pengjun Xie, Min Zhang
Abstract
Chinese sequence labeling tasks are sensitive to word boundaries. Although pretrained language models (PLMs) have achieved considerable success in these tasks, current PLMs rarely consider boundary information explicitly. An exception to this is BABERT, which incorporates unsupervised statistical boundary information into Chinese BERT’s pre-training objectives. Building upon this approach, we input supervised high-quality boundary information to enhance BABERT’s learning, developing a semi-supervised boundary-aware PLM. To assess PLMs’ ability to encode boundaries, we introduce a novel “Boundary Information Metric” that is both simple and effective. This metric allows comparison of different PLMs without task-specific fine-tuning. Experimental results on Chinese sequence labeling datasets demonstrate that the improved BABERT version outperforms the vanilla version, not only in these tasks but also in broader Chinese natural language understanding tasks. Additionally, our proposed metric offers a convenient and accurate means of evaluating PLMs’ boundary awareness.
- Anthology ID:
- 2024.lrec-main.282
- Volume:
- Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
- Venues:
- LREC | COLING
- Publisher:
- ELRA and ICCL
- Pages:
- 3179–3191
- URL:
- https://aclanthology.org/2024.lrec-main.282
- Cite (ACL):
- Longhui Zhang, Dingkun Long, Meishan Zhang, Yanzhao Zhang, Pengjun Xie, and Min Zhang. 2024. Chinese Sequence Labeling with Semi-Supervised Boundary-Aware Language Model Pre-training. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 3179–3191, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- Chinese Sequence Labeling with Semi-Supervised Boundary-Aware Language Model Pre-training (Zhang et al., LREC-COLING 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.lrec-main.282.pdf