Chinese Sequence Labeling with Semi-Supervised Boundary-Aware Language Model Pre-training

Longhui Zhang, Dingkun Long, Meishan Zhang, Yanzhao Zhang, Pengjun Xie, Min Zhang


Abstract
Chinese sequence labeling tasks are sensitive to word boundaries. Although pretrained language models (PLM) have achieved considerable success in these tasks, current PLMs rarely consider boundary information explicitly. An exception to this is BABERT, which incorporates unsupervised statistical boundary information into Chinese BERT’s pre-training objectives. Building upon this approach, we input supervised high-quality boundary information to enhance BABERT’s learning, developing a semi-supervised boundary-aware PLM. To assess PLMs’ ability to encode boundaries, we introduce a novel “Boundary Information Metric” that is both simple and effective. This metric allows comparison of different PLMs without task-specific fine-tuning. Experimental results on Chinese sequence labeling datasets demonstrate that the improved BABERT version outperforms the vanilla version, not only in these tasks but also in broader Chinese natural language understanding tasks. Additionally, our proposed metric offers a convenient and accurate means of evaluating PLMs’ boundary awareness.
Anthology ID:
2024.lrec-main.282
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
3179–3191
Language:
URL:
https://aclanthology.org/2024.lrec-main.282
DOI:
Bibkey:
Cite (ACL):
Longhui Zhang, Dingkun Long, Meishan Zhang, Yanzhao Zhang, Pengjun Xie, and Min Zhang. 2024. Chinese Sequence Labeling with Semi-Supervised Boundary-Aware Language Model Pre-training. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 3179–3191, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Chinese Sequence Labeling with Semi-Supervised Boundary-Aware Language Model Pre-training (Zhang et al., LREC-COLING 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2024.lrec-main.282.pdf