Integration of Named Entity Recognition and Sentence Segmentation on Ancient Chinese based on Siku-BERT

Sijia Ge


Abstract
Sentence segmentation and named entity recognition (NER) are two significant tasks in ancient Chinese processing, since punctuation and named entity information are important for further research on ancient classics. Both are in essence sequence labeling tasks, so the labels for the two tasks can be assigned to each token simultaneously. Our work evaluates whether such a unified approach performs better than tagging the labels of each task separately with a BERT-based model. The paper adopts a BERT-based model pre-trained on ancient Chinese text and conducts experiments on the Zuozhuan text. The results show no difference between the two tagging approaches when the types of entities and punctuation are not considered. The ablation experiments show that punctuation tokens in the text are useful for NER, and that finer tag sets, such as distinguishing tokens at the end of an entity from those in the middle of an entity, offer a useful feature for NER but negatively affect sentence segmentation under unified tagging.
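The idea of tagging both tasks simultaneously can be illustrated with a minimal sketch. The tag names below (B/E/O for entity positions, N/P for "no punctuation follows" and "punctuation follows"), the `|` separator, the helper names `unify_tags` and `split_tags`, and the example sentence are all assumptions made for illustration. The abstract does not specify the paper's actual tag set or data format.

```python
from typing import List, Tuple


def unify_tags(ner_tags: List[str], seg_tags: List[str], sep: str = "|") -> List[str]:
    """Combine per-token NER and segmentation labels into one label per token."""
    if len(ner_tags) != len(seg_tags):
        raise ValueError("tag sequences must have the same length")
    return [f"{n}{sep}{s}" for n, s in zip(ner_tags, seg_tags)]


def split_tags(unified: List[str], sep: str = "|") -> Tuple[List[str], List[str]]:
    """Recover the two task-specific label sequences from unified labels."""
    ner: List[str] = []
    seg: List[str] = []
    for tag in unified:
        n, s = tag.split(sep, 1)
        ner.append(n)
        seg.append(s)
    return ner, seg


# Made-up example, one label per character (hypothetical tag set).
tokens = list("晉侯伐秦")
ner_tags = ["B", "E", "O", "B"]  # entity begin / entity end / outside / entity begin
seg_tags = ["N", "N", "N", "P"]  # P marks a token followed by punctuation
unified = unify_tags(ner_tags, seg_tags)

for token, tag in zip(tokens, unified):
    print(token, tag)
```

With a unified label set like this, a single token classifier predicts one combined label per character, and `split_tags` recovers the two label sequences so that each task can be scored separately. The separate-tagging baseline would instead train one classifier on `ner_tags` and another on `seg_tags`.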
Anthology ID:
2022.nlp4dh-1.21
Volume:
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities
Month:
November
Year:
2022
Address:
Taipei, Taiwan
Venue:
NLP4DH
Publisher:
Association for Computational Linguistics
Pages:
167–173
URL:
https://aclanthology.org/2022.nlp4dh-1.21
Cite (ACL):
Sijia Ge. 2022. Integration of Named Entity Recognition and Sentence Segmentation on Ancient Chinese based on Siku-BERT. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, pages 167–173, Taipei, Taiwan. Association for Computational Linguistics.
Cite (Informal):
Integration of Named Entity Recognition and Sentence Segmentation on Ancient Chinese based on Siku-BERT (Ge, NLP4DH 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.nlp4dh-1.21.pdf