Changwei Xu


2021

基于大规模语料库的《古籍汉字分级字表》研究(The Formulation of The graded Chinese character list of ancient books Based on Large-scale Corpus)
Changwei Xu (许长伟) | Minxuan Feng (冯敏萱) | Bin Li (李斌) | Yiguo Yuan (袁义国)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

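The Graded Chinese Character List of Ancient Books (《古籍汉字分级字表》) is a graded character list built from a large-scale corpus of ancient texts to help learners read ancient documents. The list fills a gap in research on character lists for ancient books and grades characters by learning priority; it currently contains 105 first-level characters, 340 second-level characters, and 555 third-level characters. This paper describes the main basis and basic steps of the list's development, and verifies the reasonableness of its character selection by comparing it with the traditional literacy primers known as the "三百千" (the Three Character Classic, the Hundred Family Surnames, and the Thousand Character Classic) and with the List of Frequently Used Characters in Modern Chinese (《现代汉语常用字表》). The list helps learners master the most common characters in ancient texts first and improve their ability to read ancient books, thereby promoting the inheritance and development of fine traditional Chinese culture.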

2020

Integration of Automatic Sentence Segmentation and Lexical Analysis of Ancient Chinese based on BiLSTM-CRF Model
Ning Cheng | Bin Li | Liming Xiao | Changwei Xu | Sijia Ge | Xingyue Hao | Minxuan Feng
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

The basic tasks of ancient Chinese information processing include automatic sentence segmentation, word segmentation, part-of-speech tagging, and named entity recognition. Because many ancient books are unpunctuated, tasks such as lexical analysis must build on sentence segmentation. However, step-by-step processing tends to propagate errors from one level to the next. This paper designs and implements an integrated annotation system for sentence segmentation and lexical analysis. A BiLSTM-CRF neural network model is used to evaluate generalization ability and the performance of sentence segmentation and lexical analysis at different label levels on four test sets spanning different eras. The results show that, for ancient Chinese, the integrated method improves the F1-score of sentence segmentation, word segmentation, and part-of-speech tagging. Across the test sets, the F1-score of sentence segmentation reached 78.95%, with an average increase of 3.5%; the F1-score of word segmentation reached 85.73%, with an average increase of 0.18%; and the F1-score of part-of-speech tagging reached 72.65%, with an average increase of 0.35%.