Chaofan Wang


2022

Automatic Word Segmentation and Part-of-Speech Tagging of Ancient Chinese Based on BERT Model
Yu Chang | Peng Zhu | Chaoping Wang | Chaofan Wang
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

In recent years, new deep learning methods and pre-trained language models have been emerging in the field of natural language processing (NLP). These methods and models can greatly improve the accuracy of automatic word segmentation and part-of-speech tagging in ancient Chinese research. Among these models, the BERT model achieved remarkable results on the SQuAD-1.1 machine reading comprehension benchmark and outperformed other models on 11 different NLP tests. In this paper, the SIKU-RoBERTa pre-trained language model, built on the high-quality full-text corpus of SiKuQuanShu, has been adopted, and a portion of the ZuoZhuan corpus that has been word-segmented and part-of-speech tagged is used as the training set to build a BERT-based deep network model for word segmentation and POS tagging experiments. In addition, we also use other classical NLP network models for comparative experiments. The results show that with the SIKU-RoBERTa pre-trained language model, the overall prediction accuracy of word segmentation and part-of-speech tagging reaches 93.87% and 88.97% respectively, demonstrating excellent overall performance.
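A common formulation of this task (a sketch of the standard setup, not code from the paper) casts joint word segmentation and POS tagging as character-level sequence labeling: each character receives a combined position-plus-POS tag (e.g. "B-n" for the first character of a noun word), and a BERT-style encoder predicts one such tag per character. The ZuoZhuan example words and the tag names below are hypothetical illustrations:

```python
def to_char_tags(words):
    """Convert (word, pos) pairs into per-character B/I-POS labels,
    the label scheme a token-classification model would be trained on."""
    tags = []
    for word, pos in words:
        for i, _ in enumerate(word):
            prefix = "B" if i == 0 else "I"  # B = word-initial character
            tags.append(f"{prefix}-{pos}")
    return tags

# Hypothetical ZuoZhuan-style fragment: proper noun + verb + proper noun
example = [("鄭伯", "nr"), ("克", "v"), ("段", "nr")]
print(to_char_tags(example))  # → ['B-nr', 'I-nr', 'B-v', 'B-nr']
```

Predicted tag sequences are then decoded back into segmented, tagged words by splitting at each "B-" label.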

2020

融入多尺度特征注意力的胶囊神经网络及其在文本分类中的应用(Incorporating Multi-scale Feature Attention into Capsule Network and its Application in Text Classification)
Chaofan Wang (王超凡) | Shenggen Ju (琚生根) | Jieping Sun (孙界平) | Run Chen (陈润)
Proceedings of the 19th Chinese National Conference on Computational Linguistics

In recent years, capsule networks (Capsnets) have been applied to text classification tasks owing to their strong ability to learn text features. Most existing work treats the extracted n-gram features of a text as equally important, ignoring the fact that the importance of each n-gram feature for a word should be determined by its specific context; this directly affects the model's semantic understanding of the whole text. To address this problem, this paper proposes the multi-scale feature partially connected capsule network (MulPart-Capsnets). The method incorporates multi-scale feature attention into Capsnets: the attention mechanism automatically selects n-gram features at different scales and, through weighted summation, precisely captures rich n-gram features for each word. At the same time, to reduce redundant information transfer between child and parent capsules, this paper also improves the routing algorithm. The effectiveness of the proposed algorithm is validated on seven well-known text classification datasets. Compared with existing work, its performance improves significantly, showing that the algorithm can capture richer n-gram features in text and has a stronger ability to learn text features.
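The weighted summation over n-gram scales described above can be sketched as follows (a minimal illustration of softmax attention over per-scale feature vectors; the function names and the scalar relevance scores are assumptions, not the paper's exact formulation):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def multiscale_attention(features, scores):
    """Attention-weighted sum of per-scale n-gram feature vectors
    for one word.

    features: K vectors of equal dimension, one per n-gram scale
              (e.g. unigram, bigram, trigram windows)
    scores:   K context-dependent relevance scores for this word
    """
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)]

# With equal scores, all scales contribute equally:
print(multiscale_attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))  # → [0.5, 0.5]
```

In the full model the scores themselves would be produced from the word's context by a learned scoring network, so that each word ends up emphasizing the n-gram scales most relevant to it.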