基于预训练语言模型的繁体古文自动句读研究(Automatic Traditional Ancient Chinese Texts Segmentation and Punctuation Based on Pre-training Language Model)

Xuemei Tang (唐雪梅), Qi Su (苏祺), Jun Wang (王军), Yuhang Chen (陈雨航), Hao Yang (杨浩)


Abstract
Ancient Chinese texts that have not been editorially processed contain no punctuation and do not match modern reading habits; adding sentence breaks and punctuation to them facilitates reading, research, and publication. This paper proposes a framework for automatic sentence segmentation and punctuation of traditional-script ancient Chinese based on a pre-trained language model. We compile a corpus of roughly one billion characters of traditional-script ancient Chinese and use it to incrementally (further) pre-train the language model, on top of which automatic sentence segmentation and punctuation are implemented. Experiments show that, after incremental training on this large-scale corpus, the language model acquires better semantic representations of ancient Chinese and improves both automatic sentence segmentation and automatic punctuation of traditional ancient texts. With the incrementally trained model, the F1 score reaches 95.03% for sentence segmentation and 80.18% for punctuation, 1.83% and 2.21% higher respectively than with a language model that was not incrementally trained. To address the inefficiency of existing document-level segmentation schemes, we improve on the serial sliding-window approach of prior work, raising efficiency to some extent, and further propose a new parallel sliding-window approach that segments long texts both efficiently and accurately.
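The abstract contrasts a serial sliding window with a parallel one for long texts. As a rough illustration of that idea only, the sketch below treats punctuation restoration as character-level token classification with a BERT-style encoder and scores all overlapping windows of a long text in a single batched forward pass. The model name, window and stride sizes, label set, and overlap-resolution rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of "parallel sliding window" punctuation: each character
# gets a label ("O" = no punctuation, otherwise the mark that follows it), and
# all overlapping windows are scored in one batched forward pass instead of
# one window at a time. Model name, sizes, and label set are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "，", "。", "、", "；", "：", "？", "！"]  # assumed label set
MODEL_NAME = "bert-base-chinese"  # stand-in; the paper further pre-trains its
                                  # own model on ~1B traditional characters

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS))
model.eval()

@torch.no_grad()
def punctuate(text: str, window: int = 254, stride: int = 128) -> str:
    if not text:
        return text
    chars = list(text)
    starts = list(range(0, len(chars), stride))
    windows = [chars[s:s + window] for s in starts]

    # One batched forward pass over all windows (the "parallel" part);
    # a serial scheme would re-run the model once per window.
    enc = tokenizer(windows, is_split_into_words=True, padding=True,
                    truncation=True, max_length=window + 2,
                    return_tensors="pt")
    preds = model(**enc).logits.argmax(-1)  # (n_windows, seq_len)

    # Where windows overlap, keep the prediction from the window in which
    # the character is most central, since edge positions lack context.
    best_dist = [float("inf")] * len(chars)
    label_ids = [0] * len(chars)
    for w, s in enumerate(starts):
        for tok_pos, word_id in enumerate(enc.word_ids(batch_index=w)):
            if word_id is None:  # special tokens [CLS]/[SEP]/padding
                continue
            dist = abs(word_id - len(windows[w]) / 2)
            if dist < best_dist[s + word_id]:
                best_dist[s + word_id] = dist
                label_ids[s + word_id] = int(preds[w, tok_pos])

    return "".join(ch + (LABELS[i] if LABELS[i] != "O" else "")
                   for ch, i in zip(chars, label_ids))
```

For pure sentence segmentation (judou) the same sketch applies with a two-label scheme (break / no break); in either case, batching all windows into one forward pass is what removes the serial dependency that makes document-level inference slow.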
Anthology ID: 2021.ccl-1.61
Volume: Proceedings of the 20th Chinese National Conference on Computational Linguistics
Month: August
Year: 2021
Address: Huhhot, China
Venue: CCL
Publisher: Chinese Information Processing Society of China
Pages: 678–688
Language: Chinese
URL: https://aclanthology.org/2021.ccl-1.61
Cite (ACL):
Xuemei Tang, Qi Su, Jun Wang, Yuhang Chen, and Hao Yang. 2021. 基于预训练语言模型的繁体古文自动句读研究(Automatic Traditional Ancient Chinese Texts Segmentation and Punctuation Based on Pre-training Language Model). In Proceedings of the 20th Chinese National Conference on Computational Linguistics, pages 678–688, Huhhot, China. Chinese Information Processing Society of China.
Cite (Informal):
基于预训练语言模型的繁体古文自动句读研究(Automatic Traditional Ancient Chinese Texts Segmentation and Punctuation Based on Pre-training Language Model) (Tang et al., CCL 2021)
PDF: https://preview.aclanthology.org/auto-file-uploads/2021.ccl-1.61.pdf