Abstract
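This paper uses corpus-based and statistical methods to carry out a quantitative study of the features of Chinese linguistic styles (语体) and then applies the results to automatic style classification. First, one-way analysis of variance (ANOVA) is used to describe how well each stylistic feature distinguishes between styles. Second, the linguistic elements that show discriminative power are used to fit a logistic regression model, which quantifies the forms of stylistic expression and shows how much each feature contributes to the makeup of a style; cluster analysis then yields a categorical classification system for styles. Finally, representative machine learning models are used as classifiers to examine how the structure of different feature combinations affects automatic style classification. The best result is obtained with the combined feature set "word 2n + part-of-speech 2n + punctuation 2n + linguistic features", on which the random forest model reaches an accuracy of 97.25%.
- Anthology ID: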
- 2021.ccl-1.37
- Volume:
- Proceedings of the 20th Chinese National Conference on Computational Linguistics
- Month:
- August
- Year:
- 2021
- Address:
- Huhhot, China
- Editors:
- Sheng Li (李生), Maosong Sun (孙茂松), Yang Liu (刘洋), Hua Wu (吴华), Kang Liu (刘康), Wanxiang Che (车万翔), Shizhu He (何世柱), Gaoqi Rao (饶高琦)
- Venue:
- CCL
- Publisher:
- Chinese Information Processing Society of China
- Pages:
- 398–412
- Language:
- Chinese
- URL:
- https://aclanthology.org/2021.ccl-1.37
- Cite (ACL):
- Qinqing Tai and Gaoqi Rao. 2021. 汉语语体特征的计量与分类研究(A study on the measurement and classification of Chinese stylistic features). In Proceedings of the 20th Chinese National Conference on Computational Linguistics, pages 398–412, Huhhot, China. Chinese Information Processing Society of China.
- Cite (Informal):
- 汉语语体特征的计量与分类研究(A study on the measurement and classification of Chinese stylistic features) (Tai & Rao, CCL 2021)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-1/2021.ccl-1.37.pdf