Introducing a Large Corpus of Tokenized Classical Chinese Poems of Tang and Song Dynasties

Chao-Lin Liu, Ti-Yong Zheng, Kuan-Chun Chen, Meng-Han Chung


Abstract
Classical Chinese poems of the Tang and Song dynasties are an important part of the study of Chinese literature. Properly segmenting the verses is an important step toward thoroughly understanding the poems, for both human readers and software agents. Yet, because of limited data availability and the costs of annotation, there are still no known large and useful sources that offer classical Chinese poems with annotated word boundaries. In this project, annotators with a background in Chinese literature labeled 32,399 poems. We analyzed the annotated patterns and conducted inter-rater agreement studies of the annotations. The distributions of the annotated patterns for poem lines are very close to some well-known professional heuristics, i.e., that the 2-2-1, 2-1-2, 2-2-1-2, and 2-2-2-1 patterns are very frequent. The annotators agreed well at the line level, but agreed on the segmentation of a whole poem only 43% of the time. We applied a traditional machine-learning approach to segment the poems and also achieved promising results at the line level. Using the annotated data as the ground truth, these methods could segment only about 18% of the poems completely correctly under favorable conditions. Switching to deep-learning methods helped us achieve better than 30%.
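The pattern notation and the two agreement levels mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes each poem line is stored as a list of tokens; the function names, the data layout, and the sample segmentations are illustrative assumptions, not the corpus format or the authors' evaluation code.

```python
from typing import List

Line = List[str]   # one poem line as a list of tokens (assumed layout)
Poem = List[Line]  # one poem as a list of lines


def pattern(tokens: Line) -> str:
    """Return the segmentation pattern of a line, e.g. '2-2-1'."""
    return "-".join(str(len(tok)) for tok in tokens)


def line_agreement(a: List[Poem], b: List[Poem]) -> float:
    """Fraction of lines that two annotators segmented identically."""
    total = 0
    same = 0
    for poem_a, poem_b in zip(a, b):
        for line_a, line_b in zip(poem_a, poem_b):
            total += 1
            same += int(line_a == line_b)
    return same / total


def poem_agreement(a: List[Poem], b: List[Poem]) -> float:
    """Fraction of poems in which every line was segmented identically."""
    same = sum(int(poem_a == poem_b) for poem_a, poem_b in zip(a, b))
    return same / len(a)


# Hypothetical annotations of two couplets by two annotators.
annotator_a: List[Poem] = [
    [["床前", "明月", "光"], ["疑是", "地上", "霜"]],
    [["白日", "依山", "盡"], ["黃河", "入海", "流"]],
]
annotator_b: List[Poem] = [
    [["床前", "明月", "光"], ["疑是", "地上", "霜"]],
    [["白日", "依", "山盡"], ["黃河", "入海", "流"]],
]

print(pattern(annotator_a[0][0]))                 # pattern of the first line
print(line_agreement(annotator_a, annotator_b))   # line-level agreement
print(poem_agreement(annotator_a, annotator_b))   # poem-level agreement
```

A single disagreeing line is enough to make a whole poem count as a disagreement, which is one reason poem-level agreement can be much lower than line-level agreement.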
Anthology ID:
2022.nlp4dh-1.17
Volume:
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities
Month:
November
Year:
2022
Address:
Taipei, Taiwan
Editors:
Mika Hämäläinen, Khalid Alnajjar, Niko Partanen, Jack Rueter
Venue:
NLP4DH
Publisher:
Association for Computational Linguistics
Pages:
135–144
URL:
https://aclanthology.org/2022.nlp4dh-1.17
Cite (ACL):
Chao-Lin Liu, Ti-Yong Zheng, Kuan-Chun Chen, and Meng-Han Chung. 2022. Introducing a Large Corpus of Tokenized Classical Chinese Poems of Tang and Song Dynasties. In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, pages 135–144, Taipei, Taiwan. Association for Computational Linguistics.
Cite (Informal):
Introducing a Large Corpus of Tokenized Classical Chinese Poems of Tang and Song Dynasties (Liu et al., NLP4DH 2022)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2022.nlp4dh-1.17.pdf