HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea

Haneul Yoo; Jiho Jin; Juhee Son; JinYeong Bak; Kyunghyun Cho; Alice Oh

doi:10.18653/v1/2022.findings-naacl.140

Abstract

Historical records in Korea before the 20th century were primarily written in Hanja, an extinct language based on Chinese characters and not understood by modern Korean or Chinese speakers. Historians with expertise in this period have been analyzing the documents, but the process is difficult and time-consuming, and language models could speed it up significantly. Toward building and evaluating language models for Hanja, we release the Hanja Understanding Evaluation (HUE) dataset, consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models further trained on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and the Diaries of the Royal Secretariats. We compare the models with several baselines on all tasks and show that training on the two corpora yields significant improvements. Additionally, we run zero-shot experiments on the Daily Records of the Royal Court and Important Officials (DRRI). The DRRI dataset has received little study from historians and none from the NLP community.

Anthology ID: 2022.findings-naacl.140
Volume: Findings of the Association for Computational Linguistics: NAACL 2022
Month: July
Year: 2022
Address: Seattle, United States
Editors: Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1832–1844
URL: https://aclanthology.org/2022.findings-naacl.140
DOI: 10.18653/v1/2022.findings-naacl.140
Cite (ACL): Haneul Yoo, Jiho Jin, Juhee Son, JinYeong Bak, Kyunghyun Cho, and Alice Oh. 2022. HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1832–1844, Seattle, United States. Association for Computational Linguistics.
Cite (Informal): HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea (Yoo et al., Findings 2022)
PDF: https://preview.aclanthology.org/nschneid-patch-1/2022.findings-naacl.140.pdf
Video: https://preview.aclanthology.org/nschneid-patch-1/2022.findings-naacl.140.mp4
Code: haneul-yoo/hue