Yeojoo Jeon


2022

Developing Language Resources and NLP Tools for the North Korean Language
Arda Akdemir | Yeojoo Jeon | Tetsuo Shibuya
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Over the 70 years since the division of Korea, the two Korean languages have diverged significantly. However, because linguistic resources for the North Korean language are scarce, there is no DPRK-based language model. Consequently, scholars rely on Korean language models trained on South Korean linguistic data. In this paper, we first present a large-scale dataset for the North Korean language and use it to train a BERT-based language model, DPRK-BERT. Second, we annotate a subset of this dataset for the sentiment analysis task. Finally, we compare the performance of different language models on the masked language modeling and sentiment analysis tasks.