Hwichan Kim


2022

pdf
Learning How to Translate North Korean through South Korean
Hwichan Kim | Sangwhan Moon | Naoaki Okazaki | Mamoru Komachi
Proceedings of the Thirteenth Language Resources and Evaluation Conference

South and North Korea both use the Korean language. However, Korean NLP research has focused on South Korean only, and existing NLP systems of the Korean language, such as neural machine translation (NMT) models, cannot properly handle North Korean inputs. Training a model using North Korean data is the most straightforward approach to solving this problem, but there is insufficient data to train NMT models. In this study, we create data for North Korean NMT models using a comparable corpus. First, we manually create evaluation data for automatic alignment and machine translation, and then, investigate automatic alignment methods suitable for North Korean. Finally, we show that a model trained by North Korean bilingual data without human annotation significantly boosts North Korean translation accuracy compared to existing South Korean models in zero-shot settings.

2021

pdf
Can Monolingual Pre-trained Encoder-Decoder Improve NMT for Distant Language Pairs?
Hwichan Kim | Mamoru Komachi
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

pdf
TMU NMT System with Japanese BART for the Patent task of WAT 2021
Hwichan Kim | Mamoru Komachi
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

In this paper, we introduce our TMU Neural Machine Translation (NMT) system submitted for the Patent task (Korean Japanese and English Japanese) of 8th Workshop on Asian Translation (Nakazawa et al., 2021). Recently, several studies proposed pre-trained encoder-decoder models using monolingual data. One of the pre-trained models, BART (Lewis et al., 2020), was shown to improve translation accuracy via fine-tuning with bilingual data. However, they experimented only Romanian!English translation using English BART. In this paper, we examine the effectiveness of Japanese BART using Japan Patent Office Corpus 2.0. Our experiments indicate that Japanese BART can also improve translation accuracy in both Korean Japanese and English Japanese translations.

2020

pdf
Korean-to-Japanese Neural Machine Translation System using Hanja Information
Hwichan Kim | Tosho Hirasawa | Mamoru Komachi
Proceedings of the 7th Workshop on Asian Translation

In this paper, we describe our TMU neural machine translation (NMT) system submitted for the Patent task (Korean→Japanese) of the 7th Workshop on Asian Translation (WAT 2020, Nakazawa et al., 2020). We propose a novel method to train a Korean-to-Japanese translation model. Specifically, we focus on the vocabulary overlap of Korean Hanja words and Japanese Kanji words, and propose strategies to leverage Hanja information. Our experiment shows that Hanja information is effective within a specific domain, leading to an improvement in the BLEU scores by +1.09 points compared to the baseline.

pdf
Zero-shot North Korean to English Neural Machine Translation by Character Tokenization and Phoneme Decomposition
Hwichan Kim | Tosho Hirasawa | Mamoru Komachi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

The primary limitation of North Korean to English translation is the lack of a parallel corpus; therefore, high translation accuracy cannot be achieved. To address this problem, we propose a zero-shot approach using South Korean data, which are remarkably similar to North Korean data. We train a neural machine translation model after tokenizing a South Korean text at the character level and decomposing characters into phonemes.We demonstrate that our method can effectively learn North Korean to English translation and improve the BLEU scores by +1.01 points in comparison with the baseline.