Eojin Jeon


2022

Break it Down into BTS: Basic, Tiniest Subword Units for Korean
Nayeon Kim | Jun-Hyung Park | Joon-Young Choi | Eojin Jeon | Youjin Kang | SangKeun Lee
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We introduce Basic, Tiniest Subword (BTS) units for the Korean language, which are inspired by the invention principle of Hangeul, the Korean writing system. Instead of relying on 51 Korean consonant and vowel letters, we form the letters from BTS units by adding strokes or combining them. To examine the impact of BTS units on Korean language processing, we develop a novel BTS-based word embedding framework that is readily applicable to various models. Our experiments reveal that BTS units significantly improve the performance of Korean word embedding on all intrinsic and extrinsic tasks in our evaluation. In particular, BTS-based word embedding outperforms the state-of-the-art Korean word embedding by 11.8% in word analogy. We further investigate the unique advantages provided by BTS units through in-depth analysis.
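
For orientation, the letter-level decomposition that the abstract contrasts BTS units against can be sketched with standard Unicode arithmetic on precomposed Hangul syllables. This is only an illustration of the consonant-and-vowel-letter level: the abstract does not specify the BTS inventory or the stroke-adding and combination rules, so the sketch does not attempt them. The names `LEADS`, `VOWELS`, `TAILS`, and `to_letters` are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: splits precomposed Hangul syllables into
# consonant and vowel letters. It does NOT implement BTS units, whose
# rules are not given in the abstract.

LEADS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")        # 19 initial consonants
VOWELS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")   # 21 vowels
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + "none"

SYLLABLE_BASE = 0xAC00   # first precomposed syllable, "가"
SYLLABLE_LAST = 0xD7A3   # last precomposed syllable, "힣"


def to_letters(text: str) -> list[str]:
    """Return the letter sequence of `text`; non-syllable characters pass through."""
    letters = []
    for ch in text:
        code = ord(ch)
        if SYLLABLE_BASE <= code <= SYLLABLE_LAST:
            index = code - SYLLABLE_BASE
            lead, rest = divmod(index, len(VOWELS) * len(TAILS))
            vowel, tail = divmod(rest, len(TAILS))
            letters.append(LEADS[lead])
            letters.append(VOWELS[vowel])
            if TAILS[tail]:          # skip the empty "no final consonant" slot
                letters.append(TAILS[tail])
        else:
            letters.append(ch)
    return letters


if __name__ == "__main__":
    print(to_letters("한글"))
```

The distinct letters across the three tables number 51 (30 consonants and 21 vowels), which matches the letter count mentioned in the abstract. A BTS-style decomposition would further break these letters down, using a mapping that would have to come from the paper itself.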