Jae Sung Lee

2026

Tokenization plays a crucial role in the performance of language models. However, most existing tokenizers rely on frequency-based segmentation, which fails to capture the morphological structure of languages and often leads to inefficient token representations. In this study, we propose a novel tokenization method that emphasizes the importance of Korean morphological structure within the eojeol (the Korean spacing unit). The method is designed to accommodate both inter-eojeol and intra-eojeol segmentation, enabling the selection of subwords based on morphemes. We pretrained a language model using the proposed method and evaluated its performance on Korean benchmark tasks. Experimental results demonstrate that the proposed method generally outperforms existing approaches. Notably, it produces significantly fewer tokens per input sequence, indicating its effectiveness and efficiency for Korean language modeling. The code is available at https://github.com/Dohy-Lee/mob.
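To make the two-level idea concrete, here is a minimal sketch (not the authors' implementation; see the linked repository for that) of combining inter-eojeol segmentation, where eojeol are whitespace-delimited units, with intra-eojeol segmentation that greedily selects subwords from a morpheme vocabulary. The sentence and the tiny vocabulary `TOY_MORPHEMES` are hypothetical examples introduced for illustration.

```python
# Hedged sketch of morpheme-aware two-level tokenization; the vocabulary
# and matching strategy are illustrative assumptions, not the paper's method.

TOY_MORPHEMES = {"나", "는", "학교", "에", "갔", "다"}  # hypothetical morpheme vocabulary

def segment_eojeol(eojeol, vocab):
    """Intra-eojeol segmentation: greedy longest-match against the morpheme vocabulary."""
    tokens, i = [], 0
    while i < len(eojeol):
        for j in range(len(eojeol), i, -1):  # try the longest candidate first
            if eojeol[i:j] in vocab:
                tokens.append(eojeol[i:j])
                i = j
                break
        else:
            tokens.append(eojeol[i])  # fallback: emit a single character
            i += 1
    return tokens

def tokenize(sentence, vocab):
    # Inter-eojeol segmentation first: eojeol are separated by whitespace.
    return [tok for eojeol in sentence.split() for tok in segment_eojeol(eojeol, vocab)]

print(tokenize("나는 학교에 갔다", TOY_MORPHEMES))
# → ['나', '는', '학교', '에', '갔', '다']
```

Because subword boundaries track morphemes rather than raw character frequencies, each eojeol decomposes into a small number of linguistically meaningful pieces, which is the intuition behind the reduced token counts reported in the abstract.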