Jamo-Level Subword Tokenization in Low-Resource Korean Machine Translation

Junyoung Lee, Marco Cognetta, Sangwhan Moon, Naoaki Okazaki


Abstract
Subword tokenization, where text is represented in an intermediate form between full words and characters, is ubiquitous in modern NLP due to its ability to represent any input sentence with a small vocabulary. However, for Korean, where there are 11,172 base characters (*syllables*) in its alphabet, it is difficult to have a vocabulary large enough to succinctly encode text while fitting within parameter-budget constraints. This motivates us to explore an alternative representation for Korean which relies on the decompositional nature of Korean syllables: a syllable can be uniquely decomposed into a sequence of two or three subcharacters (*jamo*), of which there are only 68. Using jamo as the basis for subword tokenization (e.g., byte-pair encoding) leads to shorter tokenized sequences with fewer vocabulary parameters, exposes the model to sub-syllable-level morphological information, and increases the amount of augmentation gained from subword regularization. We evaluate jamo-level subword tokenization on several Korean translation tasks and find that jamo-level subword models consistently outperform syllable- and byte-level models in low-resource and restricted-vocabulary settings.
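
The syllable-to-jamo decomposition the abstract relies on can be illustrated with the standard Unicode arithmetic for precomposed Hangul syllables. The following is a minimal sketch, not the authors' preprocessing code: it assumes Unicode conjoining jamo as the output alphabet (the paper's 68-symbol jamo inventory may be encoded differently), and the function names `decompose` and `to_jamo` are hypothetical. The constants follow the Unicode Standard's Hangul decomposition algorithm, where 19 initial consonants × 21 vowels × 28 final slots (including "no final") give the 11,172 syllables.

```python
# Constants from the Unicode Hangul syllable decomposition algorithm.
S_BASE = 0xAC00  # first precomposed syllable, U+AC00
L_BASE = 0x1100  # first leading consonant (choseong)
V_BASE = 0x1161  # first vowel (jungseong)
T_BASE = 0x11A7  # one before the first trailing consonant (jongseong)
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no final"
S_COUNT = L_COUNT * V_COUNT * T_COUNT   # 11,172 syllables


def decompose(char):
    """Split one precomposed Hangul syllable into two or three jamo.

    Characters outside the Hangul syllable block are returned unchanged.
    """
    index = ord(char) - S_BASE
    if not 0 <= index < S_COUNT:
        return [char]
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, tail = divmod(rest, T_COUNT)
    jamo = [chr(L_BASE + lead), chr(V_BASE + vowel)]
    if tail:  # tail == 0 means the syllable has no final consonant
        jamo.append(chr(T_BASE + tail))
    return jamo


def to_jamo(text):
    """Rewrite a string as a jamo sequence (hypothetical helper)."""
    return "".join(j for ch in text for j in decompose(ch))


if __name__ == "__main__":
    for syllable in "한가":
        print(syllable, [f"U+{ord(j):04X}" for j in decompose(syllable)])
```

A subword learner such as byte-pair encoding would then be trained on the output of `to_jamo` rather than on the raw syllable text, so that merges can form below the syllable boundary.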
Anthology ID:
2025.loresmt-1.8
Volume:
Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico, U.S.A.
Editors:
Atul Kr. Ojha, Chao-hong Liu, Ekaterina Vylomova, Flammie Pirinen, Jonathan Washington, Nathaniel Oco, Xiaobing Zhao
Venues:
LoResMT | WS
Publisher:
Association for Computational Linguistics
Pages:
66–80
URL:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.loresmt-1.8/
Cite (ACL):
Junyoung Lee, Marco Cognetta, Sangwhan Moon, and Naoaki Okazaki. 2025. Jamo-Level Subword Tokenization in Low-Resource Korean Machine Translation. In Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025), pages 66–80, Albuquerque, New Mexico, U.S.A. Association for Computational Linguistics.
Cite (Informal):
Jamo-Level Subword Tokenization in Low-Resource Korean Machine Translation (Lee et al., LoResMT 2025)
PDF:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.loresmt-1.8.pdf