Parameter-Efficient Korean Character-Level Language Modeling
Marco Cognetta, Sangwhan Moon, Lawrence Wolf-Sonkin, Naoaki Okazaki
Abstract
Character-level language modeling has been shown empirically to perform well on highly agglutinative or morphologically rich languages while using only a small fraction of the parameters required by (sub)word models. Korean fits nicely into this framework, except that, like other CJK languages, it has a very large character vocabulary of 11,172 unique syllables. However, unlike Japanese Kanji and Chinese Hanzi, each Korean syllable can be uniquely factored into a small set of subcharacters, called jamo. We explore a “three-hot” scheme that exploits the decomposability of Korean characters to model at the syllable level while using only jamo-level representations. We find that our three-hot embedding and decoding scheme alleviates the two major issues with prior syllable- and jamo-level models: it requires fewer than 1% of the embedding parameters of a syllable model, and it does not require tripling the sequence length, as jamo models do. In addition, it addresses a theoretical flaw in a prior three-hot modeling scheme. Our experiments show that, even when reducing the number of embedding parameters by 99.6% (from 11.4M to just 36k), our model suffers no loss in translation quality compared to the baseline syllable model.
- Anthology ID: 2023.eacl-main.172
- Volume: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
- Month: May
- Year: 2023
- Address: Dubrovnik, Croatia
- Editors: Andreas Vlachos, Isabelle Augenstein
- Venue: EACL
- Publisher: Association for Computational Linguistics
- Pages: 2350–2356
- URL: https://aclanthology.org/2023.eacl-main.172
- DOI: 10.18653/v1/2023.eacl-main.172
- Cite (ACL): Marco Cognetta, Sangwhan Moon, Lawrence Wolf-Sonkin, and Naoaki Okazaki. 2023. Parameter-Efficient Korean Character-Level Language Modeling. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2350–2356, Dubrovnik, Croatia. Association for Computational Linguistics.
- Cite (Informal): Parameter-Efficient Korean Character-Level Language Modeling (Cognetta et al., EACL 2023)
- PDF: https://preview.aclanthology.org/insights-reingestion/2023.eacl-main.172.pdf
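The unique factorization the abstract relies on is the standard Unicode Hangul syllable arithmetic: each precomposed syllable in U+AC00–U+D7A3 decomposes into a lead consonant (19 choices), a vowel (21), and an optional tail consonant (28, including "none"), giving exactly 19 × 21 × 28 = 11,172 syllables. The sketch below illustrates that decomposition; it is not the paper's implementation, only the Unicode arithmetic a three-hot scheme would build on.

```python
# Standard Unicode Hangul syllable <-> jamo-index arithmetic (a sketch,
# not the paper's code). Every precomposed syllable factors uniquely
# into (lead, vowel, tail) indices: 19 * 21 * 28 = 11,172 syllables.

N_LEADS, N_VOWELS, N_TAILS = 19, 21, 28
SYLLABLE_BASE = 0xAC00  # first precomposed syllable, '가'

def decompose(syllable: str) -> tuple[int, int, int]:
    """Factor one Hangul syllable into its three jamo indices."""
    index = ord(syllable) - SYLLABLE_BASE
    assert 0 <= index < N_LEADS * N_VOWELS * N_TAILS, "not a Hangul syllable"
    lead, rest = divmod(index, N_VOWELS * N_TAILS)
    vowel, tail = divmod(rest, N_TAILS)
    return lead, vowel, tail

def compose(lead: int, vowel: int, tail: int) -> str:
    """Invert decompose: rebuild the syllable from its jamo indices."""
    return chr(SYLLABLE_BASE + (lead * N_VOWELS + vowel) * N_TAILS + tail)

# '한' (U+D55C) factors into lead ㅎ (18), vowel ㅏ (0), tail ㄴ (4).
print(decompose("한"))    # (18, 0, 4)
print(compose(18, 0, 4))  # 한
```

Because the three index spaces are so small, a three-hot embedding only needs 19 + 21 + 28 = 68 jamo vectors instead of 11,172 syllable vectors, which is where the abstract's drastic reduction in embedding parameters comes from.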