Sub-character Neural Language Modelling in Japanese

Viet Nguyen, Julian Brooke, Timothy Baldwin


Abstract
In East Asian languages such as Japanese and Chinese, the semantics of a character are (somewhat) reflected in its sub-character elements. This paper examines the effect of using sub-characters for language modelling in Japanese. This is achieved by decomposing characters according to a range of character decomposition datasets, and training a neural language model over variously decomposed character representations. Our results indicate that language modelling can be improved through the inclusion of sub-characters, though this result depends on a good choice of decomposition dataset and the appropriate granularity of decomposition.
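The decomposition step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the table `TOY_DECOMPOSITIONS` is a tiny hand-written mapping, not one of the decomposition datasets used in the paper, and the helper names `decompose` and `decompose_text` and the `depth` parameter (standing in for the granularity of decomposition) are hypothetical. The sketch only shows how a character sequence might be turned into sub-character tokens before being passed to a language model; it is not the authors' implementation.

```python
# Toy decomposition table (illustrative only; NOT taken from the paper's datasets).
# Each key maps a character to the components it is built from.
TOY_DECOMPOSITIONS = {
    "明": ["日", "月"],
    "語": ["言", "吾"],
    "吾": ["五", "口"],
    "時": ["日", "寺"],
    "寺": ["土", "寸"],
}


def decompose(char, depth):
    """Expand a character into components, descending at most `depth` levels.

    depth=0 returns the character unchanged; characters missing from the
    table (kana, punctuation, unlisted kanji) are also returned unchanged.
    """
    if depth <= 0 or char not in TOY_DECOMPOSITIONS:
        return [char]
    parts = []
    for component in TOY_DECOMPOSITIONS[char]:
        parts.extend(decompose(component, depth - 1))
    return parts


def decompose_text(text, depth):
    """Turn a string into a flat token list at the given granularity."""
    tokens = []
    for char in text:
        tokens.extend(decompose(char, depth))
    return tokens


if __name__ == "__main__":
    for depth in (0, 1, 2):
        print(depth, decompose_text("日本語", depth))
```

Varying `depth` changes how finely each character is split, which is one simple way to picture the "granularity of decomposition" the abstract refers to; the resulting token lists would then serve as input sequences for whichever language model is being trained.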
Anthology ID:
W17-4122
Volume:
Proceedings of the First Workshop on Subword and Character Level Models in NLP
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Manaal Faruqui, Hinrich Schuetze, Isabel Trancoso, Yadollah Yaghoobzadeh
Venue:
SCLeM
Publisher:
Association for Computational Linguistics
Pages:
148–153
URL:
https://aclanthology.org/W17-4122
DOI:
10.18653/v1/W17-4122
Cite (ACL):
Viet Nguyen, Julian Brooke, and Timothy Baldwin. 2017. Sub-character Neural Language Modelling in Japanese. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 148–153, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
Sub-character Neural Language Modelling in Japanese (Nguyen et al., SCLeM 2017)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/W17-4122.pdf