Shrinking Japanese Morphological Analyzers With Neural Networks and Semi-supervised Learning

Arseny Tolmachev, Daisuke Kawahara, Sadao Kurohashi


Abstract
For languages without natural word boundaries, like Japanese and Chinese, word segmentation is a prerequisite for downstream analysis. For Japanese, segmentation is often done jointly with part of speech tagging, and this process is usually referred to as morphological analysis. Morphological analyzers are trained on data hand-annotated with segmentation boundaries and part of speech tags. A segmentation dictionary or character n-gram information is also provided as additional input to the model. Incorporating this extra information makes models large. Modern neural morphological analyzers can consume gigabytes of memory. We propose a compact alternative to these cumbersome approaches that does not rely on any externally provided n-gram or word representations. The model uses only unigram character embeddings, encodes them using either a stacked bi-LSTM or a self-attention network, and independently infers both segmentation and part of speech information. The model is trained in an end-to-end and semi-supervised fashion, on labels produced by a state-of-the-art analyzer. We demonstrate that the proposed technique rivals the performance of a previous dictionary-based state-of-the-art approach and can even surpass it when trained on a combination of human-annotated and automatically annotated data. Our model itself is significantly smaller than the dictionary-based one: it uses less than 15 megabytes of space.
Anthology ID:
N19-1281
Volume:
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Editors:
Jill Burstein, Christy Doran, Thamar Solorio
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
2744–2755
URL:
https://aclanthology.org/N19-1281
DOI:
10.18653/v1/N19-1281
Cite (ACL):
Arseny Tolmachev, Daisuke Kawahara, and Sadao Kurohashi. 2019. Shrinking Japanese Morphological Analyzers With Neural Networks and Semi-supervised Learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2744–2755, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
Shrinking Japanese Morphological Analyzers With Neural Networks and Semi-supervised Learning (Tolmachev et al., NAACL 2019)
PDF:
https://preview.aclanthology.org/nschneid-patch-1/N19-1281.pdf