Incorporating Word Attention into Character-Based Word Segmentation

Shohei Higashiyama, Masao Utiyama, Eiichiro Sumita, Masao Ideuchi, Yoshiaki Oida, Yohei Sakamoto, Isaac Okada


Abstract
Neural network models have been actively applied to word segmentation, especially Chinese, because of the ability to minimize the effort in feature engineering. Typical segmentation models are categorized as character-based, for conducting exact inference, or word-based, for utilizing word-level information. We propose a character-based model utilizing word information to leverage the advantages of both types of models. Our model learns the importance of multiple candidate words for a character on the basis of an attention mechanism, and makes use of it for segmentation decisions. The experimental results show that our model achieves better performance than the state-of-the-art models on both Japanese and Chinese benchmark datasets.
Anthology ID:
N19-1276
Volume:
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
2699–2709
Language:
URL:
https://aclanthology.org/N19-1276
DOI:
10.18653/v1/N19-1276
Bibkey:
Cite (ACL):
Shohei Higashiyama, Masao Utiyama, Eiichiro Sumita, Masao Ideuchi, Yoshiaki Oida, Yohei Sakamoto, and Isaac Okada. 2019. Incorporating Word Attention into Character-Based Word Segmentation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2699–2709, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
Incorporating Word Attention into Character-Based Word Segmentation (Higashiyama et al., NAACL 2019)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingestion-script-update/N19-1276.pdf
Code
 shigashiyama/seikanlp