A LSTM Approach with Sub-Word Embeddings for Mongolian Phrase Break Prediction

Rui Liu, Feilong Bao, Guanglai Gao, Hui Zhang, Yonghe Wang


Abstract
In this paper, we apply, for the first time, word embeddings built from sub-word units to the Mongolian Phrase Break (PB) prediction task, using a Long Short-Term Memory (LSTM) model. Mongolian is an agglutinative language: each root can be followed by several suffixes, potentially forming millions of words. The existing Mongolian corpus is too small to train robust whole-word embeddings, so whole-word approaches suffer from severe data sparsity, which makes Mongolian PB prediction difficult. To address this problem, we look at the sub-word units within each Mongolian word, encode their information into a meaningful representation, and feed it to an LSTM to decode the best corresponding PB label. Experimental results show that the proposed model significantly outperforms a traditional CRF model using manually designed features, obtaining a 7.49% F-measure gain.
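The sub-word idea in the abstract can be illustrated with a minimal sketch. It assumes character trigrams as the sub-word units and simple averaging as the composition step; the abstract does not specify either, so both are stand-ins rather than the authors' method. The names `subword_units` and `SubwordEmbedder` are hypothetical, the vectors are random rather than trained, the example strings are placeholders rather than real Mongolian data, and the LSTM tagging stage is omitted.

```python
import random


def subword_units(word, n=3):
    """Split a word into overlapping character n-grams with boundary markers.

    Assumption: character n-grams stand in for the paper's sub-word units.
    """
    padded = f"<{word}>"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class SubwordEmbedder:
    """Builds a word vector by averaging the vectors of its sub-word units."""

    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}  # sub-word unit -> vector (randomly initialised here)

    def unit_vector(self, unit):
        # Create a vector the first time a unit is seen; reuse it afterwards.
        if unit not in self.table:
            self.table[unit] = [self.rng.uniform(-1.0, 1.0) for _ in range(self.dim)]
        return self.table[unit]

    def embed(self, word):
        vectors = [self.unit_vector(u) for u in subword_units(word)]
        # Average each dimension across all sub-word unit vectors.
        return [sum(column) / len(vectors) for column in zip(*vectors)]


# Placeholder strings sharing a "root" prefix but ending in different "suffixes".
embedder = SubwordEmbedder()
seen_word, unseen_word = "rootxa", "rootyb"
shared_units = set(subword_units(seen_word)) & set(subword_units(unseen_word))
seen_vector = embedder.embed(seen_word)
unseen_vector = embedder.embed(unseen_word)
```

Because the two placeholder words share the units covering their common prefix, the second word receives a representation partly built from parameters already used by the first. This sharing is what lets a sub-word model cope with word forms that are rare or absent in a small corpus; in a full system, each word vector would then be passed to a sequence model such as an LSTM to predict a PB label per word.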
Anthology ID:
C18-1207
Volume:
Proceedings of the 27th International Conference on Computational Linguistics
Month:
August
Year:
2018
Address:
Santa Fe, New Mexico, USA
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
2448–2455
URL:
https://aclanthology.org/C18-1207
Cite (ACL):
Rui Liu, Feilong Bao, Guanglai Gao, Hui Zhang, and Yonghe Wang. 2018. A LSTM Approach with Sub-Word Embeddings for Mongolian Phrase Break Prediction. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2448–2455, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
A LSTM Approach with Sub-Word Embeddings for Mongolian Phrase Break Prediction (Liu et al., COLING 2018)
PDF:
https://preview.aclanthology.org/ingestion-script-update/C18-1207.pdf