Abstract
In this paper, we are the first to apply word embeddings built from sub-word units to the Mongolian Phrase Break (PB) prediction task, using a Long Short-Term Memory (LSTM) model. Mongolian is an agglutinative language: each root can be followed by several suffixes, yielding potentially millions of word forms, but the existing Mongolian corpora are too small to train robust whole-word embeddings. The resulting data sparsity makes Mongolian PB prediction difficult. To address this problem, we segment Mongolian words into sub-word units, encode them into a meaningful representation, and feed that representation to an LSTM to decode the best corresponding PB label. Experimental results show that the proposed model significantly outperforms a traditional CRF model with hand-crafted features, obtaining a 7.49% F-measure gain.
- Anthology ID:
- C18-1207
- Volume:
- Proceedings of the 27th International Conference on Computational Linguistics
- Month:
- August
- Year:
- 2018
- Address:
- Santa Fe, New Mexico, USA
- Editors:
- Emily M. Bender, Leon Derczynski, Pierre Isabelle
- Venue:
- COLING
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2448–2455
- URL:
- https://aclanthology.org/C18-1207
- Cite (ACL):
- Rui Liu, Feilong Bao, Guanglai Gao, Hui Zhang, and Yonghe Wang. 2018. A LSTM Approach with Sub-Word Embeddings for Mongolian Phrase Break Prediction. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2448–2455, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Cite (Informal):
- A LSTM Approach with Sub-Word Embeddings for Mongolian Phrase Break Prediction (Liu et al., COLING 2018)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/C18-1207.pdf
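The pipeline the abstract describes (sub-word units → word representation → LSTM → per-word PB label) can be sketched as a toy forward pass. This is a minimal illustration, not the paper's actual model: the sub-word vocabulary, dimensions, random weights, the mean-pooling of sub-word embeddings, and the two-label set {B, NB} are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sub-word vocabulary; in the paper, words are segmented
# into sub-word units (e.g. roots and suffixes) instead of being
# embedded as whole words.
subword_vocab = {"ab": 0, "cd": 1, "ef": 2, "gh": 3}
emb_dim, hid_dim = 4, 5
E = rng.normal(size=(len(subword_vocab), emb_dim))  # sub-word embedding table

def word_embedding(subwords):
    """Represent a word as the mean of its sub-word embeddings
    (a simplification chosen for this sketch)."""
    idx = [subword_vocab[s] for s in subwords]
    return E[idx].mean(axis=0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Single LSTM cell: one stacked weight matrix for the
# input, forget, output, and candidate gates.
W = rng.normal(scale=0.1, size=(4 * hid_dim, emb_dim + hid_dim))
b = np.zeros(4 * hid_dim)

def lstm_step(x, h, c):
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c + i * g          # new cell state
    h = o * np.tanh(c)         # new hidden state
    return h, c

# Output layer mapping each hidden state to a phrase-break label:
# B = break after this word, NB = no break (illustrative label set).
labels = ["B", "NB"]
V = rng.normal(scale=0.1, size=(len(labels), hid_dim))

def predict_breaks(sentence):
    """sentence: list of words, each given as a list of sub-word units.
    Returns one PB label per word."""
    h, c = np.zeros(hid_dim), np.zeros(hid_dim)
    out = []
    for subwords in sentence:
        h, c = lstm_step(word_embedding(subwords), h, c)
        out.append(labels[int(np.argmax(V @ h))])
    return out

preds = predict_breaks([["ab", "cd"], ["ef"], ["gh", "ab"]])
print(preds)  # one label per word, each "B" or "NB"
```

Because every sub-word unit recurs across many word forms, the embedding table stays small even when the whole-word vocabulary is huge, which is the sparsity argument the abstract makes; the trained model in the paper learns these parameters from data rather than using random weights.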