Guanglai Gao


2020

pdf
Incorporating Inner-word and Out-word Features for Mongolian Morphological Segmentation
Na Liu | Xiangdong Su | Haoran Zhang | Guanglai Gao | Feilong Bao
Proceedings of the 28th International Conference on Computational Linguistics

Mongolian morphological segmentation is regarded as a crucial preprocessing step in many Mongolian related NLP applications and has received extensive attention. Recently, end-to-end segmentation approaches with long short-term memory networks (LSTM) have achieved excellent results. However, the inner-word features among characters in the word and the out-word features from context are not well utilized in the segmentation process. In this paper, we propose a neural network incorporating inner-word and out-word features for Mongolian morphological segmentation. The network consists of two encoders and one decoder. The inner-word encoder uses the self-attention mechanisms to capture the inner-word features of the target word. The out-word encoder employs a two layers BiLSTM network to extract out-word features in the sentence. Then, the decoder adopts a multi-head double attention layer to fuse the inner-word features and out-word features and produces the segmentation result. The evaluation experiment compares the proposed network with the baselines and explores the effectiveness of the sub-modules.

2018

pdf
A LSTM Approach with Sub-Word Embeddings for Mongolian Phrase Break Prediction
Rui Liu | Feilong Bao | Guanglai Gao | Hui Zhang | Yonghe Wang
Proceedings of the 27th International Conference on Computational Linguistics

In this paper, we first utilize the word embedding that focuses on sub-word units to the Mongolian Phrase Break (PB) prediction task by using Long-Short-Term-Memory (LSTM) model. Mongolian is an agglutinative language. Each root can be followed by several suffixes to form probably millions of words, but the existing Mongolian corpus is not enough to build a robust entire word embedding, thus it suffers a serious data sparse problem and brings a great difficulty for Mongolian PB prediction. To solve this problem, we look at sub-word units in Mongolian word, and encode their information to a meaningful representation, then fed it to LSTM to decode the best corresponding PB label. Experimental results show that the proposed model significantly outperforms traditional CRF model using manually features and obtains 7.49% F-Measure gain.

2016

pdf
Mongolian Named Entity Recognition System with Rich Features
Weihua Wang | Feilong Bao | Guanglai Gao
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper, we first build a manually annotated named entity corpus of Mongolian. Then, we propose three morphological processing methods and study comprehensive features, including syllable features, lexical features, context features, morphological features and semantic features in Mongolian named entity recognition. Moreover, we also evaluate the influence of word cluster features on the system and combine all features together eventually. The experimental result shows that segmenting each suffix into an individual token achieves better results than deleting suffixes or using the suffixes as feature. The system based on segmenting suffixes with all proposed features yields benchmark result of F-measure=84.65 on this corpus.