Supervised Morphological Segmentation Using Rich Annotated Lexicon

Ebrahim Ansari, Zdeněk Žabokrtský, Mohammad Mahmoudi, Hamid Haghdoost, Jonáš Vidra


Abstract
Morphological segmentation is the process of dividing a word into smaller units called morphemes; it is especially difficult for morphologically rich or polysynthetic languages. In this work, we designed and evaluated several Recurrent Neural Network (RNN) based models, as well as various other machine-learning approaches, for the morphological segmentation task. We trained our models on annotated segmentation lexicons. To evaluate the effect of training data size on our models, we created a large hand-annotated, morphologically segmented corpus of Persian words, which is, to the best of our knowledge, the first and only segmentation lexicon for the Persian language. In the experimental phase, using the hand-annotated Persian lexicon and two smaller, similar lexicons for Czech and Finnish, we evaluated the effect of training data size, different hyper-parameter settings, and different RNN-based models.
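As an illustration of how an annotated segmentation lexicon can serve as supervised training data, the sketch below converts one segmented entry into per-character boundary labels and back. This is a common way to frame segmentation as sequence labeling, not necessarily the encoding used in the paper. The `+` separator, the function names `boundary_labels` and `apply_labels`, and the English example word are all assumptions made for illustration; none of them come from the Persian, Czech, or Finnish lexicons.

```python
def boundary_labels(segmented, sep="+"):
    """Split a segmented entry (hypothetical format, e.g. 'un+break+able')
    into the surface word and one label per character:
    1 if a morpheme boundary follows that character, otherwise 0."""
    morphemes = [m for m in segmented.split(sep) if m]
    word = "".join(morphemes)
    labels = []
    for i, morpheme in enumerate(morphemes):
        is_last = i == len(morphemes) - 1
        # Every character inside a morpheme gets 0; the final character
        # gets 1 unless it also ends the word.
        labels.extend([0] * (len(morpheme) - 1))
        labels.append(0 if is_last else 1)
    return word, labels


def apply_labels(word, labels, sep="+"):
    """Inverse step: rebuild the segmented form from predicted labels."""
    pieces = []
    for char, label in zip(word, labels):
        pieces.append(char)
        if label == 1:
            pieces.append(sep)
    return "".join(pieces)


word, labels = boundary_labels("un+break+able")
restored = apply_labels(word, labels)
```

Under this framing, a sequence model reads the characters of `word` and predicts `labels`; `apply_labels` then turns the predictions back into a segmentation that can be compared with the lexicon entry.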
Anthology ID:
R19-1007
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month:
September
Year:
2019
Address:
Varna, Bulgaria
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
52–61
URL:
https://aclanthology.org/R19-1007
DOI:
10.26615/978-954-452-056-4_007
Cite (ACL):
Ebrahim Ansari, Zdeněk Žabokrtský, Mohammad Mahmoudi, Hamid Haghdoost, and Jonáš Vidra. 2019. Supervised Morphological Segmentation Using Rich Annotated Lexicon. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 52–61, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
Supervised Morphological Segmentation Using Rich Annotated Lexicon (Ansari et al., RANLP 2019)
PDF:
https://preview.aclanthology.org/ingestion-script-update/R19-1007.pdf