Abstract
Learning internal word structure has recently been recognized as an important step in various multilingual processing tasks and in theoretical language comparison. In this paper, we present a neural encoder-decoder model for learning canonical morphological segmentation. Our model combines character-level sequence-to-sequence transformation with a language model over canonical segments. We obtain up to 4% improvement over a strong character-level encoder-decoder baseline for three languages. Our model outperforms the previous state-of-the-art for two languages, while eliminating the need for external resources such as large dictionaries. Finally, by comparing the performance of encoder-decoder and classical statistical machine translation systems trained with and without corpus counts, we show that including corpus counts is beneficial to both approaches.

- Anthology ID: K17-1020
- Volume: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)
- Month: August
- Year: 2017
- Address: Vancouver, Canada
- Editors: Roger Levy, Lucia Specia
- Venue: CoNLL
- SIG: SIGNLL
- Publisher: Association for Computational Linguistics
- Pages: 184–194
- URL: https://aclanthology.org/K17-1020
- DOI: 10.18653/v1/K17-1020
- Cite (ACL): Tatyana Ruzsics and Tanja Samardžić. 2017. Neural Sequence-to-sequence Learning of Internal Word Structure. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 184–194, Vancouver, Canada. Association for Computational Linguistics.
- Cite (Informal): Neural Sequence-to-sequence Learning of Internal Word Structure (Ruzsics & Samardžić, CoNLL 2017)
- PDF: https://preview.aclanthology.org/fix-dup-bibkey/K17-1020.pdf