Neural Sequence-to-sequence Learning of Internal Word Structure

Tatyana Ruzsics, Tanja Samardžić



Abstract
Learning internal word structure has recently been recognized as an important step in various multilingual processing tasks and in theoretical language comparison. In this paper, we present a neural encoder-decoder model for learning canonical morphological segmentation. Our model combines character-level sequence-to-sequence transformation with a language model over canonical segments. We obtain up to 4% improvement over a strong character-level encoder-decoder baseline for three languages. Our model outperforms the previous state-of-the-art for two languages, while eliminating the need for external resources such as large dictionaries. Finally, by comparing the performance of encoder-decoder and classical statistical machine translation systems trained with and without corpus counts, we show that including corpus counts is beneficial to both approaches.
Anthology ID:
K17-1020
Volume:
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)
Month:
August
Year:
2017
Address:
Vancouver, Canada
Editors:
Roger Levy, Lucia Specia
Venue:
CoNLL
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
184–194
URL:
https://aclanthology.org/K17-1020
DOI:
10.18653/v1/K17-1020
Cite (ACL):
Tatyana Ruzsics and Tanja Samardžić. 2017. Neural Sequence-to-sequence Learning of Internal Word Structure. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 184–194, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal):
Neural Sequence-to-sequence Learning of Internal Word Structure (Ruzsics & Samardžić, CoNLL 2017)
PDF:
https://aclanthology.org/K17-1020.pdf