On the Importance of Subword Information for Morphological Tasks in Truly Low-Resource Languages

Yi Zhu, Benjamin Heinzerling, Ivan Vulić, Michael Strube, Roi Reichart, Anna Korhonen


Abstract
Recent work has validated the importance of subword information for word representation learning. Since subwords increase parameter sharing ability in neural models, their value should be even more pronounced in low-data regimes. In this work, we therefore provide a comprehensive analysis focused on the usefulness of subwords for word representation learning in truly low-resource scenarios and for three representative morphological tasks: fine-grained entity typing, morphological tagging, and named entity recognition. We conduct a systematic study that spans several dimensions of comparison: 1) the type of data scarcity, which can stem from the lack of task-specific training data, from the lack of unannotated data required to train word embeddings, or from both; 2) the language type, by working with a sample of 16 typologically diverse languages including some truly low-resource ones (e.g., Rusyn, Buryat, and Zulu); 3) the choice of the subword-informed word representation method. Our main results show that subword-informed models are universally useful across all language types, with large gains over subword-agnostic embeddings. They also suggest that the effective use of subwords largely depends on the language (type) and the task at hand, as well as on the amount of available data for training the embeddings and task-based models, where having sufficient in-task data is a more critical requirement.
Anthology ID:
K19-1021
Volume:
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Mohit Bansal, Aline Villavicencio
Venue:
CoNLL
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
216–226
URL:
https://aclanthology.org/K19-1021
DOI:
10.18653/v1/K19-1021
Cite (ACL):
Yi Zhu, Benjamin Heinzerling, Ivan Vulić, Michael Strube, Roi Reichart, and Anna Korhonen. 2019. On the Importance of Subword Information for Morphological Tasks in Truly Low-Resource Languages. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 216–226, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
On the Importance of Subword Information for Morphological Tasks in Truly Low-Resource Languages (Zhu et al., CoNLL 2019)
PDF:
https://preview.aclanthology.org/ingest-bitext-workshop/K19-1021.pdf
Data
Universal Dependencies