Abstract
The use of subword-level information (e.g., characters, character n-grams, morphemes) has become ubiquitous in modern word representation learning. Its importance is attested especially for morphologically rich languages which generate a large number of rare words. Despite a steadily increasing interest in such subword-informed word representations, their systematic comparative analysis across typologically diverse languages and different tasks is still missing. In this work, we deliver such a study focusing on the variation of two crucial components required for subword-level integration into word representation models: 1) segmentation of words into subword units, and 2) subword composition functions to obtain final word representations. We propose a general framework for learning subword-informed word representations that allows for easy experimentation with different segmentation and composition components, also including more advanced techniques based on position embeddings and self-attention. Using the unified framework, we run experiments over a large number of subword-informed word representation configurations (60 in total) on 3 tasks (general and rare word similarity, dependency parsing, fine-grained entity typing) for 5 languages representing 3 language types. Our main results clearly indicate that there is no “one-size-fits-all” configuration, as performance is both language- and task-dependent. We also show that configurations based on unsupervised segmentation (e.g., BPE, Morfessor) are sometimes comparable to, or even outperform, the ones based on supervised word segmentation.
- Anthology ID:
- N19-1097
- Volume:
- Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
- Month:
- June
- Year:
- 2019
- Address:
- Minneapolis, Minnesota
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 912–932
- URL:
- https://aclanthology.org/N19-1097
- DOI:
- 10.18653/v1/N19-1097
- Cite (ACL):
- Yi Zhu, Ivan Vulić, and Anna Korhonen. 2019. A Systematic Study of Leveraging Subword Information for Learning Word Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 912–932, Minneapolis, Minnesota. Association for Computational Linguistics.
- Cite (Informal):
- A Systematic Study of Leveraging Subword Information for Learning Word Representations (Zhu et al., NAACL 2019)
- PDF:
- https://preview.aclanthology.org/starsem-semeval-split/N19-1097.pdf
- Code:
- cambridgeltl/sw_study
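
The two components named in the abstract, segmentation of a word into subword units and composition of subword vectors into a word vector, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper or from the cambridgeltl/sw_study repository: the function names (`char_ngrams`, `make_embedding_table`, `compose_by_addition`) are hypothetical, character n-grams with `<` and `>` boundary markers stand in for one possible segmentation, element-wise addition stands in for one possible composition function, and the embeddings are random rather than learned.

```python
import random


def char_ngrams(word, n_min=3, n_max=4):
    """Segment a word into character n-grams (illustrative segmentation).

    '<' and '>' mark the word boundaries so that prefixes and suffixes
    get n-grams distinct from word-internal ones.
    """
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]


def make_embedding_table(dim=4, seed=0):
    """Return a lookup function that lazily assigns each subword a vector.

    Assumption: vectors are random placeholders; a real model would learn them.
    """
    rng = random.Random(seed)
    table = {}

    def lookup(unit):
        if unit not in table:
            table[unit] = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
        return table[unit]

    return lookup


def compose_by_addition(subwords, lookup):
    """Compose a word vector as the element-wise sum of its subword vectors."""
    vectors = [lookup(s) for s in subwords]
    return [sum(column) for column in zip(*vectors)]


lookup = make_embedding_table(dim=4)
units = char_ngrams("dishonestly", n_min=3, n_max=3)
word_vector = compose_by_addition(units, lookup)
print(units)             # the trigram segmentation of the word
print(len(word_vector))  # dimensionality of the composed word vector
```

Swapping `char_ngrams` for a different segmenter, or `compose_by_addition` for a different composition function, leaves the rest of the pipeline unchanged, which is the kind of modularity the abstract attributes to the proposed framework.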