Abstract
Byte-pair encoding is a method for splitting a word into sub-word tokens; a language model then assigns a contextual representation to each of these tokens separately. In this paper, we evaluate four different methods of composing such sub-word representations into word representations. We evaluate the methods on morphological sequence classification, the task of predicting the grammatical features of a word. Our experiments reveal that using an RNN to compute word representations is consistently more effective than the other methods tested across a sample of eight languages with different typology and varying numbers of byte-pair tokens per word.
- Anthology ID:
- 2020.udw-1.9
- Volume:
- Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)
- Month:
- December
- Year:
- 2020
- Address:
- Barcelona, Spain (Online)
- Editors:
- Marie-Catherine de Marneffe, Miryam de Lhoneux, Joakim Nivre, Sebastian Schuster
- Venue:
- UDW
- Publisher:
- Association for Computational Linguistics
- Pages:
- 76–86
- URL:
- https://aclanthology.org/2020.udw-1.9
- Cite (ACL):
- Adam Ek and Jean-Philippe Bernardy. 2020. Composing Byte-Pair Encodings for Morphological Sequence Classification. In Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020), pages 76–86, Barcelona, Spain (Online). Association for Computational Linguistics.
- Cite (Informal):
- Composing Byte-Pair Encodings for Morphological Sequence Classification (Ek & Bernardy, UDW 2020)
- PDF:
- https://preview.aclanthology.org/ml4al-ingestion/2020.udw-1.9.pdf
- Code:
- adamlek/ud-morphological-tagging
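
The abstract describes composing per-token (sub-word) representations into a single representation per word. The sketch below illustrates that general idea only. The abstract does not name the four methods evaluated, so the sum and mean compositions, the toy recurrent update with scalar weights, and all function names here are assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def compose_sum(vectors: Sequence[Vector]) -> Vector:
    """Element-wise sum of the sub-word vectors of one word."""
    return [sum(dims) for dims in zip(*vectors)]


def compose_mean(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of the sub-word vectors of one word."""
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]


def compose_rnn(vectors: Sequence[Vector],
                w_in: float = 0.5, w_rec: float = 0.5) -> Vector:
    """Toy element-wise recurrent update; the final state is the word vector.

    The scalar weights are hypothetical stand-ins for learned parameters.
    """
    state = [0.0] * len(vectors[0])
    for vec in vectors:
        state = [math.tanh(w_in * x + w_rec * h) for x, h in zip(vec, state)]
    return state


def compose_words(token_vectors: Sequence[Vector],
                  word_ids: Sequence[int],
                  compose: Callable[[Sequence[Vector]], Vector]) -> List[Vector]:
    """Group consecutive token vectors by word index, then compose each group."""
    groups: List[List[Vector]] = []
    previous: Optional[int] = None
    for vec, word_id in zip(token_vectors, word_ids):
        if word_id != previous:
            groups.append([])  # a new word starts here
            previous = word_id
        groups[-1].append(vec)
    return [compose(group) for group in groups]


# Hypothetical input: three sub-word tokens for word 0, one token for word 1.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
word_ids = [0, 0, 0, 1]

mean_words = compose_words(token_vectors, word_ids, compose_mean)
sum_words = compose_words(token_vectors, word_ids, compose_sum)
rnn_words = compose_words(token_vectors, word_ids, compose_rnn)
```

Unlike sum and mean, the recurrent variant depends on the order of the sub-word tokens, which is one way a sequence model can differ from simple pooling.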