Composing Byte-Pair Encodings for Morphological Sequence Classification

Adam Ek; Jean-Philippe Bernardy

Abstract

Byte-pair encoding is a method for splitting a word into sub-word tokens; a language model then assigns a contextual representation to each of these tokens separately. In this paper, we evaluate four different methods of composing such sub-word representations into word representations. We evaluate the methods on morphological sequence classification, the task of predicting the grammatical features of a word. Our experiments reveal that using an RNN to compute word representations is consistently more effective than the other methods tested across a sample of eight languages with different typologies and varying numbers of byte-pair tokens per word.
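
As a rough illustration of what composing sub-word representations into word representations can look like, the following Python sketch groups sub-word vectors by word and then composes each group in three ways. The abstract names only the RNN among the four methods, so summation and averaging are assumptions chosen for illustration. The function names, the toy vectors, and the randomly initialised, untrained weights are likewise hypothetical; this is not the authors' implementation.

```python
import math
import random


def group_by_word(vectors, word_ids):
    """Group sub-word vectors by the index of the word they belong to."""
    groups = {}
    for vec, wid in zip(vectors, word_ids):
        groups.setdefault(wid, []).append(vec)
    return [groups[wid] for wid in sorted(groups)]


def compose_sum(subwords):
    """Element-wise sum of the sub-word vectors (illustrative choice)."""
    return [sum(dim) for dim in zip(*subwords)]


def compose_mean(subwords):
    """Element-wise mean of the sub-word vectors (illustrative choice)."""
    return [sum(dim) / len(subwords) for dim in zip(*subwords)]


def compose_rnn(subwords, w_in, w_rec):
    """Run a simple Elman RNN over the sub-word vectors in order and
    return the final hidden state as the word representation."""
    hidden = [0.0] * len(w_rec)
    for x in subwords:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(hidden))
        ]
    return hidden


def random_matrix(rows, cols, rng):
    """Small random weight matrix; a real model would learn these weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy input: four 2-dimensional sub-word vectors. The first three are assumed
# to belong to word 0 and the last one to word 1.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
word_ids = [0, 0, 0, 1]

words = group_by_word(vectors, word_ids)
rng = random.Random(0)
w_in = random_matrix(3, 2, rng)   # input-to-hidden weights (hidden size 3)
w_rec = random_matrix(3, 3, rng)  # hidden-to-hidden weights

summed = [compose_sum(w) for w in words]
averaged = [compose_mean(w) for w in words]
recurrent = [compose_rnn(w, w_in, w_rec) for w in words]
```

Unlike summation and averaging, the recurrent variant is sensitive to the order of the sub-word tokens and has parameters that can be trained jointly with the classifier.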

Anthology ID: 2020.udw-1.9
Volume: Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)
Month: December
Year: 2020
Address: Barcelona, Spain (Online)
Editors: Marie-Catherine de Marneffe, Miryam de Lhoneux, Joakim Nivre, Sebastian Schuster
Venue: UDW
Publisher: Association for Computational Linguistics
Pages: 76–86
URL: https://aclanthology.org/2020.udw-1.9
Cite (ACL): Adam Ek and Jean-Philippe Bernardy. 2020. Composing Byte-Pair Encodings for Morphological Sequence Classification. In Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020), pages 76–86, Barcelona, Spain (Online). Association for Computational Linguistics.
Cite (Informal): Composing Byte-Pair Encodings for Morphological Sequence Classification (Ek & Bernardy, UDW 2020)
PDF: https://preview.aclanthology.org/ml4al-ingestion/2020.udw-1.9.pdf
Code: adamlek/ud-morphological-tagging
