Jeremie Boudreau


2020

pdf
Evaluating the Impact of Sub-word Information and Cross-lingual Word Embeddings on Mi’kmaq Language Modelling
Jeremie Boudreau | Akankshya Patra | Ashima Suvarna | Paul Cook
Proceedings of the Twelfth Language Resources and Evaluation Conference

Mi’kmaq is an Indigenous language spoken primarily in Eastern Canada. It is polysynthetic and low-resource. In this paper we consider a range of n-gram and RNN language models for Mi’kmaq. We find that an RNN language model, initialized with pre-trained fastText embeddings, performs best, highlighting the importance of sub-word information for Mi’kmaq language modelling. We further consider approaches to language modelling that incorporate cross-lingual word embeddings, but do not see improvements with these models. Finally we consider language models that operate over segmentations produced by SentencePiece — which include sub-word units as tokens — as opposed to word-level models. We see improvements for this approach over word-level language models, again indicating that sub-word modelling is important for Mi’kmaq language modelling.