Evaluating the Impact of Sub-word Information and Cross-lingual Word Embeddings on Mi’kmaq Language Modelling

Jeremie Boudreau, Akankshya Patra, Ashima Suvarna, Paul Cook


Abstract
Mi’kmaq is an Indigenous language spoken primarily in Eastern Canada. It is polysynthetic and low-resource. In this paper, we consider a range of n-gram and RNN language models for Mi’kmaq. We find that an RNN language model initialized with pre-trained fastText embeddings performs best, highlighting the importance of sub-word information for Mi’kmaq language modelling. We further consider approaches to language modelling that incorporate cross-lingual word embeddings, but do not see improvements with these models. Finally, we consider language models that operate over segmentations produced by SentencePiece, which include sub-word units as tokens, as opposed to word-level models. We see improvements for this approach over word-level language models, again indicating that sub-word modelling is important for Mi’kmaq.
Anthology ID:
2020.lrec-1.333
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2736–2745
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.333
Cite (ACL):
Jeremie Boudreau, Akankshya Patra, Ashima Suvarna, and Paul Cook. 2020. Evaluating the Impact of Sub-word Information and Cross-lingual Word Embeddings on Mi’kmaq Language Modelling. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2736–2745, Marseille, France. European Language Resources Association.
Cite (Informal):
Evaluating the Impact of Sub-word Information and Cross-lingual Word Embeddings on Mi’kmaq Language Modelling (Boudreau et al., LREC 2020)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2020.lrec-1.333.pdf