Language Models for Code-switch Detection of te reo Māori and English in a Low-resource Setting

Jesin James, Vithya Yogarajan, Isabella Shields, Catherine Watson, Peter Keegan, Keoni Mahelona, Peter-Lucas Jones


Abstract
Te reo Māori, New Zealand’s only indigenous language, is code-switched with English. Māori speakers are at least bilingual, and the use of Māori is increasing in New Zealand English. Unfortunately, due to the minimal availability of resources, including digital data, Māori is under-represented in technological advances. Cloud-based multilingual systems such as Google and Microsoft Azure support Māori language detection. However, we provide experimental evidence to show that the accuracy of such systems is low when detecting Māori. Hence, with the support of the Māori community, we collect Māori and bilingual data to use natural language processing (NLP) to improve Māori language detection. We train bilingual sub-word embeddings and provide evidence to show that our bilingual embeddings improve overall accuracy compared to publicly available monolingual embeddings. This improvement has been verified for various NLP tasks using three bilingual databases containing formal transcripts and informal social media data. We also show that BiLSTM with pre-trained Māori-English sub-word embeddings outperforms large-scale contextual language models such as BERT on the downstream task of detecting the Māori language. However, this research uses the large models ‘as is’ for transfer learning, with no further training on Māori-English data. The best accuracy of 87% was obtained using BiLSTM with bilingual embeddings to detect Māori-English code-switching points.
Anthology ID:
2022.findings-naacl.49
Volume:
Findings of the Association for Computational Linguistics: NAACL 2022
Month:
July
Year:
2022
Address:
Seattle, United States
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
650–660
URL:
https://aclanthology.org/2022.findings-naacl.49
DOI:
10.18653/v1/2022.findings-naacl.49
Cite (ACL):
Jesin James, Vithya Yogarajan, Isabella Shields, Catherine Watson, Peter Keegan, Keoni Mahelona, and Peter-Lucas Jones. 2022. Language Models for Code-switch Detection of te reo Māori and English in a Low-resource Setting. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 650–660, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Language Models for Code-switch Detection of te reo Māori and English in a Low-resource Setting (James et al., Findings 2022)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2022.findings-naacl.49.pdf
Video:
https://preview.aclanthology.org/auto-file-uploads/2022.findings-naacl.49.mp4