Peter-Lucas Jones


2024

Solving Failure Modes in the Creation of Trustworthy Language Technologies
Gianna Leoni | Lee Steven | Tūreiti Keith | Keoni Mahelona | Peter-Lucas Jones | Suzanne Duncan
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024

To produce high-quality Natural Language Processing (NLP) technologies for low-resource languages, authentic leadership and participation from the low-resource language community is crucial. This reduces the chances of bias, surveillance and the inclusion of inaccurate data that can negatively impact output in language technologies. It also ensures that decision-making throughout the pipeline of work centres on the language community rather than only prioritising metrics. The NLP building process involves a range of steps and decisions to ensure the production of successful models and outputs. Rarely does a model perform as expected or desired the first time it is deployed for testing, resulting in the need for re-assessment and re-deployment. This paper discusses the process involved in solving failure modes for a Māori language automatic speech recognition (ASR) model. It explains how the data is curated and how language and data specialists offer unparalleled insight into the debugging process because of their knowledge of the data. This expertise has a significant influence on decision-making to ensure the entire pipeline is embedded in ethical practice and the work is culturally appropriate for the Māori language community, thus creating trustworthy language technology.

2022

Language Models for Code-switch Detection of te reo Māori and English in a Low-resource Setting
Jesin James | Vithya Yogarajan | Isabella Shields | Catherine Watson | Peter Keegan | Keoni Mahelona | Peter-Lucas Jones
Findings of the Association for Computational Linguistics: NAACL 2022

Te reo Māori, New Zealand’s only indigenous language, is code-switched with English. Māori speakers are at least bilingual, and the use of Māori is increasing in New Zealand English. Unfortunately, due to the minimal availability of resources, including digital data, Māori is under-represented in technological advances. Cloud-based multilingual systems such as Google and Microsoft Azure support Māori language detection. However, we provide experimental evidence to show that the accuracy of such systems is low when detecting Māori. Hence, with the support of the Māori community, we collect Māori and bilingual data to use natural language processing (NLP) to improve Māori language detection. We train bilingual sub-word embeddings and provide evidence to show that our bilingual embeddings improve overall accuracy compared to the publicly-available monolingual embeddings. This improvement has been verified for various NLP tasks using three bilingual databases containing formal transcripts and informal social media data. We also show that BiLSTM with pre-trained Māori-English sub-word embeddings outperforms large-scale contextual language models such as BERT on downstream tasks of detecting Māori language. However, this research uses large models ‘as is’ for transfer learning, where no further training was done on Māori-English data. The best accuracy of 87% was obtained using BiLSTM with bilingual embeddings to detect Māori-English code-switching points.

Developing a Part-Of-Speech tagger for te reo Māori
Aoife Finn | Peter-Lucas Jones | Keoni Mahelona | Suzanne Duncan | Gianna Leoni
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

This paper discusses the development of a Part-of-Speech tagger for te reo Māori, which is the Indigenous language of Aotearoa, also known as New Zealand, see Morrison. Henceforth, Part-of-Speech will be referred to as POS throughout this paper and te reo Māori will be referred to as Māori, while Universal Dependencies will be referred to as UD. Prior to the development of this tagger, there was no POS tagger for Māori from Aotearoa. POS taggers tag words according to their syntactic or grammatical category. However, many traditional syntactic categories, and by consequence POS labels, do not “work for” Māori. By this we mean that, for some of the traditional categories: the definition of, or guidelines for, an existing category is not suitable for Māori; they do not have an existing category for certain word classes of Māori; and they do not reflect a Māori worldview of the Māori language. We wanted a tagset that is usable with industry-wide tools, but we also needed a tagset that would meet the needs of Māori. Therefore, we based our tagset and guidelines on the UD tagset and tagging conventions; however, the categorization of words has been significantly altered to be appropriate for Māori. This is because, at the time of development of our POS tagger, the UD conventions had still not been used to tag a Polynesian language such as Māori, nor did they provide any guidelines about how to tag one. To that end, we worked with highly-proficient, specially-selected Māori speakers and linguists who are specialists in Māori. This has ensured that our POS labels and guidelines faithfully reflect a Māori speaker’s conceptualization of their language.

Annotating “Particles” in Multiword Expressions in te reo Māori for a Part-of-Speech Tagger
Aoife Finn | Suzanne Duncan | Peter-Lucas Jones | Gianna Leoni | Keoni Mahelona
Proceedings of the 18th Workshop on Multiword Expressions @LREC2022

This paper discusses the development of a Part-of-Speech tagger for te reo Māori, which is the Indigenous language of Aotearoa, also known as New Zealand. Te reo Māori is a particularly analytical and polysemic language. A word class called “particles” is introduced; these are small multi-functional words with many meanings, for example ē, ai, noa, rawa, mai, anō and koa. These “particles” are reflective of the analytical and polysemous nature of te reo Māori. They frequently occur both singly and in multiword expressions, including time adverbial phrases. The paper illustrates the challenges that they presented to part-of-speech tagging. It also discusses how we overcame these challenges in a way that is appropriate for te reo Māori, given its status as an Indigenous language and its history of colonisation. This includes a discussion of the importance of accurately reflecting the conceptualization of te reo Māori, and of how this involved making no linguistic presumptions and eliciting faithful judgements from speakers in a way that is uninfluenced by linguistic terminology.