Bovey Yu




2024

Development of Community-Oriented Text-to-Speech Models for Māori ‘Avaiki Nui (Cook Islands Māori)
Jesin James | Rolando Coto-Solano | Sally Akevai Nicholas | Joshua Zhu | Bovey Yu | Fuki Babasaki | Jenny Tyler Wang | Nicholas Derby
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper we describe the development of a text-to-speech (TTS) system for Māori ‘Avaiki Nui (Cook Islands Māori). We provide details about the process of community collaboration that was followed throughout the project, a continued engagement in which we are trying to develop speech and language technology for the benefit of the community. During this process we gathered a set of recordings that we used to train a TTS system. For training we used two approaches: the HMM-based system MaryTTS (Schröder et al., 2011) and the deep learning system FastSpeech2 (Ren et al., 2020). We performed two evaluation tasks on the models. First, we measured their quality by having the synthesized speech transcribed by ASR. The human-produced ground truth had the lowest error rates (CER=4.3, WER=18), and the FastSpeech2 audio had lower error rates (CER=11.8 and WER=42.7) than the MaryTTS voice (CER=17.9 and WER=48.1). The second evaluation was a survey amongst speakers of the language so they could judge the voice’s quality. The ground truth was rated with the highest quality (MOS=4.6), while the FastSpeech2 voice had an overall quality of MOS=3.2, which was significantly higher than that of the MaryTTS synthesized recordings (MOS=2.0). We intend to use the FastSpeech2 model to create language learning tools for community members both on the Cook Islands and in the diaspora.
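For readers unfamiliar with the ASR-based metrics reported in the abstract, the following is a minimal sketch of how character error rate (CER) and word error rate (WER) are commonly computed: the Levenshtein edit distance between a reference transcript and the ASR output, divided by the reference length and expressed as a percentage. This is not the authors' evaluation code. The function names are hypothetical, and the choices to split words on whitespace and to count spaces as characters are assumptions; the paper may normalize text differently.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance as a percentage of the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    # Assumption: words are whitespace-separated tokens.
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    # Assumption: every character, including spaces, counts.
    return error_rate(list(reference), list(hypothesis))
```

In this sketch, `reference` would be the text a recording was meant to say and `hypothesis` the transcript produced by the ASR system. Because the denominator is the reference length, a hypothesis with many inserted words can yield a rate above 100.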