Jesin James
2026
TTSVowelViz: A Tool for Visualising Text-to-Speech Model Training via Vowel Spaces
Pasindu Udawatta | Jesin James | Balamurali B T | Catherine Inez Watson | Ake Nicholas | Binu Nisal Abeysinghe
Proceedings of the Fifteenth Language Resources and Evaluation Conference
In text-to-speech (TTS) model training, the saturation of the loss curve indicates how well a model learns the characteristics of the training dataset. However, it does not reveal the linguistic properties learned by the model. Existing TTS approaches miss the potential to incorporate linguistic insights into model training. We introduce TTSVowelViz, a novel tool that visualises static and dynamic vowel spaces during model training, bridging linguistic knowledge and TTS model development. It helps identify which vowel sounds are accurately learned and how the vowel spaces evolve during training. To assess TTSVowelViz, we fine-tuned a TTS model from General American English to New Zealand English and conducted a perception test. Our results show that the formants of specific vowels in the vowel spaces generated by TTSVowelViz align with human perception, effectively visualising the perceived accent shift. This work highlights vowel space visualisation as a valuable interpretability tool for TTS training.
2025
Advocating Character Error Rate for Multilingual ASR Evaluation
Thennal D K | Jesin James | Deepa Padmini Gopinath | Muhammed Ashraf K
Findings of the Association for Computational Linguistics: NAACL 2025
Automatic speech recognition (ASR) systems have traditionally been evaluated using English datasets, with the word error rate (WER) serving as the predominant metric. WER’s simplicity and ease of interpretation have contributed to its widespread adoption, particularly for English. However, as ASR systems expand to multilingual contexts, WER fails in various ways, particularly with morphologically complex languages or those without clear word boundaries. Our work documents the limitations of WER as an evaluation metric and advocates for the character error rate (CER) as the primary metric in multilingual ASR evaluation. We show that CER avoids many of the challenges WER faces and exhibits greater consistency across writing systems. We support our proposition by conducting human evaluations of ASR transcriptions in three languages—Malayalam, English, and Arabic—which exhibit distinct morphological characteristics. We show that CER correlates more closely with human judgments than WER, even for English. To facilitate further research, we release our human evaluation dataset for future benchmarking of ASR metrics. Our findings suggest that CER should be prioritized, or at least supplemented, in multilingual ASR evaluations to account for the varying linguistic characteristics of different languages.
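The two metrics compared in this abstract can be illustrated with a minimal Python sketch. It assumes the standard definitions (Levenshtein edit distance divided by the reference length, counted over words for WER and over characters for CER); the function names `edit_distance`, `wer`, and `cer` are hypothetical, and the text normalisation used in the paper is not reproduced here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "the cat sat"
    hypothesis = "the cat sit"
    # One wrong character costs a whole word under WER but one character under CER.
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this toy pair, a single character substitution counts as one full word error out of three under WER, but as one character error out of eleven under CER. The same effect grows for languages with long, morphologically complex words.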
2024
Development of Community-Oriented Text-to-Speech Models for Māori ‘Avaiki Nui (Cook Islands Māori)
Jesin James | Rolando Coto-Solano | Sally Akevai Nicholas | Joshua Zhu | Bovey Yu | Fuki Babasaki | Jenny Tyler Wang | Nicholas Derby
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
In this paper we describe the development of a text-to-speech system for Māori ‘Avaiki Nui (Cook Islands Māori). We provide details about the process of community collaboration that was followed throughout the project, a continued engagement where we are trying to develop speech and language technology for the benefit of the community. During this process we gathered a group of recordings that we used to train a TTS system. For training we used two approaches: the HMM-based system MaryTTS (Schröder et al., 2011) and the deep learning system FastSpeech2 (Ren et al., 2020). We performed two evaluation tasks on the models. First, we measured their quality by having the synthesized speech transcribed by ASR. The human-produced ground truth had the lowest error rates (CER=4.3, WER=18), and the FastSpeech2 audio had lower error rates (CER=11.8 and WER=42.7) than the MaryTTS voice (CER=17.9 and WER=48.1). The second evaluation was a survey amongst speakers of the language so they could judge the voice’s quality. The ground truth was rated with the highest quality (MOS=4.6), while the FastSpeech2 voice had an overall quality of MOS=3.2, which was significantly higher than that of the MaryTTS synthesized recordings (MOS=2.0). We intend to use the FastSpeech2 model to create language learning tools for community members both on the Cook Islands and in the diaspora.
2022
Language Models for Code-switch Detection of te reo Māori and English in a Low-resource Setting
Jesin James | Vithya Yogarajan | Isabella Shields | Catherine Watson | Peter Keegan | Keoni Mahelona | Peter-Lucas Jones
Findings of the Association for Computational Linguistics: NAACL 2022
Te reo Māori, New Zealand’s only indigenous language, is code-switched with English. Māori speakers are at least bilingual, and the use of Māori is increasing in New Zealand English. Unfortunately, due to the minimal availability of resources, including digital data, Māori is under-represented in technological advances. Cloud-based multilingual systems such as Google and Microsoft Azure support Māori language detection. However, we provide experimental evidence to show that the accuracy of such systems is low when detecting Māori. Hence, with the support of the Māori community, we collect Māori and bilingual data to use natural language processing (NLP) to improve Māori language detection. We train bilingual sub-word embeddings and provide evidence to show that our bilingual embeddings improve overall accuracy compared to the publicly available monolingual embeddings. This improvement has been verified for various NLP tasks using three bilingual databases containing formal transcripts and informal social media data. We also show that BiLSTM with pre-trained Māori-English sub-word embeddings outperforms large-scale contextual language models such as BERT on downstream tasks of detecting Māori language. However, this research uses large models ‘as is’ for transfer learning, where no further training was done on Māori-English data. The best accuracy of 87% was obtained using BiLSTM with bilingual embeddings to detect Māori-English code-switching points.