2023
Automated speech recognition of Indonesian-English language lessons on YouTube using transfer learning
Zara Maxwell-Smith, Ben Foley
Proceedings of the Second Workshop on NLP Applications to Field Linguistics
Experiments to fine-tune large multilingual models with limited data from a specific domain or setting have the potential to improve automatic speech recognition (ASR) outcomes. This paper reports on the use of the Elpis ASR pipeline to fine-tune two pre-trained base models, Wav2Vec2-XLSR-53 and Wav2Vec2-Large-XLSR-Indonesian, with various mixes of data from three YouTube channels teaching Indonesian with English as the language of instruction. We discuss our results on inference over new lesson audio (22-46% word error rate) in the context of speeding up data collection in diverse and specialised settings. This study is an example of how ASR can be used to accelerate natural language research, expanding ethically sourced data in low-resource settings.
2022
Scoping natural language processing in Indonesian and Malay for education applications
Zara Maxwell-Smith, Michelle Kohler, Hanna Suominen
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Indonesian and Malay are underrepresented in the development of natural language processing (NLP) technologies, and available resources are difficult to find. A clear picture of existing work can invigorate and inform how researchers conceptualise worthwhile projects. Using an education sector project to motivate the study, we conducted a wide-ranging overview of Indonesian and Malay human language technologies and corpus work. We charted 657 included studies according to Hirschberg and Manning’s 2015 description of NLP, concluding that the field was dominated by exploratory corpus work, machine reading of text gathered from the Internet, and sentiment analysis. In this paper, we identify the most published authors and research hubs, and make a number of recommendations to encourage future collaboration and efficiency within NLP in Indonesian and Malay.
2021
Developing ASR for Indonesian-English Bilingual Language Teaching
Zara Maxwell-Smith, Ben Foley
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
Usage-based analyses of teacher corpora and code-switching (Boztepe, 2003) are an important next stage in understanding language acquisition. Multilingual corpora are difficult to compile, and a classroom setting adds pedagogy to the mix of factors which make this data so rich and problematic to classify. Using quantitative methods to understand language learning and teaching is difficult work, as the ‘transcription bottleneck’ constrains the size of datasets. We found that using an automatic speech recognition (ASR) toolkit with a small set of training data is likely to speed data collection in this context (Maxwell-Smith et al., 2020).
Fossicking in dominant language teaching: Javanese and Indonesian ‘low’ varieties in language teaching resources
Zara Maxwell-Smith
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)
2020
Applications of Natural Language Processing in Bilingual Language Teaching: An Indonesian-English Case Study
Zara Maxwell-Smith, Simón González Ochoa, Ben Foley, Hanna Suominen
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
Multilingual corpora are difficult to compile, and a classroom setting adds pedagogy to the mix of factors which make this data so rich and problematic to classify. In this paper, we set out methodological considerations of using automated speech recognition to build a corpus of teacher speech in an Indonesian language classroom. Our preliminary results (64% word error rate) suggest these tools have the potential to speed data collection in this context. We provide practical examples of our data structure, details of our piloted computer-assisted processes, and fine-grained error analysis. Our study is informed and directed by genuine research questions and discussion in both the education and computational linguistics fields. We highlight some of the benefits and risks of using these emerging technologies to analyze the complex work of language teachers and in education more generally.