Yaroslav Getman
2025
Towards large-scale speech foundation models for a low-resource minority language
Yaroslav Getman | Tamás Grósz | Katri Hiovain-Asikainen | Tommi Lehtonen | Mikko Kurimo
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Modern ASR systems require massive amounts of training data. Because transcribed ASR training data are scarce and expensive to produce for most languages, a practical solution is to collect large amounts of raw untranscribed speech and pre-train the ASR model in a self-supervised manner. Unfortunately, for many low-resource minority languages, even untranscribed speech data are scarce. In this paper, we propose a solution for the Northern Sámi language with 22,400 hours of speech extracted from the Finnish radio and television archives. We evaluated the model performance with different decoding algorithms and examined the models’ internal behavior with interpretation-based techniques.
2024
Collecting Linguistic Resources for Assessing Children’s Pronunciation of Nordic Languages
Anne Marte Haug Olstad | Anna Smolander | Sofia Strömbergsson | Sari Ylinen | Minna Lehtonen | Mikko Kurimo | Yaroslav Getman | Tamás Grósz | Xinwei Cao | Torbjørn Svendsen | Giampiero Salvi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper reports on the experience of collecting several corpora of Nordic languages spoken by children. The aim of the data collection is to provide annotated data for developing and evaluating computer-assisted pronunciation assessment systems, both for non-native children learning a Nordic language (L2) and for L1 children with speech sound disorder (SSD). The paper presents the challenges encountered in recording and annotating data for Finnish, Swedish and Norwegian, as well as the ethical considerations related to making this data publicly available. We hope that sharing this experience will encourage others to collect similar data for other languages. Of the different data collections, we were able to make the Norwegian corpus publicly available, in the hope that it will serve as a reference in pronunciation assessment research.