Olga Iakovenko
2026
CS-YODAS: A Mined Dataset of In-the-Wild Code-Switched Speech
Brian Yan | Qingzheng Wang | Matthew Wiesner | Anuj Diwan | Olga Iakovenko | Alex Polok | Injy Hamed | Shuichiro Shimizu | Iris Emerman | Thomas Hain | David R. Mortensen | Peter Viechnicki | Shinji Watanabe
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present CS-YODAS, a Creative Commons dataset of in-the-wild code-switched speech mined from multilingual YouTube data. Code-switching (CS), or the alternation between languages within an utterance or conversation, is common in multilingual settings but remains underrepresented in existing CS speech resources, which are typically small, domain-specific, or artificially constructed. Building on the YODAS corpus, we develop a scalable, human-in-the-loop pipeline for identifying and validating naturally occurring code-switching. The resulting dataset, which totals 313 hours and spans 7 matrix languages, provides diverse, real-world examples of spontaneous code-switched speech. We further analyze the distribution and characteristics of code-switching in the wild, examining language-pair frequencies and switching patterns, and report baseline results for spoken language identification. We hope that CS-YODAS will encourage broader and more comprehensive research on code-switched speech. Dataset link: https://huggingface.co/datasets/byan/cs-yodas.
2024
Methods of Automatic Matrix Language Determination for Code-Switched Speech
Olga Iakovenko | Thomas Hain
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Code-switching (CS) is the practice of speakers alternating between two or more languages, and it is becoming increasingly common in the modern world. To better describe CS speech, the Matrix Language Frame (MLF) theory introduces the concept of a Matrix Language (ML), the language that provides the grammatical structure for a CS utterance. In this work, the MLF theory was used to develop systems for Matrix Language Identity (MLID) determination. The MLID of English/Mandarin and English/Spanish CS text and speech was compared to acoustic language identity (LID), which is the typical way to identify the language of a monolingual utterance. MLID predictors from audio show higher correlation with the textual principles than LID in all cases, while also outperforming LID in an MLID recognition task in terms of macro F1 (60%) and correlation score (0.38). This novel approach found that the non-English languages (Mandarin and Spanish) are preferred over English as the ML, contrary to the monolingual choice of LID.