CS-YODAS: A Mined Dataset of In-the-Wild Code-Switched Speech
Brian Yan, Qingzheng Wang, Matthew Wiesner, Anuj Diwan, Olga Iakovenko, Alex Polok, Injy Hamed, Shuichiro Shimizu, Iris Emerman, Thomas Hain, David R. Mortensen, Peter Viechnicki, Shinji Watanabe
Abstract
We present CS-YODAS, a Creative Commons dataset of in-the-wild code-switched speech mined from multilingual YouTube data. Code-switching, or the alternation between languages within an utterance or conversation, is common in multilingual settings but remains underrepresented in existing CS speech resources, which are typically small, domain-specific, or artificially constructed. Building on the YODAS corpus, we develop a scalable, human-in-the-loop pipeline for identifying and validating naturally occurring code-switching. The resulting dataset, which totals 313 hrs and spans 7 matrix languages, provides diverse, real-world examples of spontaneous code-switched speech. We further analyze the distribution and characteristics of code-switching in the wild, examining language-pair frequencies and switching patterns, and report baseline results for spoken language identification. We hope that CS-YODAS will encourage broader and more comprehensive research on code-switched speech. Dataset link: https://huggingface.co/datasets/byan/cs-yodas.
- Anthology ID:
- 2026.lrec-main.456
- Volume:
- Proceedings of the Fifteenth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2026
- Address:
- Palma de Mallorca, Spain
- Editors:
- Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
- Venue:
- LREC
- Publisher:
- ELRA Language Resource Association
- Pages:
- 5776–5784
- URL:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.456/
- Cite (ACL):
- Brian Yan, Qingzheng Wang, Matthew Wiesner, Anuj Diwan, Olga Iakovenko, Alex Polok, Injy Hamed, Shuichiro Shimizu, Iris Emerman, Thomas Hain, David R. Mortensen, Peter Viechnicki, and Shinji Watanabe. 2026. CS-YODAS: A Mined Dataset of In-the-Wild Code-Switched Speech. International Conference on Language Resources and Evaluation, main:5776–5784.
- Cite (Informal):
- CS-YODAS: A Mined Dataset of In-the-Wild Code-Switched Speech (Yan et al., LREC 2026)
- PDF:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.456.pdf