Peter Viechnicki
2026
CS-YODAS: A Mined Dataset of In-the-Wild Code-Switched Speech
Brian Yan | Qingzheng Wang | Matthew Wiesner | Anuj Diwan | Olga Iakovenko | Alex Polok | Injy Hamed | Shuichiro Shimizu | Iris Emerman | Thomas Hain | David R. Mortensen | Peter Viechnicki | Shinji Watanabe
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present CS-YODAS, a Creative Commons dataset of in-the-wild code-switched speech mined from multilingual YouTube data. Code-switching, or the alternation between languages within an utterance or conversation, is common in multilingual settings but remains underrepresented in existing CS speech resources, which are typically small, domain-specific, or artificially constructed. Building on the YODAS corpus, we develop a scalable, human-in-the-loop pipeline for identifying and validating naturally occurring code-switching. The resulting dataset, which totals 313 hrs and spans 7 matrix languages, provides diverse, real-world examples of spontaneous code-switched speech. We further analyze the distribution and characteristics of code-switching in the wild, examining language-pair frequencies and switching patterns, and report baseline results for spoken language identification. We hope that CS-YODAS will encourage broader and more comprehensive research on code-switched speech. Dataset link: https://huggingface.co/datasets/byan/cs-yodas.
Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech
Kaavya Chaparala | Thomas Thebaud | Jesus Villalba Lopez | Laureano Moro-Velazquez | Peter Viechnicki | Najim Dehak
Proceedings of the Fifteenth Language Resources and Evaluation Conference
There are not enough established benchmarks for the task of speech summarization. Creating new benchmarks demands human annotation, as LLMs could embed systemic errors and bias into datasets. We test ten annotation workflows varying input modality (audio, transcript, or both) and the inclusion of editing (self or peer-editing) to investigate potential quality tradeoffs from using human annotators to summarize audio. We compare human audio-based summaries to human transcript-based summaries to track the impact of the different information modalities on summary quality. We also compare the human outputs against four LLM benchmarks (three text, one audio) to examine whether human-written summaries are less informative than highly fluent automated outputs. We find that audio-based summaries are less informative and more compressed than transcript summaries. However, iterative peer-editing with audio mitigates this difference, enabling audio-based summaries to be as informative as their transcript counterparts and LLM summaries. These findings validate iterative peer-editing among human annotators for the creation of benchmarks informed by both lexical and prosodic information. This enables crucial dataset collection even in settings where transcripts are unavailable.
2025
ParaBLoCC: Parallel Basic Locative Constructions Corpus
Peter Viechnicki | Anthony Kostacos
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
We introduce ParaBLoCC, the Parallel Basic Locative Construction Corpus, the first multilingual compendium of this important grammatico-functional construction, and particularly the first such corpus containing semantically equivalent BLCs in source/target language pairs. The data – taken from bitext corpora in English paired with twenty-six typologically diverse languages – are likely to prove useful for studying questions of cognitive underpinnings and cross-linguistic usage patterns of spatial expressions, as well as for improving multilingual spatial relation extraction and related tasks. The data are being made available at https://github.com/pviechnicki/parablocc.
2024
Large-Scale Bitext Corpora Provide New Evidence for Cognitive Representations of Spatial Terms
Peter Viechnicki | Kevin Duh | Anthony Kostacos | Barbara Landau
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent evidence from cognitive science suggests that there exist two classes of cognitive representations within the spatial terms of a language, one represented geometrically (e.g., above, below) and the other functionally (e.g., on, in). It has been hypothesized that geometric terms are more constrained and are mastered relatively early in language learning, whereas functional terms are less constrained and are mastered over longer time periods (Landau, 2016). One consequence of this hypothesis is that these two classes should exhibit different cross-linguistic variability, which is supported by human elicitation studies. In this work we present, to our knowledge, the first corpus-based empirical test of this hypothesis. We develop a pipeline for extracting, isolating, and aligning spatial terms in basic locative constructions from parallel text. Using Shannon entropy to measure the variability of spatial term use across eight languages, we find supporting evidence that variability in functional terms differs significantly from that of geometric terms. We also perform latent variable modeling and find support for the division of spatial terms into geometric and functional classes.
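As an illustration of the entropy measure mentioned in the abstract above, the following minimal sketch computes the Shannon entropy of the target-language terms aligned to a single English spatial term. The abstract does not describe the exact procedure, so the function name `shannon_entropy` and the toy alignment lists are assumptions made for illustration only and are not data or code from the paper.

```python
from collections import Counter
from math import log2


def shannon_entropy(aligned_terms):
    """Return the Shannon entropy (in bits) of a list of aligned terms.

    Higher entropy means the source term is rendered by a more varied
    set of target-language terms.
    """
    counts = Counter(aligned_terms)
    total = sum(counts.values())
    # H = sum over terms of p * log2(1 / p)
    return sum((c / total) * log2(total / c) for c in counts.values())


# Hypothetical toy alignments (NOT taken from the paper): each list holds
# the target-language terms aligned to one English spatial term.
aligned_to_above = ["sobre", "sobre", "sobre", "sobre"]  # always the same term
aligned_to_on = ["en", "sobre", "a", "en"]               # several different terms

entropy_above = shannon_entropy(aligned_to_above)
entropy_on = shannon_entropy(aligned_to_on)

print(f"above: {entropy_above:.2f} bits")
print(f"on:    {entropy_on:.2f} bits")
```

In this toy case the term with a single consistent translation has zero entropy, while the term spread across three translations has higher entropy, which is the kind of contrast the variability comparison relies on.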