Nancy Yiu


2020

pdf
SpiCE: A New Open-Access Corpus of Conversational Bilingual Speech in Cantonese and English
Khia A. Johnson | Molly Babel | Ivan Fong | Nancy Yiu
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper describes the design, collection, orthographic transcription, and phonetic annotation of SpiCE, a new corpus of conversational Cantonese-English bilingual speech recorded in Vancouver, Canada. The corpus includes high-quality recordings of 34 early bilinguals in both English and Cantonese—to date, 27 have been recorded for a total of 19 hours of participant speech. Participants completed a sentence reading task, storyboard narration, and conversational interview in each language. Transcription and annotation for the corpus are currently underway. Transcripts produced with Google Cloud Speech-to-Text are available for all participants, and will be included in the initial SpiCE corpus release. Hand-corrected orthographic transcripts and force-aligned phonetic transcripts will be released periodically, and upon completion for all recordings, comprise the second release of the corpus. As an open-access language resource, SpiCE will promote bilingualism research for a typologically distinct pair of languages, of which Cantonese remains understudied despite there being millions of speakers around the world. The SpiCE corpus is especially well-suited for phonetic research on conversational speech, and enables researchers to study cross-language within-speaker phenomena for a diverse group of early Cantonese-English bilinguals. These are areas with few existing high-quality resources.