Li Nguyen

2020

CanVEC - the Canberra Vietnamese-English Code-switching Natural Speech Corpus
Li Nguyen | Christopher Bryant
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper introduces the Canberra Vietnamese-English Code-switching corpus (CanVEC), an original corpus of natural mixed speech that we semi-automatically annotated with language information, part-of-speech (POS) tags and Vietnamese translations. The corpus, which was built to inform a sociolinguistic study on language variation and code-switching, consists of 10 hours of recorded speech (87k tokens) among 45 Vietnamese-English bilinguals living in Canberra, Australia. We describe how we collected the corpus and how we annotated it by pipelining several monolingual toolkits, which considerably sped up the annotation process. We also describe how we evaluated the automatic annotations to ensure corpus reliability. We make the corpus available for research purposes.
