Abstract
This paper introduces the Canberra Vietnamese-English Code-switching corpus (CanVEC), an original corpus of natural mixed speech that we semi-automatically annotated with language information, part of speech (POS) tags and Vietnamese translations. The corpus, which was built to inform a sociolinguistic study on language variation and code-switching, consists of 10 hours of recorded speech (87k tokens) between 45 Vietnamese-English bilinguals living in Canberra, Australia. We describe how we collected and annotated the corpus by pipelining several monolingual toolkits to considerably speed up the annotation process. We also describe how we evaluated the automatic annotations to ensure corpus reliability. We make the corpus available for research purposes.
- Anthology ID:
- 2020.lrec-1.507
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 4121–4129
- Language:
- English
- URL:
- https://aclanthology.org/2020.lrec-1.507
- Cite (ACL):
- Li Nguyen and Christopher Bryant. 2020. CanVEC - the Canberra Vietnamese-English Code-switching Natural Speech Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4121–4129, Marseille, France. European Language Resources Association.
- Cite (Informal):
- CanVEC - the Canberra Vietnamese-English Code-switching Natural Speech Corpus (Nguyen & Bryant, LREC 2020)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2020.lrec-1.507.pdf