CantoMap: a Hong Kong Cantonese MapTask Corpus

Grégoire Winterstein, Carmen Tang, Regine Lai


Abstract
This work reports on the construction of a corpus of connected spoken Hong Kong Cantonese. The corpus aims to provide an additional resource for the study of modern (Hong Kong) Cantonese and also involves several controlled elicitation tasks which will serve different projects related to the phonology and semantics of Cantonese. The word-segmented corpus offers recordings, phonemic transcription, and Chinese character transcription. The corpus contains a total of 768 minutes of recordings and transcripts from forty speakers. All the audio material has been aligned at the utterance level with the transcriptions, using the ELAN transcription and annotation tool. The controlled elicitation task was based on the design of the HCRC MapTask corpus (Anderson et al., 1991), in which participants had to communicate using solely verbal means as eye contact was restricted. In this paper, we outline the design of the maps and their landmarks, the basic segmentation principles of the data, and the various transcription conventions we adopted. We also compare the contents of CantoMap to those of comparable Cantonese corpora.
Anthology ID:
2020.lrec-1.355
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2906–2913
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.355
Cite (ACL):
Grégoire Winterstein, Carmen Tang, and Regine Lai. 2020. CantoMap: a Hong Kong Cantonese MapTask Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2906–2913, Marseille, France. European Language Resources Association.
Cite (Informal):
CantoMap: a Hong Kong Cantonese MapTask Corpus (Winterstein et al., LREC 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.lrec-1.355.pdf