Xiuying Wang

2017

Building Large Chinese Corpus for Spoken Dialogue Research in Specific Domains
Changliang Li | Xiuying Wang
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Corpora are a valuable resource for information retrieval and data-driven natural language processing systems, especially for spoken dialogue research in specific domains. However, there are few non-English corpora, particularly in Chinese. Spoken by the nation with the largest population in the world, Chinese is becoming increasingly prevalent and popular among millions of people worldwide. In this paper, we build a large-scale, high-quality Chinese corpus called CSDC (Chinese Spoken Dialogue Corpus). It covers five domains and contains more than 140 thousand dialogues in all. Unlike other corpora, each sentence in this corpus is additionally annotated with slot information. To the best of our knowledge, this is the largest Chinese spoken dialogue corpus, as well as the first one with slot information. With this corpus, we propose a method and conduct a well-designed experiment. The indicative result is reported at the end.

Co-authors

Changliang Li 1

Venues

ijcnlp 1