Haruka Amatani


Design and Evaluation of the Corpus of Everyday Japanese Conversation
Hanae Koiso | Haruka Amatani | Yasuharu Den | Yuriko Iseki | Yuichi Ishimoto | Wakako Kashino | Yoshiko Kawabata | Ken’ya Nishikawa | Yayoi Tanaka | Yasuyuki Usuda | Yuka Watanabe
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We constructed the Corpus of Everyday Japanese Conversation (CEJC) and published it in March 2022. The CEJC is designed to contain a balanced range of everyday conversations so as to capture their diversity. It provides not only audio but also video data, to facilitate precise understanding of the mechanisms of real-life social behavior. Publishing a large-scale corpus of everyday conversations that includes video data is a new approach. The CEJC contains 200 hours of speech, 577 conversations, about 2.4 million words, and a total of 1675 conversants. In this paper, we present an overview of the corpus, including the recording method and devices, the structure of the corpus, the formats of the video and audio files, transcription, and annotations. We then report results of an evaluation of the CEJC in terms of conversant and conversation attributes. We show that the CEJC includes a good balance of adult conversants in terms of gender and age, as well as a variety of conversations in terms of conversation form, place, activity, and number of conversants.