Abstract
In this paper, we describe the design and development of a new version of the Corpus of Spontaneous Japanese (CSJ), a large-scale spoken corpus released in 2004. CSJ contains various annotations represented in XML format (CSJ-XML). CSJ-XML, however, is very complicated and suffers from several problems. To overcome these problems, we developed a relational database version of CSJ (CSJ-RDB) and released it in 2013. CSJ-RDB is based on an extension of the segment and link-based annotation scheme, which we adapted to handle multi-channel and multi-modal streams. Because this scheme adopts a stand-off framework, CSJ-RDB can represent three hierarchical structures at the same time: inter-pausal-unit-top, clause-top, and intonational-phrase-top. CSJ-RDB consists of five types of tables: segment, unaligned-segment, link, relation, and meta-information tables. The database was automatically constructed from annotation files extracted from CSJ-XML by using general-purpose corpus construction tools. CSJ-RDB enables us to easily and efficiently conduct complex searches required for corpus-based studies of spoken language.
- Anthology ID:
- L14-1371
- Volume:
- Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
- Month:
- May
- Year:
- 2014
- Address:
- Reykjavik, Iceland
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 1471–1476
- URL:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/432_Paper.pdf
- Cite (ACL):
- Hanae Koiso, Yasuharu Den, Ken’ya Nishikawa, and Kikuo Maekawa. 2014. Design and development of an RDB version of the Corpus of Spontaneous Japanese. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1471–1476, Reykjavik, Iceland. European Language Resources Association (ELRA).
- Cite (Informal):
- Design and development of an RDB version of the Corpus of Spontaneous Japanese (Koiso et al., LREC 2014)
- PDF:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/432_Paper.pdf
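
The stand-off idea in the abstract, where segments are stored once and separate links place them in several hierarchies at the same time, can be illustrated with a minimal sketch. This is not the actual CSJ-RDB schema: the table layout, column names (`layer`, `hierarchy`, `start_time`, and so on), labels, and the helper functions `build_demo_db` and `children` are all assumptions made for illustration. Only two of the five table types (segment and link) are modeled, using Python's standard `sqlite3` module.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical segment and link tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE segment (
            id         INTEGER PRIMARY KEY,
            layer      TEXT NOT NULL,   -- assumed layer names: 'ipu', 'clause', 'word'
            start_time REAL NOT NULL,   -- seconds from the start of the recording
            end_time   REAL NOT NULL,
            label      TEXT
        );
        CREATE TABLE link (
            parent_id  INTEGER NOT NULL REFERENCES segment(id),
            child_id   INTEGER NOT NULL REFERENCES segment(id),
            hierarchy  TEXT NOT NULL    -- assumed names: 'ipu-top', 'clause-top'
        );
    """)
    # Toy data: three words, two inter-pausal units, one clause.
    conn.executemany("INSERT INTO segment VALUES (?, ?, ?, ?, ?)", [
        (1, "ipu", 0.0, 1.2, "IPU-1"),
        (2, "ipu", 1.5, 2.4, "IPU-2"),
        (3, "clause", 0.0, 2.4, "CL-1"),
        (4, "word", 0.0, 0.6, "w1"),
        (5, "word", 0.6, 1.2, "w2"),
        (6, "word", 1.5, 2.4, "w3"),
    ])
    # The same word segments are linked into two hierarchies without duplication.
    conn.executemany("INSERT INTO link VALUES (?, ?, ?)", [
        (1, 4, "ipu-top"), (1, 5, "ipu-top"), (2, 6, "ipu-top"),
        (3, 4, "clause-top"), (3, 5, "clause-top"), (3, 6, "clause-top"),
    ])
    return conn


def children(conn: sqlite3.Connection, parent_label: str, hierarchy: str) -> list[str]:
    """Return labels of segments linked under a parent in one hierarchy, in time order."""
    rows = conn.execute(
        """
        SELECT c.label
        FROM link AS l
        JOIN segment AS p ON p.id = l.parent_id
        JOIN segment AS c ON c.id = l.child_id
        WHERE p.label = ? AND l.hierarchy = ?
        ORDER BY c.start_time
        """,
        (parent_label, hierarchy),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_demo_db()
    print(children(db, "IPU-1", "ipu-top"))
    print(children(db, "CL-1", "clause-top"))
```

In this toy layout, a word row exists once in `segment`, and the `hierarchy` column of `link` decides whether it is viewed under an inter-pausal unit or under a clause. A search that crosses both views then becomes an ordinary SQL join rather than a traversal of nested XML.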