2020
KOTONOHA: A Corpus Concordance System for Skewer-Searching NINJAL Corpora
Teruaki Oka | Yuichi Ishimoto | Yutaka Yagi | Takenori Nakamura | Masayuki Asahara | Kikuo Maekawa | Toshinobu Ogiso | Hanae Koiso | Kumiko Sakoda | Nobuko Kibe
Proceedings of the Twelfth Language Resources and Evaluation Conference
The National Institute for Japanese Language and Linguistics, Japan (NINJAL), has developed several types of corpora. For each corpus, NINJAL has provided an online search environment, ‘Chunagon’, a concordance system based on morphological annotation that was made publicly available in 2011. NINJAL has now released a skewer-search system, ‘Kotonoha’, based on the ‘Chunagon’ systems. This system enables multiple corpora to be queried together by categories such as register type and period.
2016
‘BonTen’ – Corpus Concordance System for ‘NINJAL Web Japanese Corpus’
Masayuki Asahara | Kazuya Kawahara | Yuya Takei | Hideto Masuoka | Yasuko Ohba | Yuki Torii | Toru Morii | Yuki Tanaka | Kikuo Maekawa | Sachi Kato | Hikari Konishi
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations
The National Institute for Japanese Language and Linguistics, Japan (NINJAL), has undertaken a corpus compilation project to construct a ten-billion-word web corpus for linguistic research. The project is divided into four parts: page collection, linguistic analysis, development of the corpus concordance system, and preservation. This article presents the corpus concordance system, named ‘BonTen’, which enables the ten-billion-word corpus to be queried by string, by a sequence of morphological information, or by a subtree of the syntactic dependency structure.
2014
Design and development of an RDB version of the Corpus of Spontaneous Japanese
Hanae Koiso | Yasuharu Den | Ken’ya Nishikawa | Kikuo Maekawa
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper, we describe the design and development of a new version of the Corpus of Spontaneous Japanese (CSJ), a large-scale spoken corpus released in 2004. CSJ contains various annotations that are represented in XML format (CSJ-XML). CSJ-XML, however, is very complicated and suffers from several problems. To overcome these problems, we developed and released, in 2013, a relational database version of CSJ (CSJ-RDB). CSJ-RDB is based on an extension of the segment and link-based annotation scheme, which we adapted to handle multi-channel and multi-modal streams. Because this scheme adopts a stand-off framework, CSJ-RDB can represent three hierarchical structures at the same time: inter-pausal-unit-top, clause-top, and intonational-phrase-top. CSJ-RDB consists of five types of tables: segment, unaligned-segment, link, relation, and meta-information tables. The database was automatically constructed from annotation files extracted from CSJ-XML using general-purpose corpus construction tools. CSJ-RDB enables us to conduct, easily and efficiently, the complex searches required for corpus-based studies of spoken language.
BCCWJ-TimeBank: Temporal and Event Information Annotation on Japanese Text
Masayuki Asahara | Sachi Kato | Hikari Konishi | Mizuho Imada | Kikuo Maekawa
International Journal of Computational Linguistics & Chinese Language Processing, Volume 19, Number 3, September 2014
2013
BCCWJ-TimeBank: Temporal and Event Information Annotation on Japanese Text
Masayuki Asahara | Sachi Yasuda | Hikari Konishi | Mizuho Imada | Kikuo Maekawa
Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27)
2012
Prediction of Non-Linguistic Information of Spontaneous Speech from the Prosodic Annotation: Evaluation of the X-JToBI system
Kikuo Maekawa
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Speakers' gender and age group were predicted using the symbolic information of the X-JToBI prosodic labelling scheme as applied to the Core of the Corpus of Spontaneous Japanese (44 hours, 155 speakers, 201 talks). The correct prediction rate for speaker gender by logistic regression analysis was about 80%, and the correct discrimination rate for speaker age group (4 groups) by linear discriminant analysis was about 50%. These results, in conjunction with the previously reported prediction of 4 speech registers from the X-JToBI information, convincingly show the superiority of X-JToBI over the traditional J_ToBI. The mechanism by which gender and/or age-group information is reflected in the symbolic representations of prosody largely remains an open question, although some preliminary analyses are presented in the current paper.
2010
Design, Compilation, and Preliminary Analyses of Balanced Corpus of Contemporary Written Japanese
Kikuo Maekawa | Makoto Yamazaki | Takehiko Maruyama | Masaya Yamaguchi | Hideki Ogura | Wakako Kashino | Toshinobu Ogiso | Hanae Koiso | Yasuharu Den
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Compilation of a 100-million-word balanced corpus called the Balanced Corpus of Contemporary Written Japanese (BCCWJ) is underway at the National Institute for Japanese Language and Linguistics. The corpus covers a wide range of text genres, including books, magazines, newspapers, governmental white papers, textbooks, minutes of the National Diet, internet text (bulletin boards and blogs), and so forth, and, when possible, samples are drawn from rigidly defined statistical populations by means of random sampling. All texts are dually POS-analyzed based on two different, but mutually related, definitions of word. Currently, more than 90 million words have been sampled and XML-annotated with respect to text structure and lexical and character information. A preliminary linear discriminant analysis of text genres using POS frequencies and sentence length revealed that it was possible to classify the text genres with a correct identification rate of 88% as far as the samples of books, newspapers, white papers, and internet bulletin boards are concerned. When the samples of blogs were included in this data set, however, the identification rate went down to 68%, suggesting considerable variance in the blog texts in terms of textual register and style.
Two-level Annotation of Utterance-units in Japanese Dialogs: An Empirically Emerged Scheme
Yasuharu Den | Hanae Koiso | Takehiko Maruyama | Kikuo Maekawa | Katsuya Takanashi | Mika Enomoto | Nao Yoshida
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper, we propose a scheme for annotating utterance-level units in Japanese dialogs, which emerged from an analysis of the interrelationship among four schemes: i) inter-pausal units, ii) intonation units, iii) clause units, and iv) pragmatic units. The associations among the labels of these four units were illustrated by multiple correspondence analysis and hierarchical cluster analysis. Based on these results, we prescribe utterance-unit identification rules, which identify two sorts of utterance-units with different granularities: short and long utterance-units. Short utterance-units are identified by acoustic and prosodic disjuncture, and they are considered to constitute units of the speaker's planning and the hearer's understanding. Long utterance-units, on the other hand, are recognized by syntactic and pragmatic disjuncture, and they are regarded as units of interaction. We explore some characteristics of these utterance-units, focusing particularly on unit duration and syntactic properties, other participants' responses, and mismatches between the two levels. We also discuss how our two-level utterance-units are useful in analyzing cognitive and communicative aspects of spoken dialogs.
2008
Balanced Corpus of Contemporary Written Japanese
Kikuo Maekawa
Proceedings of the 6th Workshop on Asian Language Resources
2000
Spontaneous Speech Corpus of Japanese
Kikuo Maekawa | Hanae Koiso | Sadaoki Furui | Hitoshi Isahara
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
1999
Evaluation of Annotation Schemes for Japanese Discourse
Japanese Discourse Tagging Working Group
A. Ichikawa | M. Araki | Y. Horiuchi | M. Ishizaki | S. Itabashi | W. Itoh | H. Kashioka | K. Kato | H. Kikuchi | H. Koiso | T. Kumagai | A. Kurematsu | K. Maekawa | S. Nakazato | M. Tamoto | S. Tutiya | Y. Yamashita | W. Yoshimura
Towards Standards and Tools for Discourse Tagging