This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
KikuoMaekawa
Also published as:
K. Maekawa
Fixing paper assignments
Please select all papers that do not belong to this person.
Indicate below which author they should be assigned to.
The National Institute for Japanese Language and Linguistics, Japan (NINJAL, Japan), has developed several types of corpora. For each corpus NINJAL provided an online search environment, ‘Chunagon’, which is a morphological-information-annotation-based concordance system made publicly available in 2011. NINJAL has now provided a skewer-search system ‘Kotonoha’ based on the ‘Chunagon’ systems. This system enables querying of multiple corpora by certain categories, such as register type and period.
The National Institute for Japanese Language and Linguistics, Japan (NINJAL) has undertaken a corpus compilation project to construct a web corpus for linguistic research comprising ten billion words. The project is divided into four parts: page collection, linguistic analysis, development of the corpus concordance system, and preservation. This article presents the corpus concordance system named ‘BonTen’ which enables the ten-billion-scaled corpus to be queried by string, a sequence of morphological information or a subtree of the syntactic dependency structure.
In this paper, we describe the design and development of a new version of the Corpus of Spontaneous Japanese (CSJ), which is a large-scale spoken corpus released in 2004. CSJ contains various annotations that are represented in XML format (CSJ-XML). CSJ-XML, however, is very complicated and suffers from some problems. To overcome this problem, we have developed and released, in 2013, a relational database version of CSJ (CSJ-RDB). CSJ-RDB is based on an extension of the segment and link-based annotation scheme, which we adapted to handle multi-channel and multi-modal streams. Because this scheme adopts a stand-off framework, CSJ-RDB can represent three hierarchical structures at the same time: inter-pausal-unit-top, clause-top, and intonational-phrase-top. CSJ-RDB consists of five different types of tables: segment, unaligned-segment, link, relation, and meta-information tables. The database was automatically constructed from annotation files extracted from CSJ-XML by using general-purpose corpus construction tools. CSJ-RDB enables us to easily and efficiently conduct complex searches required for corpus-based studies of spoken language.
Speakers' gender and age-group were predicted using the symbolic information of the X-JToBI prosodic labelling scheme as applied to the Core of the Corpus of Spontaneous Japanese (44 hours, 155 speakers, 201 talks). The correct prediction rate of speaker gender by means of logistic regression analysis was about 80%, and, the correct discrimination rate of speaker age-group (4 groups) by means of linear discriminant analysis was about 50 %. These results, in conjunction with the previously reported result of the prediction experiment of 4 speech registers from the X-JToBI information, shows convincingly the superiority of X-JToBI over the traditional J_ToBI. Clarification of the mechanism by which gender- and/or age-group information were reflected in the symbolic representations of prosody largely remains as open question, although some preliminary analyses were presented in the current paper.
Compilation of a 100 million words balanced corpus called the Balanced Corpus of Contemporary Written Japanese (or BCCWJ) is underway at the National Institute for Japanese Language and Linguistics. The corpus covers a wide range of text genres including books, magazines, newspapers, governmental white papers, textbooks, minutes of the National Diet, internet text (bulletin board and blogs) and so forth, and when possible, samples are drawn from the rigidly defined statistical populations by means of random sampling. All texts are dually POS-analyzed based upon two different, but mutually related, definitions of word. Currently, more than 90 million words have been sampled and XML annotated with respect to text-structure and lexical and character information. A preliminary linear discriminant analysis of text genres using the data of POS frequencies and sentence length revealed it was possible to classify the text genres with a correct identification rate of 88% as far as the samples of books, newspapers, whitepapers, and internet bulletin boards are concerned. When the samples of blogs were included in this data set, however, the identification rate went down to 68%, suggesting the considerable variance of the blog texts in terms of the textual register and style.
In this paper, we propose a scheme for annotating utterance-level units in Japanese dialogs, which emerged from an analysis of the interrelationship among four schemes, i) inter-pausal units, ii) intonation units, iii) clause units, and iv) pragmatic units. The associations among the labels of these four units were illustrated by multiple correspondence analysis and hierarchical cluster analysis. Based on these results, we prescribe utterance-unit identification rules, which identify two sorts of utterance-units with different granularities: short and long utterance-units. Short utterance-units are identified by acoustic and prosodic disjuncture, and they are considered to constitute units of speaker's planning and hearer's understanding. Long utterance-units, on the other hand, are recognized by syntactic and pragmatic disjuncture, and they are regarded as units of interaction. We explore some characteristics of these utterance-units, focusing particularly on unit duration and syntactic property, other participants' responses, and mismatch between the two-levels. We also discuss how our two-level utterance-units are useful in analyzing cognitive and communicative aspects of spoken dialogs.