Keiichi Takamaru


2026

We present a longitudinal cross-regional corpus of Japanese prefectural assembly minutes spanning 12 years (2011-2023) and three electoral terms. The corpus comprises 12,236,974 records containing 743,147,226 characters (471,496,688 tokens) of transcribed remarks from the plenary sessions of all 47 prefectural assemblies in Japan. Each dataset is organized by speaker, with assembly members linked to their electoral information, including gender, age, and electoral district. Through a comparative analysis across the three terms, we document significant temporal changes: the proportion of members aged 25-44 decreased, whereas female representation increased. Female members use 20-30% more characters per speech than their male counterparts across all age groups. The proportion of members who never speak varies from under 2% for younger females to over 10% for males aged 65+. We demonstrate the utility of the corpus through three applications: a quantitative analysis of gender and age patterns in political discourse, AI-driven computational dialectology for extracting regional linguistic features, and a web-based search and visualization system. This longitudinal cross-regional corpus provides a valuable resource for interdisciplinary research on subnational politics, computational linguistics, dialectology, and political communication in non-Western democracies. The datasets are available for research purposes upon request, with public query access provided through a web-based interface.

2024

In this paper, we introduce a new dataset for Stance Classification based on assembly minutes. We develop it using publicly available minutes taken from diverse Japanese local governments, including prefectural, city, and town assemblies. So that the task is to predict a stance from the content of a politician’s utterance without explicit stance expressions, predefined words that directly convey the speaker’s stance in the utterance are replaced by a special token. Those masked words are also used to assign a gold label, either agreement or disagreement, to the utterance. In total, we constructed 15,018 instances automatically from 47 Japanese local governments. The dataset is used in the shared Stance Classification task evaluated in the NTCIR-17 QA-Lab-PoliInfo-4, and is now publicly available. Since the construction method is automatic, it can be applied to obtain more instances from other Japanese local governments.
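The masking-and-labeling step described in this abstract can be illustrated with a minimal sketch. The word list, the mask token, and the function name `build_instance` are all hypothetical, since the abstract does not give the actual predefined words or the token used. The sketch also assumes that utterances containing cues for both labels are skipped, which the abstract does not state.

```python
import re

# Hypothetical stance lexicon: the real predefined word list is not given
# in the abstract. Each cue word maps to the label it implies.
STANCE_WORDS = {"賛成": "agreement", "反対": "disagreement"}

# Hypothetical special token that replaces the cue words.
MASK_TOKEN = "[STANCE]"

_CUE_PATTERN = re.compile("|".join(re.escape(w) for w in STANCE_WORDS))


def build_instance(utterance):
    """Return (masked_text, label), or None if no single label applies."""
    # Collect the labels implied by every cue word found in the utterance.
    labels = {label for word, label in STANCE_WORDS.items() if word in utterance}
    # Assumption: skip utterances with no cue, or with conflicting cues.
    if len(labels) != 1:
        return None
    # Replace every cue word so the label cannot be read off the surface text.
    masked = _CUE_PATTERN.sub(MASK_TOKEN, utterance)
    return masked, labels.pop()


if __name__ == "__main__":
    print(build_instance("私はこの議案に賛成します。"))
```

Because the label comes from the same words that are masked, no manual annotation is needed, which is what allows the construction to be extended to further local governments.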

2020

In this study, we construct a corpus of Japanese local assembly minutes. All speeches in an assembly are transcribed into local assembly minutes in accordance with the Local Autonomy Law. The local assembly minutes therefore form an extremely large body of text data. Our ultimate objectives are to summarize and present the arguments made in the assemblies, and to use the minutes as primary information for arguments in local politics. To achieve this, we structured all statements in the assembly minutes. We focused on the structure of the discussion, i.e., the extraction of question and answer pairs. We organized the shared task “QA Lab-PoliInfo” in NTCIR 14. As a subtask of the shared task, we conducted a “segmentation task” to identify the scope of one question and answer in the minutes. For the segmentation task, 24 runs from five teams were submitted. Among the submitted runs, the best recall was 1.000, the best precision was 0.940, and the best F-measure was 0.895.

2016

This paper describes a Japanese political corpus created for interdisciplinary political research. The corpus contains the local assembly minutes of 47 prefectures from April 2011 to March 2015. This four-year period coincides with the term of office for assembly members in most local governments. We analyze statistical data, such as the number of speakers, characters, and words, to clarify the characteristics of local assembly minutes. In addition, we identify problems associated with the different web services used by the local governments to make the minutes available to the public.