Hokuto Ototake


2026

We present a longitudinal cross-regional corpus of Japanese prefectural assembly minutes spanning 12 years (2011-2023) and three electoral terms. The corpus comprises 12,236,974 records containing 743,147,226 characters (471,496,688 tokens) of transcribed remarks from the plenary sessions of all 47 prefectural assemblies in Japan. Each dataset is organized by speaker, with assembly members linked to their electoral information, including gender, age, and electoral district. Through a comparative analysis across the three terms, we document significant temporal changes: the proportion of members aged 25-44 decreased, whereas female representation increased. Female members use 20-30% more characters per speech than their male counterparts across all age groups. The proportion of members who never speak varies from under 2% for younger females to over 10% for males aged 65+. We demonstrate the utility of the corpus through three applications: a quantitative analysis of gender and age patterns in political discourse, AI-driven computational dialectology for extracting regional linguistic features, and a web-based search and visualization system. The corpus provides a valuable resource for interdisciplinary research on subnational politics, computational linguistics, dialectology, and political communication in non-Western democracies. The datasets are available for research purposes upon request, with public query access provided through a web-based interface.
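As an illustration of the two speaker-level statistics reported above (characters per speech, and the share of members who never speak), the following is a minimal sketch. The record layout, the function name `speech_stats`, and the sample data are assumptions made for this example; they are not the corpus's actual schema or the authors' analysis code.

```python
from collections import defaultdict


def speech_stats(members, speeches):
    """Compute per-group speech statistics.

    members:  dict mapping a hypothetical member ID to a (gender, age_group) tuple.
    speeches: list of (member_id, text) pairs, one per recorded remark.

    Returns (mean_chars, silent_share), both keyed by (gender, age_group).
    """
    chars = defaultdict(int)   # total characters spoken per group
    counts = defaultdict(int)  # number of speeches per group
    speakers = set()           # members with at least one speech

    for member_id, text in speeches:
        group = members[member_id]
        chars[group] += len(text)
        counts[group] += 1
        speakers.add(member_id)

    # Mean characters per speech, only for groups that have speeches.
    mean_chars = {g: chars[g] / counts[g] for g in counts}

    totals = defaultdict(int)  # members per group
    silent = defaultdict(int)  # members per group with no speech at all
    for member_id, group in members.items():
        totals[group] += 1
        if member_id not in speakers:
            silent[group] += 1

    # Share of members in each group who never speak.
    silent_share = {g: silent[g] / totals[g] for g in totals}
    return mean_chars, silent_share


# Toy data, invented for illustration only.
members = {
    "a": ("F", "25-44"),
    "b": ("M", "65+"),
    "c": ("M", "65+"),
}
speeches = [("a", "abcd"), ("a", "ab"), ("b", "abc")]
mean_chars, silent_share = speech_stats(members, speeches)
```

Counting with `len(text)` measures characters rather than tokens, which matches the per-speech unit quoted in the abstract; a token-based variant would need a Japanese tokenizer, which this sketch does not assume.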
To address challenges in objectivity and efficiency in evaluating the quality of generative AI chatbots, we developed an automatic evaluation framework using the "LLM-as-a-judge" approach. A User Simulator, built with In-Context Learning and LoRA tuning, was employed to generate pseudo-conversation logs of the fan-engagement application OSHIAI. These logs were then automatically evaluated by a Judge LLM across six dimensions, and the contribution of this method to quality management in real-world services was verified.

2022

Budget argument mining identifies argumentative components related to a budget item and then classifies them, given budget information and minutes. We describe the construction of the dataset for budget argument mining, a subtask of QA Lab-PoliInfo-3 in NTCIR-16. Budget argument mining analyses the argument structure of the minutes, focusing on monetary expressions (amounts of money). In this task, given sufficient budget information (budget item, budget amount, etc.), relevant argumentative components in the minutes are identified and argument labels (claim, premise, and other) are assigned to these components. In this paper, we describe the design of the data format, the annotation procedure, and the release information of the budget argument mining dataset, which links budget information to minutes.

2020

In this study, we construct a corpus of Japanese local assembly minutes. All speeches in an assembly are transcribed into local assembly minutes, as required by the Local Autonomy Law. The local assembly minutes therefore form an extremely large body of text data. Our ultimate objectives are to summarize and present the arguments made in the assemblies, and to use the minutes as primary information for arguments in local politics. To achieve this, we structured all statements in the assembly minutes. We focused on the structure of the discussion, i.e., the extraction of question and answer pairs. We organized the shared task “QA Lab-PoliInfo” in NTCIR-14. As a subtask of the shared task, we conducted a “segmentation task” to identify the scope of one question and answer in the minutes. For the segmentation task, 24 runs from five teams were submitted. Among these runs, the best recall was 1.000, the best precision was 0.940, and the best F-measure was 0.895.
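For readers unfamiliar with the three scores quoted above, the following is a minimal sketch of how precision, recall, and F-measure can be computed for a segmentation run. It assumes that a segment is a `(start, end)` pair and that only exact matches count as correct; the shared task's official scoring may be defined differently, and the function name and sample spans are hypothetical.

```python
def precision_recall_f(gold, predicted):
    """Score predicted segments against gold segments.

    gold, predicted: iterables of (start, end) spans. A prediction is
    counted as correct only if it exactly matches a gold span (an
    assumption of this sketch, not necessarily the task's rule).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0

    # F-measure is the harmonic mean of precision and recall.
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented spans for illustration only.
gold_spans = [(0, 3), (4, 9), (10, 12)]
predicted_spans = [(0, 3), (4, 8)]
p, r, f = precision_recall_f(gold_spans, predicted_spans)
```

Under this definition, a single run with recall 1.000 and precision 0.940 would have an F-measure of about 0.969, so the reported best values presumably come from different runs.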

2016

This paper describes a Japanese political corpus created for interdisciplinary political research. The corpus contains the local assembly minutes of 47 prefectures from April 2011 to March 2015. This four-year period coincides with the term of office of assembly members in most local governments. We analyze statistical data, such as the number of speakers, characters, and words, to clarify the characteristics of local assembly minutes. In addition, we identify problems associated with the different web services that local governments use to make the minutes available to the public.