Kouta Nakayama


2026

Since texts generated by large language models (LLMs) may contain misinformation (hallucinations), developing fact-checking systems capable of assessing their veracity has become increasingly important. One of the mainstream approaches to fact-checking is the claim-based one, which first decomposes a generated text into claims, i.e., independent and atomic units of information. Each claim is then used as a query to retrieve supporting evidence, and a verdict is predicted for each claim-evidence pair. Conducting fact-checking at the claim level enhances the explainability of verification results. However, achieving highly accurate verification requires that the text be decomposed into claims at an appropriate level of granularity. To address this, we constructed a dataset for Japanese claim decomposition. As part of this dataset construction, we designed detailed guidelines for claim decomposition, ensuring that the extracted claims are in a form useful for fact-checking and that the decomposition rules mitigate annotator variability. Quantitative evaluation confirmed that the constructed dataset is of high quality. Additionally, experiments on prompt-based claim decomposition using the constructed dataset demonstrated that adding high-quality few-shot examples and guidelines to prompts improved performance.
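As an illustration of the claim-based pipeline described in this abstract (not the paper's actual implementation), the Python sketch below wires together the three stages: decomposition into atomic claims, per-claim evidence retrieval, and verdict prediction. The `decompose`, `retrieve`, and `verify` callables are hypothetical placeholders for whatever components a concrete system would plug in.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ClaimVerdict:
    claim: str            # an atomic, self-contained unit of information
    evidence: list[str]   # passages retrieved with the claim as the query
    verdict: str          # e.g. "supported", "refuted", "not enough info"

def claim_level_fact_check(
    text: str,
    decompose: Callable[[str], list[str]],     # text -> atomic claims
    retrieve: Callable[[str], list[str]],      # claim -> evidence passages
    verify: Callable[[str, list[str]], str],   # (claim, evidence) -> verdict
) -> list[ClaimVerdict]:
    """Decompose, retrieve, and verify, producing one verdict per claim."""
    results = []
    for claim in decompose(text):
        evidence = retrieve(claim)
        results.append(ClaimVerdict(claim, evidence, verify(claim, evidence)))
    return results
```

Because each verdict is tied to a single claim and its evidence, the output can be inspected claim by claim, which is the explainability benefit the abstract refers to.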
Instruction following, the ability to generate text that aligns with human intent, is a core capability of large language models (LLMs) for real-world applications. Instruction tuning is widely used to obtain this capability, but it requires large amounts of annotated data. To reduce the labor and cost of large-scale annotation, data augmentation using LLMs has been proposed as a promising approach. As this approach has primarily been applied to English datasets, its effectiveness in other languages, such as Japanese, remains unclear. In this paper, we propose an automatic pipeline for generating instruction and preference datasets in Japanese. The instruction dataset is created by expanding a manually annotated dataset using an LLM. The preference dataset is then constructed by adding LLM-generated negative examples to the instruction dataset. To ensure the quality of the datasets, instructions and responses are evaluated using LLM-as-a-Judge and ROUGE-L. Experimental results using supervised fine-tuning and direct preference optimization demonstrate that these synthetic datasets improve the instruction-following capability in Japanese.
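The pipeline above filters synthetic instructions and responses with LLM-as-a-Judge and ROUGE-L. As a rough illustration of the ROUGE-L side only, here is a minimal, self-contained sketch of token-level ROUGE-L used to drop near-duplicate generated instructions; the `keep_if_novel` helper and the 0.7 similarity threshold are assumptions for illustration, not values taken from the paper.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence between two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """Token-level ROUGE-L F1 between a reference and a candidate string."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def keep_if_novel(new_instruction: str, pool: list[str], threshold: float = 0.7) -> bool:
    """Accept an LLM-generated instruction only if it is not too similar to any kept one."""
    return all(rouge_l_f1(kept, new_instruction) < threshold for kept in pool)
```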
In this paper we present JLLMSafety, a dataset for promoting the safety of Japanese LLM outputs. The dataset consists of 1,800 pairs of questions and reference answers, where the questions require special attention in answering. It covers a wide range of risk categories established in prior English-language datasets, but the data samples are original in that they are manually curated to reflect the socio-cultural context of LLM usage in Japan. We show that using this dataset for instruction tuning of a Japanese LLM led to improved output safety without compromising the utility of general responses. We also report the results of a safety evaluation of 12 Japanese LLMs using this dataset as a benchmark. Finally, we discuss the significance of creating regionally specific datasets for LLM safety, and describe the meta tags we added to the dataset to facilitate the creation of similar datasets in different languages and regions. The dataset is made publicly available for the sole purpose of improving LLM safety, without any other usage restrictions.

2025

Automatic evaluation of generative tasks using large language models faces challenges due to ambiguous criteria. Although automatic checklist generation is a potentially promising approach, its usefulness remains underexplored. We investigate whether checklists should be used for all questions or selectively, generate them using six methods, evaluate their effectiveness across eight model sizes, and identify checklist items that correlate with human evaluations. Through experiments on pairwise comparison and direct scoring tasks, we find that selective checklist use tends to improve evaluation performance in pairwise settings, while its benefits are less consistent in direct scoring. Our analysis also shows that even checklist items with low correlation to human scores often reflect human-written criteria, indicating potential inconsistencies in human evaluation. These findings highlight the need to more clearly define objective evaluation criteria to guide both human and automatic evaluations. Our code is available at https://github.com/momo0817/checklist-effectiveness-study.
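To make the "selective checklist use" idea concrete, the sketch below shows one hypothetical way a pairwise LLM-as-a-judge comparison could branch between checklist-based scoring and a direct preference. The decision function, checklist generator, and judging callables are placeholders, and the tie-breaking rule is an assumption rather than the protocol used in the paper.

```python
from typing import Callable

def pairwise_judge(
    question: str,
    answer_a: str,
    answer_b: str,
    needs_checklist: Callable[[str], bool],           # selective use: checklist or not?
    generate_checklist: Callable[[str], list[str]],   # question -> checklist items
    judge_item: Callable[[str, str, str], int],       # (item, answer, question) -> 0/1
    judge_direct: Callable[[str, str, str], str],     # (question, a, b) -> "A" or "B"
) -> str:
    """Return 'A' or 'B'; checklist items act as explicit, per-question criteria."""
    if not needs_checklist(question):
        return judge_direct(question, answer_a, answer_b)
    items = generate_checklist(question)
    score_a = sum(judge_item(item, answer_a, question) for item in items)
    score_b = sum(judge_item(item, answer_b, question) for item in items)
    if score_a == score_b:                  # assumed tie-break: fall back to direct judgment
        return judge_direct(question, answer_a, answer_b)
    return "A" if score_a > score_b else "B"
```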

2022

This paper describes a resource of Wikipedias in 31 languages categorized into Extended Named Entity (ENE), which has 219 fine-grained NE categories. We first categorized 920K Japanese Wikipedia pages according to the ENE scheme using machine learning, followed by manual validation. We then organized a shared task on Wikipedia categorization in 30 languages. The training data were provided by the Japanese categorization and the language links, and the task was to categorize the Wikipedia pages in the 30 languages that have no language links from Japanese Wikipedia (20M pages in total). Thirteen groups with 24 systems participated in the 2020 and 2021 tasks, sharing their outputs for resource building. The Japanese categorization accuracy was 98.5%, and the best performance among the 30 languages ranged from 80 to 93 in F-measure. Using ensemble learning, we created outputs with an average F-measure of 86.8, which is 1.7 points higher than the best single systems. The total size of the resource is 32.5M pages, including the training data. We call this resource creation scheme “Resource by Collaborative Contribution (RbCC)”. We also constructed structuring tasks (attribute extraction and link prediction) using RbCC under our ongoing project, “SHINRA.”
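The abstract does not specify how the ensemble over participant outputs was built; purely as an illustration of one common instantiation, the sketch below combines per-system page categorizations by (optionally accuracy-weighted) voting. The function and argument names are hypothetical.

```python
from collections import Counter, defaultdict

def ensemble_by_voting(
    predictions: dict[str, dict[str, str]],    # system name -> {page_id -> ENE category}
    weights: dict[str, float] | None = None,   # e.g. each system's dev-set F-measure
) -> dict[str, str]:
    """Combine participant outputs by (optionally weighted) voting, one label per page."""
    weights = weights or {system: 1.0 for system in predictions}
    votes: dict[str, Counter] = defaultdict(Counter)
    for system, labels in predictions.items():
        for page_id, category in labels.items():
            votes[page_id][category] += weights[system]
    return {page_id: counter.most_common(1)[0][0] for page_id, counter in votes.items()}
```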

2021

Shared tasks have a long history and have become mainstream in NLP research. Most shared tasks require participants to submit only system outputs and descriptions; it is uncommon for a shared task to request submission of the system itself because of license issues and implementation differences. As a result, many systems are abandoned without being used in real applications or contributing to better systems. In this research, we propose a scheme to utilize all the systems that participated in a shared task. In this scheme, we use the outputs of all participating systems as teachers and develop a new model as a student that aims to learn the characteristics of each system. We call this scheme “Co-Teaching.” This scheme creates a unified system that performs better than the task’s single best system, and it requires only the system outputs plus little extra effort from the participants and organizers. We apply this scheme to the “SHINRA2019-JP” shared task, which has nine participants with various output accuracies, and confirm that the unified system outperforms the best single system. The code used in our experiments has also been released.
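The abstract does not detail how the student learns the characteristics of each participating system; one plausible, simplified reading is to turn the teachers' outputs into soft label distributions (here weighted by each teacher's estimated accuracy, which is an assumption) and train the student against them. The sketch below covers only that aggregation step; all names are hypothetical.

```python
from collections import defaultdict

def build_soft_targets(
    teacher_outputs: dict[str, dict[str, str]],   # system -> {example_id -> predicted label}
    teacher_accuracy: dict[str, float],           # per-system accuracy on a held-out set
) -> dict[str, dict[str, float]]:
    """Aggregate participant outputs into soft label distributions for training a student."""
    raw: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for system, labels in teacher_outputs.items():
        for example_id, label in labels.items():
            raw[example_id][label] += teacher_accuracy[system]
    # normalise each example's accumulated votes into a probability distribution
    return {
        example_id: {label: w / sum(dist.values()) for label, w in dist.items()}
        for example_id, dist in raw.items()
    }
```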