Katsumasa Yoshikawa


2026

Large Language Models (LLMs) sometimes generate inconsistent answers when asked semantically equivalent questions expressed with different wordings. Such inconsistency may lead to decreased task performance or excessive agreement with users. This study investigates how question wording influences the answer consistency of LLMs, focusing on binary Yes/No questions. We design four types of paraphrasing patterns, namely synonym substitution, antonym substitution, addition of agreement-seeking expressions, and strengthened agreement-seeking expressions, and evaluate their impact on model outputs. Experiments with multiple open-source and commercial LLMs show that many models become more likely to answer "Yes" when agreement-seeking expressions are included, and that they are particularly vulnerable to antonym substitutions. Our analysis further suggests that some of these tendencies are already present in pretrained models and are not fully removed by post-training. We also provide insights into which factors are likely (or unlikely) to contribute to improving consistency. By providing a systematic evaluation framework, this work highlights the necessity of accounting for wording-induced biases in the development and deployment of LLMs.
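As a concrete illustration of this kind of evaluation (a sketch, not the paper's actual code), the following measures how each paraphrasing pattern shifts a model's Yes-rate relative to the original wording; the question templates, topic list, and `ask()` function are hypothetical stand-ins.

```python
# Hypothetical stand-ins for the four paraphrasing patterns; the
# concrete templates are illustrative assumptions, not the paper's prompts.
PATTERNS = {
    "original":          "Is {x} healthy?",
    "synonym":           "Is {x} good for your health?",
    "antonym":           "Is {x} unhealthy?",            # expected answer flips
    "agreement_seeking": "{x} is healthy, right?",
    "strong_agreement":  "{x} is definitely healthy, isn't it?",
}

def yes_rate(ask, topics, template, flip=False):
    """Fraction of 'Yes' answers over topics. flip=True inverts the
    answer for antonym paraphrases, so a perfectly consistent model
    scores the same as on the original wording."""
    yes = 0
    for x in topics:
        answer = ask(template.format(x=x))  # ask() returns "Yes" or "No"
        if flip:
            answer = "No" if answer == "Yes" else "Yes"
        yes += answer == "Yes"
    return yes / len(topics)

def consistency_shifts(ask, topics):
    """Per-pattern shift in Yes-rate; a positive value indicates the
    wording pushes the model toward answering 'Yes'."""
    base = yes_rate(ask, topics, PATTERNS["original"])
    return {name: yes_rate(ask, topics, tpl, flip=(name == "antonym")) - base
            for name, tpl in PATTERNS.items()}
```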
We present a persona-aware evaluation suite that couples a 12-category cognitive-bias benchmark with 100 applied financial framing tasks to assess how large language models (LLMs) respond under systematically varied persona conditions. Using a factorized set of 162 personas spanning gender, age, political orientation, income, and education, we analyze how persona conditioning modulates bias-consistent responding across ten instruction-tuned models. On applied tasks, persona conditioning reduces framing reversals on average and slightly increases decision confidence, with substantial variation across model families and scales. Correlation analyses further reveal that benchmark bias tendencies—particularly availability, social proof, and framing—predict applied framing sensitivity, suggesting that standardized bias scores can serve as indicators of real-world decision variability. This work provides a unified framework for linking cognitive-bias evaluation with persona-conditioned decision behavior in LLMs. (All data and prompts will be released after acceptance to preserve anonymity.)
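The 162-persona grid can be reproduced as a simple cross product over attribute levels. The sketch below assumes two gender values and three levels for each remaining attribute (2 × 3 × 3 × 3 × 3 = 162), which matches the reported count but is not necessarily the paper's exact taxonomy; the level names and prompt template are illustrative.

```python
from itertools import product

# Assumed attribute levels; any 2x3x3x3x3 factorization yields the
# 162 personas reported above. The level names are illustrative.
ATTRIBUTES = {
    "gender":    ["male", "female"],
    "age":       ["young", "middle-aged", "senior"],
    "politics":  ["liberal", "moderate", "conservative"],
    "income":    ["low", "middle", "high"],
    "education": ["high school", "college", "graduate"],
}

def personas():
    keys = list(ATTRIBUTES)
    for values in product(*ATTRIBUTES.values()):
        yield dict(zip(keys, values))

def condition_prompt(persona, task):
    """Hypothetical persona-conditioning template prepended to a task."""
    p = persona
    desc = (f"You are a {p['age']} {p['gender']} person with {p['politics']} "
            f"political views, a {p['income']} income, and a "
            f"{p['education']} education.")
    return f"{desc}\n\n{task}"

assert len(list(personas())) == 162
```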

2025

We propose a simple yet effective method for enhancing persona consistency in dialogue response generation using Direct Preference Optimization (DPO). In our method, we generate responses from the response generation model using persona information randomly swapped in from other dialogues, and treat these responses as pseudo-negative samples. The reference responses serve as positive samples, allowing us to create pseudo-preference data. Experimental results demonstrate that our model, fine-tuned with DPO on the pseudo-preference data, produces more consistent and natural responses than models trained with supervised fine-tuning or with reinforcement learning based on entailment relations between personas and utterances.
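A minimal sketch of this pseudo-preference construction, assuming each dialogue carries persona/context/response fields and `generate()` wraps the response generation model; all names are illustrative, not the paper's code.

```python
import random

def build_pseudo_preference_data(dialogues, generate, seed=0):
    """Pair each reference response (chosen) with a response generated
    under a persona randomly swapped in from another dialogue (rejected).
    The resulting triples can be fed to a standard DPO trainer."""
    rng = random.Random(seed)
    data = []
    for d in dialogues:
        other = rng.choice([o for o in dialogues if o is not d])
        negative = generate(persona=other["persona"], context=d["context"])
        data.append({
            "prompt":   f"persona: {d['persona']}\ncontext: {d['context']}",
            "chosen":   d["response"],   # reference response as positive sample
            "rejected": negative,        # persona-inconsistent pseudo-negative
        })
    return data
```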

2024

When individuals engage in spoken discourse, various phenomena arise that differ from those seen in text-based conversation. While written communication commonly uses a question mark to denote a query, in spoken discourse queries are frequently indicated by a rising intonation at the end of a sentence. However, many speech recognition engines do not append a question mark to recognized queries, which presents a challenge when building a spoken dialogue system. Specifically, the absence of a question mark at the end of a sentence can impede the generation of appropriate responses to queries. Hence, we investigate the impact of question marks on dialogue systems, and the results show that they have a significant impact. Moreover, we analyze specific examples to determine which types of utterances affect dialogue systems most.
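The core comparison can be sketched as a small harness that queries the system with and without a trailing question mark; `respond()` is a hypothetical stand-in for the dialogue system, not the study's actual setup.

```python
def question_mark_impact(respond, asr_utterances):
    """For each ASR hypothesis, compare the system's response with and
    without a trailing question mark and report how often they diverge."""
    diverged = []
    for utt in asr_utterances:
        plain = utt.rstrip("?")   # as emitted by many ASR engines
        if respond(plain) != respond(plain + "?"):
            diverged.append(utt)
    return len(diverged) / len(asr_utterances), diverged
```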

2023

With the ambition of creating avatars capable of human-level casual conversation, we developed an open-domain avatar chatbot, situated in a virtual reality environment, that employs a large language model (LLM). Introducing the LLM posed several challenges for multimodal integration, such as aligning its diverse outputs with avatar control and addressing its slow generation speed. To address these challenges, we integrated various external modules into our system. Our system is based on the award-winning model from the Dialogue System Live Competition 5. Through this work, we hope to stimulate discussion within the research community about the potential and challenges of multimodal dialogue systems enhanced with LLMs.

2017

This paper presents our system submitted to the CoNLL 2017 Shared Task, "Multilingual Parsing from Raw Text to Universal Dependencies." We ran the system for all languages with our own fully pipelined components, without relying on re-trained baseline systems. To train the dependency parser, we used only the universal part-of-speech tags and the distance between words, and applied deterministic rules to assign dependency labels. Such simple, delexicalized models are well suited to cross-lingual transfer approaches and to a universal language model. Experimental results show that our model performed well on some metrics, and we discuss topics such as the contribution of each component and syntactic similarities among languages.
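To make the delexicalized idea concrete, the sketch below scores head-dependent arcs from universal POS tags and clipped signed distance alone, then attaches each word greedily. The scoring table is a hypothetical stand-in for the trained parser's weights, and greedy attachment is a simplification of the actual decoding.

```python
def arc_score(tags, head, dep, table):
    """Delexicalized arc score from UPOS tags and clipped signed distance.
    `table` is a hypothetical stand-in for learned model weights."""
    dist = max(-5, min(5, dep - head))
    return table.get((tags[head], tags[dep], dist), 0.0)

def greedy_parse(upos, table):
    """Attach each word to its highest-scoring head (index 0 = ROOT).
    Greedy attachment does not guarantee a well-formed tree."""
    tags = ["ROOT"] + list(upos)
    return [max((h for h in range(len(tags)) if h != d),
                key=lambda h: arc_score(tags, h, d, table))
            for d in range(1, len(tags))]
```

Because the features contain no word forms, a model trained on one language can be applied to any other language with UPOS-tagged input, which is what makes this setup attractive for cross-lingual transfer.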

2012

2011

2009