Atsushi Otsuka
2026
Multi-dimensional Evaluation of Character-Authentic Dialogue Models Learned from Question-Answer Data
Atsushi Otsuka | Kazuya Matsuo | Kenta Hama | Masahiro Mizukami | Tsunehiro Arimoto | Hiroaki Sugiyama | Makoto Nakatsuji | Narichika Nomoto
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Character-authentic dialogue remains challenging for large language models (LLMs) due to limited character-specific data, generic-style collapse, and hallucinations regarding persona facts. Our work presents a comparative evaluation of several learning strategies for character dialogue grounded in question–answer (QA) data, comparing zero/few-shot prompting, supervised fine-tuning (SFT), direct preference optimization (DPO), and a hybrid approach that integrates retrieval-augmented character profiles and knowledge with policy optimization. Using both single-turn and multi-turn settings, we assess multiple dimensions central to character dialogue quality: reproducibility, diversity, hallucination, and character authenticity. Results show that SFT excels in reproducibility and hallucination reduction but tends to shorten and simplify outputs, thereby reducing diversity and authenticity. DPO improves stylistic fidelity and authenticity but depends strongly on externalized character knowledge to limit hallucinations. The hybrid variant that combines character-knowledge retrieval with DPO achieves the best overall balance, delivering strong authenticity while maintaining factual consistency and competitive reproducibility in both single- and multi-turn dialogues. We further analyze the model’s sensitivity to knowledge retrieval and response-length effects and discuss trade-offs among optimization targets that inform practical design choices for developing faithful and engaging character agents trained from scalable QA resources.
Topic-Initiator: A Proactive Chatbot with Personalized Topic RAG for Enhancing Willingness to Converse
Kazuya Matsuo | Atsushi Otsuka | Narichika Nomoto | Makoto Nakatsuji
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Stimulating users’ willingness to converse remains a major challenge in chatbot research. Most existing chatbots respond passively to user inputs, relying on users to select conversation topics, which often reduces their willingness to engage. To address this issue, we propose Topic-Initiator, a proactive chatbot that initiates conversations with new topics aligned to user interests. It gathers information from external sources (e.g., the web) to obtain potentially novel and engaging topics. To support this capability, we also introduce a novel Retrieval-Augmented Generation (RAG) framework, Personalized-Topic RAG (PT-RAG), designed to retrieve new and interesting topics for each user. Unlike existing RAG methods that fail to surface unseen information, PT-RAG leverages the inference capabilities of Large Language Models (LLMs) to identify content that matches the user’s interests but is not yet known to them. Specifically, PT-RAG estimates a user’s interests and knowledge from past interactions and organizes collected information into categories. It then uses an LLM to select a category that matches the user’s interests and to retrieve information from that category that is absent from the user’s knowledge. Automatic and human evaluations demonstrate that PT-RAG retrieves new and interesting information more accurately and that Topic-Initiator significantly enhances users’ willingness to converse compared to existing methods.
2025
RaPSIL: A Preference-Guided Interview Agent for Rapport-Aware Self-Disclosure
Kenta Hama | Atsushi Otsuka | Masahiro Mizukami | Hiroaki Sugiyama | Makoto Naka
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Facilitating self-disclosure without causing discomfort remains a difficult task—especially for AI systems. In real-world applications such as career counseling, wellbeing support, and onboarding interviews, eliciting personal information like concerns, goals, and personality traits is essential. However, asking such questions directly often leads to discomfort and disengagement. We address this issue with RaPSIL (Rapport-aware Preference-guided Self-disclosure Interview Learner), a two-stage LLM-based system that fosters natural, engaging conversations to promote self-disclosure. In the first stage, RaPSIL selectively imitates interviewer utterances that have been evaluated by LLMs for both strategic effectiveness and social sensitivity. It leverages LLMs as multi-perspective judges in this selection process. In the second stage, it conducts self-play simulations, using the Reflexion framework to analyze failures and expand a database with both successful and problematic utterances. This dual learning process allows RaPSIL to go beyond simple imitation, improving its ability to handle sensitive topics naturally by learning from both successful and failed utterances. In a comprehensive evaluation with real users, RaPSIL outperformed baselines in enjoyability, warmth, and willingness to re-engage, while also capturing self-descriptions more accurately. Notably, its impression scores remained stable even during prolonged interactions, demonstrating its ability to balance rapport building with effective information elicitation. These results show that RaPSIL enables socially aware AI interviewers capable of eliciting sensitive personal information while maintaining user trust and comfort—an essential capability for real-world dialogue systems.
2024
Analysis of Sensation-transfer Dialogues in Motorsports
Takeru Isaka | Atsushi Otsuka | Iwaki Toshima
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Clarifying the effects of subjective ideas on group performance is essential for future dialogue systems to improve mutual understanding among humans and group creativity. However, little dialogue research has quantitatively analyzed how the quality and quantity of subjective information contained in dialogues affect group performance. We hypothesize that the more subjective information interlocutors exchange, the better the group performs in collaborative work. As a suitable case for verifying this hypothesis, we collected dialogues between drivers and engineers in motorsports as they decided how the car should be tuned. Our analysis suggests that the greater the amount of subjective information (which we define as “sensation”) in the driver’s utterances, the greater the race performance and the driver’s satisfaction with the car’s tuning. The results indicate that advancing dialogue research requires corpora of situations that demand high performance through collaboration among experts who have mastered their respective fields but come from different backgrounds.
2019
Multi-style Generative Reading Comprehension
Kyosuke Nishida | Itsumi Saito | Kosuke Nishida | Kazutoshi Shinoda | Atsushi Otsuka | Hisako Asano | Junji Tomita
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
This study tackles generative reading comprehension (RC), which consists of answering questions based on textual evidence and natural language generation (NLG). We propose a multi-style abstractive summarization model for question answering, called Masque. The proposed model has two key characteristics. First, unlike most studies on RC that have focused on extracting an answer span from the provided passages, our model instead focuses on generating a summary from the question and multiple passages. This serves to cover various answer styles required for real-world applications. Second, whereas previous studies built a specific model for each answer style because of the difficulty of acquiring one general model, our approach learns multi-style answers within a model to improve the NLG capability for all styles involved. This also enables our model to give an answer in the target style. Experiments show that our model achieves state-of-the-art performance on the Q&A task and the Q&A + NLG task of MS MARCO 2.1 and the summary task of NarrativeQA. We observe that the transfer of the style-independent NLG capability to the target style is the key to its success.
Answering while Summarizing: Multi-task Learning for Multi-hop QA with Evidence Extraction
Kosuke Nishida | Kyosuke Nishida | Masaaki Nagata | Atsushi Otsuka | Itsumi Saito | Hisako Asano | Junji Tomita
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Question answering (QA) over textual sources for purposes such as reading comprehension (RC) has attracted much attention. This study focuses on explainable multi-hop QA, which requires the system to return the answer together with evidence sentences by reasoning over and gathering disjoint pieces of the reference texts. We propose the Query Focused Extractor (QFE) model for evidence extraction and train it with multi-task learning alongside the QA model. QFE is inspired by extractive summarization models; whereas the existing method extracts each evidence sentence independently, QFE extracts evidence sentences sequentially using an RNN with an attention mechanism over the question sentence. This enables QFE to consider dependencies among the evidence sentences and to cover the important information in the question. Experimental results show that QFE with a simple RC baseline model achieves a state-of-the-art evidence extraction score on HotpotQA. Although designed for RC, it also achieves a state-of-the-art evidence extraction score on FEVER, a recognizing-textual-entailment task over a large textual database.
2015
Discourse Relation Recognition by Comparing Various Units of Sentence Expression with Recursive Neural Network
Atsushi Otsuka | Toru Hirano | Chiaki Miyazaki | Ryo Masumura | Ryuichiro Higashinaka | Toshiro Makino | Yoshihiro Matsuo
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation