International Workshop on Spoken Dialogue Systems Technology (2025)



pdf (full)
bib (full)
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology

pdf bib
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology
Maria Ines Torres | Yuki Matsuda | Zoraida Callejas | Arantza del Pozo | Luis Fernando D'Haro

pdf bib
Automatic Generation of Structured Domain Knowledge for Dialogue-based XAI Systems
Carolin Schindler | Isabel Feustel | Niklas Rach | Wolfgang Minker

Explanatory dialogue systems serve as an intuitive interface between non-expert users and explainable AI (XAI) systems. Interaction with this kind of system benefits especially from the integration of structured domain knowledge, e.g., by means of bipolar argumentation trees. So far, these domain-specific structures have needed to be created manually, thereby impairing the flexibility of the system with respect to the domain. We address this limitation by adapting an existing pipeline for topic-independent acquisition of argumentation trees in the field of persuasive, argumentative dialogue to the area of explanatory dialogue. This shift is achieved by a) introducing and investigating different formulations of auxiliary claims per feature of the explanation of the AI model, b) exploring the influence of pre-grouping the arguments with respect to the feature they address, c) suggesting adaptations to the pipeline's existing algorithm for obtaining a tree structure, and d) utilizing a new approach for determining the type of relationship between arguments. Through a step-wise expert evaluation in the Titanic survival domain, we identify the best-performing variant of our pipeline. With this variant, we conduct a user study comparing the automatically generated argumentation trees against their manually created counterparts in the Titanic survival and credit acquisition domains. This assessment of the suitability of the generated argumentation trees for later integration into dialogue-based XAI systems as domain knowledge yields promising results.
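
For readers unfamiliar with the target structure: a bipolar argumentation tree attaches each argument to its parent via a support or attack relation. The following minimal sketch (our own illustration with hypothetical Titanic-domain content, not the authors' code) shows one way to represent such a tree:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ArgumentNode:
    """Node of a bipolar argumentation tree (illustrative structure)."""
    text: str                           # the argument itself
    relation: Optional[str] = None      # "support" or "attack" w.r.t. the parent
    children: List["ArgumentNode"] = field(default_factory=list)

# Hypothetical Titanic-survival fragment:
root = ArgumentNode("Passenger X was likely to survive.")
root.children.append(ArgumentNode("Being female increased survival odds.", "support"))
root.children.append(ArgumentNode("Travelling in third class lowered them.", "attack"))
```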

pdf bib
Exploring the Impact of Modalities on Building Common Ground Using the Collaborative Scene Reconstruction Task
Yosuke Ujigawa | Asuka Shiotani | Masato Takizawa | Eisuke Midorikawa | Ryuichiro Higashinaka | Kazunori Takashio

To deepen our understanding of verbal and non-verbal modalities in establishing common ground, this study introduces a novel “collaborative scene reconstruction task.” In this task, pairs of participants, each provided with distinct image sets derived from the same video, work together to reconstruct the sequence of the original video. The level of agreement between the participants on the image order (quantified using Kendall’s rank correlation coefficient) serves as a measure of common ground construction. This approach enables analysis of how the various modalities contribute to the construction of common ground. A corpus comprising 40 dialogues from 20 participants was collected and analyzed. The findings suggest that specific gestures play a significant role in fostering common ground, offering valuable insights for the development of dialogue systems that leverage multimodal information to support users in constructing common ground.
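
To make the agreement measure concrete, the following sketch computes Kendall's rank correlation coefficient for two hypothetical image orderings (illustrative values, not data from the corpus):

```python
from scipy.stats import kendalltau

# Two participants' proposed orderings of the same six images.
participant_a = [1, 2, 3, 4, 5, 6]
participant_b = [1, 3, 2, 4, 6, 5]

tau, p_value = kendalltau(participant_a, participant_b)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
# A tau close to 1.0 indicates the pair converged on a shared ordering,
# interpreted here as successful common-ground construction.
```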

pdf bib
Design, Generation and Evaluation of a Synthetic Dialogue Dataset for Contextually Aware Chatbots in Art Museums
Inass Rachidi | Anas Ezzakri | Jaime Bellver-Soler | Luis Fernando D’Haro

This paper presents the design, synthetic generation, and automated evaluation of ArtGenEval-GPT++, an advanced dataset for training and fine-tuning conversational agents with artificial awareness capabilities targeted at the art domain. Building on the foundation of a previously released dataset (ArtGenEval-GPT), the new version introduces enhancements for greater personalization (e.g., gender, ethnicity, age, and knowledge) while addressing prior limitations, including low-quality dialogues and hallucinations. The dataset comprises approximately 12,500 dyadic, multi-turn dialogues generated using state-of-the-art large language models (LLMs). These dialogues span diverse museum scenarios, incorporating varied visitor profiles, emotional states, interruptions, and chatbot behaviors. Objective evaluations confirm the dataset’s quality and contextual coherence. Ethical considerations, including biases and hallucinations, are analyzed, with proposed directions for improving the dataset’s utility. This work contributes to the development of personalized, context-aware conversational agents capable of navigating complex, real-world environments, such as museums, to enhance visitor engagement and satisfaction.

pdf bib
A Voice-Controlled Dialogue System for NPC Interaction using Large Language Models
Milan Wevelsiep | Nicholas Thomas Walker | Nicolas Wagner | Stefan Ultes

This paper explores the integration of voice-controlled dialogue systems in narrative-driven video games, addressing the limitations of existing approaches. We propose a hybrid interface that allows players to freely paraphrase predefined dialogue options, combining player expressiveness with narrative cohesion. The prototype was developed in Unity, and a large language model was used to map the transcribed voice input to existing dialogue options. The approach was evaluated in a user study (n=14) that compared the hybrid interface to traditional point-and-click methods. Results indicate that the proposed interface enhances players’ enjoyment and perceived freedom while maintaining narrative consistency. The findings provide insights into the design of scalable and engaging voice-controlled systems for interactive storytelling. Future research should focus on reducing latency and refining language model accuracy to further improve user experience and immersion.

pdf bib
A Dialogue System for Semi-Structured Interviews by LLMs and its Evaluation on Persona Information Collection
Ryo Hasegawa | Yijie Hua | Takehito Utsuro | Ekai Hashimoto | Mikio Nakano | Shun Shiramatsu

In this paper, we propose a dialogue control framework using large language models for semi-structured interviews. Specifically, large language models are used to generate the interviewer’s utterances and to make conditional branching decisions based on an understanding of the interviewee’s responses. The framework enables flexible dialogue control in interview conversations by generating and updating slots and values according to interviewee answers. More importantly, through prompt tuning of the LLMs, we devised a scheme for accumulating the list of generated slots as the number of interviewees grows over the course of the semi-structured interviews. Evaluation results showed that the proposed approach of accumulating the list of generated slots throughout the semi-structured interviews outperforms the baseline without slot accumulation in terms of the number of persona attributes and values collected.

pdf bib
Exploring Personality-Aware Interactions in Salesperson Dialogue Agents
Sijia Cheng | Wen Yu Chang | Yun-Nung Chen

The integration of dialogue agents into the sales domain requires a deep understanding of how these systems interact with users possessing diverse personas. This study explores the influence of user personas, defined using the Myers-Briggs Type Indicator (MBTI), on the interaction quality and performance of sales-oriented dialogue agents. Through large-scale testing and analysis, we assess the pre-trained agent’s effectiveness, adaptability, and personalization capabilities across a wide range of MBTI-defined user types. Our findings reveal significant patterns in interaction dynamics, task completion rates, and dialogue naturalness, underscoring the future potential for dialogue agents to refine their strategies to better align with varying personality traits. This work not only provides actionable insights for building more adaptive and user-centric conversational systems in the sales domain but also contributes broadly to the field by releasing persona-defined user simulators. These simulators, unconstrained by domain, offer valuable tools for future research and demonstrate the potential for scaling personalized dialogue systems across diverse applications.

pdf bib
ReSpAct: Harmonizing Reasoning, Speaking, and Acting Towards Building Large Language Model-Based Conversational AI Agents
Vardhan Dongre | Xiaocheng Yang | Emre Can Acikgoz | Suvodip Dey | Gokhan Tur | Dilek Hakkani-Tur

Large language model (LLM)-based agents have been increasingly used to interact with external environments (e.g., games, APIs, etc.) and solve tasks. However, current frameworks do not enable these agents to work with users and interact with them to align on the details of their tasks and reach user-defined goals; instead, in ambiguous situations, these agents may make decisions based on assumptions. This work introduces ReSpAct (Reason, Speak, and Act), a novel framework that synergistically combines the essential skills for building task-oriented “conversational” agents. ReSpAct addresses this need by expanding on the ReAct approach: it enables agents to interpret user instructions, reason about complex tasks, execute appropriate actions, and engage in dynamic dialogue to seek guidance, clarify ambiguities, understand user preferences, resolve problems, and use the intermediate feedback and responses of users to update their plans. We evaluated ReSpAct with GPT-4 in environments that support user interaction, such as task-oriented dialogue (MultiWOZ) and interactive decision-making (AlfWorld, WebShop). ReSpAct is flexible enough to incorporate dynamic user feedback and addresses prevalent issues like error propagation and agents getting stuck in reasoning loops. This results in more interpretable, human-like task-solving trajectories than baselines relying solely on reasoning traces. In the two interactive decision-making benchmarks, AlfWorld and WebShop, ReSpAct outperforms the strong reasoning-only method ReAct by an absolute success rate of 6% and 4%, respectively. In the task-oriented dialogue benchmark MultiWOZ, ReSpAct improves Inform and Success scores by 5.5% and 3%, respectively.
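
A minimal sketch of a Reason-Speak-Act style control loop is given below; it is our illustration of the idea, not the authors' implementation, and `llm`, `env`, and `user` are hypothetical stand-ins for a chat model, an interactive environment (e.g., AlfWorld), and a human user:

```python
# Minimal sketch of a Reason-Speak-Act style loop (illustrative only).

def respact_episode(llm, env, user, max_steps=20):
    observation = env.reset()
    history = [f"Observation: {observation}"]
    for _ in range(max_steps):
        # The model reasons, then chooses to speak to the user or act.
        decision = llm("\n".join(history) +
                       "\nThought, then one of speak[...] / act[...]: ")
        if decision.startswith("speak["):
            reply = user(decision[len("speak["):-1])   # clarify, ask preferences
            history.append(f"User: {reply}")
        elif decision.startswith("act["):
            observation, done = env.step(decision[len("act["):-1])
            history.append(f"Observation: {observation}")
            if done:
                break
    return history
```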

pdf bib
Examining Older Adults’ Motivation for Interacting with Health-Monitoring Conversational Systems Through Field Trials
Mariko Yoshida | Ryo Hori | Yuki Zenimoto | Mayu Urata | Mamoru Endo | Takami Yasuda | Aiko Inoue | Takahiro Hayashi | Ryuichiro Higashinaka

When assessing the health of older adults, oral interviews and written questionnaires are commonly used. However, these methods are time-consuming in terms of both execution and data aggregation. To address this issue, systems utilizing generative AI for health information collection through conversation have been developed and implemented. Despite these advancements, the motivation of older adults to consistently engage with such systems in their daily lives has not been thoroughly explored. In this study, we developed a smart-speaker extension that uses generative AI to monitor health status through casual conversations with older adult users. The system was tested in a two-week home trial with older adult participants. We conducted post-trial questionnaires and interviews, and we analyzed conversation log data. The results revealed that older adult users enjoy interacting with such systems and can integrate their use into their daily routines. Customized notifications through text messages encouraged system use, and the system’s ability to refer to previous conversations and address users by name was identified as a key factor motivating continued use.

pdf bib
Balancing Knowledge Delivery and Emotional Comfort in Healthcare Conversational Systems
Shang-Chi Tsai | Yun-Nung Chen

With the advancement of large language models, many dialogue systems are now capable of providing reasonable and informative responses to patients’ medical conditions. However, when patients consult their doctor, they may experience negative emotions due to the severity and urgency of their situation. If the model can provide appropriate comfort and empathy based on the patient’s negative emotions while answering medical questions, it will likely offer a more reassuring experience during the medical consultation process. To address this issue, our paper explores the balance between knowledge sharing and emotional support in the healthcare dialogue process. We utilize a large language model to rewrite a real-world interactive medical dialogue dataset, generating patient queries with negative emotions and corresponding medical responses aimed at soothing the patient’s emotions while addressing their concerns. The modified data serves to refine the latest large language models with various fine-tuning methods, enabling them to accurately provide sentences with both emotional reassurance and constructive suggestions in response to patients’ questions. Our experimental results demonstrate that, compared to the original LLM, our methodology significantly enhances the model’s ability to generate emotional responses while maintaining its original capability to provide accurate knowledge-based answers.

pdf bib
Context or Retrieval? Evaluating RAG Methods for Art and Museum QA System
Samuel Ramos-Varela | Jaime Bellver-Soler | Marcos Estecha-Garitagoitia | Luis Fernando D’Haro

Recent studies suggest that increasing the context window of language models could outperform retrieval-augmented generation (RAG) methods in certain tasks. However, in domains such as art and museums, where information is inherently multimodal, combining images and detailed textual descriptions, this assumption needs closer examination. To explore this, we compare RAG techniques with direct large-context input approaches for answering questions about artworks. Using a dataset of painting images paired with textual information, we develop a synthetic database of question-answer (QA) pairs for evaluating these methods. The focus is on assessing the efficiency and accuracy of RAG in retrieving and using relevant information compared to passing the entire textual context to a language model. Additionally, we experiment with various strategies for segmenting and retrieving text to optimise the RAG pipeline. The results aim to clarify the trade-offs between these approaches and provide valuable insights for interactive systems designed for art and museum contexts.
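
To make the comparison concrete, the sketch below builds a prompt under either condition, using a toy cosine-similarity retriever (random vectors stand in for a real embedding model; all names are our own assumptions):

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=2):
    """Toy retriever: rank chunks by cosine similarity to the query."""
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

def build_prompt(question, chunks, chunk_vecs, query_vec, mode="rag"):
    if mode == "rag":        # RAG: only the retrieved passages enter the prompt
        context = "\n".join(top_k_chunks(query_vec, chunk_vecs, chunks))
    else:                    # long-context: the entire text enters the prompt
        context = "\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

chunks = ["The painting dates from 1907.", "It was restored in 1998.",
          "The artist was born in Malaga."]
rng = np.random.default_rng(0)                     # placeholder embeddings
chunk_vecs, query_vec = rng.normal(size=(3, 8)), rng.normal(size=8)
print(build_prompt("When was it restored?", chunks, chunk_vecs, query_vec))
```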

pdf bib
Paralinguistic Attitude Recognition for Spoken Dialogue Systems
Kouki Miyazawa | Zhi Zhu | Yoshinao Sato

Although paralinguistic information is critical for human communication, most spoken dialogue systems ignore such information, hindering natural communication between humans and machines. This study addresses the recognition of paralinguistic attitudes in user speech. Specifically, we focus on four essential attitudes for generating an appropriate system response, namely agreement, disagreement, questions, and stalling. The proposed model can help a dialogue system better understand what the user is trying to convey. In our experiments, we trained and evaluated a model that classified paralinguistic attitudes on a reading-speech dataset without using linguistic information. The proposed model outperformed human perception. Furthermore, experimental results indicate that speech enhancement alleviates the degradation of model performance caused by background noise, whereas reverberation remains a challenge.

pdf bib
Exploring ReAct Prompting for Task-Oriented Dialogue: Insights and Shortcomings
Michelle Elizabeth | Morgan Veyret | Miguel Couceiro | Ondrej Dusek | Lina M. Rojas Barahona

Large language models (LLMs) gained immense popularity due to their impressive capabilities in unstructured conversations. Empowering LLMs with advanced prompting strategies such as reasoning and acting (ReAct) (Yao et al., 2022) has shown promise in solving complex tasks traditionally requiring reinforcement learning. In this work, we apply the ReAct strategy to guide LLMs performing task-oriented dialogue (TOD). We evaluate ReAct-based LLMs (ReAct-LLMs) both in simulation and with real users. While ReAct-LLMs severely underperform state-of-the-art approaches on success rate in simulation, this difference becomes less pronounced in human evaluation. Moreover, compared to the baseline, humans report higher subjective satisfaction with ReAct-LLM despite its lower success rate, most likely thanks to its natural and confidently phrased responses.

pdf bib
Design of a conversational agent to support people on suicide risk
Mario Manso Vázquez | José Manuel Ramírez Sánchez | Carmen García-Mateo | Laura Docío-Fernández | Manuel José Fernández-Iglesias | Beatriz Gómez-Gómez | Beatriz Pinal | Antia Brañas | Alejandro García-Caballero

In this paper, we present a core component of the VisIA project: a conversational agent designed to detect suicide risk factors during real-time chat interactions. By adhering to clinical guidelines and the state-of-the-art theories of suicide, the agent aims to provide a scalable and effective approach to identifying individuals at risk. Preliminary results demonstrate the feasibility and potential of conversational agents in enhancing suicide risk detection.

pdf bib
Optimizing RAG: Classifying Queries for Dynamic Processing
Kabir Olawore | Michael McTear | Yaxin Bi | David Griol

In Retrieval-Augmented Generation (RAG) systems, efficient information retrieval is crucial for enhancing user experience and satisfaction, as response times and computational demands significantly impact performance. RAG can be unnecessarily resource-intensive for frequently asked questions (FAQs) and simple questions. In this paper we introduce an approach that classifies user questions so that simple queries bypass RAG processing. Evaluation results show that our proposal reduces latency and improves response efficiency compared to systems relying solely on RAG.
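
The routing step can be as simple as the following sketch (a trivial lookup classifier with hypothetical names; the paper's categorization may be learned):

```python
# Illustrative query router: FAQs are answered from a cache,
# and only novel queries go through the full RAG pipeline.
FAQ_CACHE = {
    "what are your opening hours?": "We are open 9am-5pm, Monday to Friday.",
}

def answer(query: str, rag_pipeline) -> str:
    key = query.strip().lower()
    if key in FAQ_CACHE:          # simple query: skip retrieval entirely
        return FAQ_CACHE[key]
    return rag_pipeline(query)    # complex query: full RAG processing

# Usage with a stubbed RAG pipeline:
print(answer("What are your opening hours?", rag_pipeline=lambda q: "..."))
```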

pdf bib
Enhancing Proactive Dialogue Systems Through Self-Learning of Reasoning and Action-Planning
Ryosuke Ito | Tetsuya Takiguchi | Yasuo Ariki

A proactive dialogue system refers to a conversational system designed to guide the direction of a conversation in order to achieve pre-defined targets or fulfill specific goals. Recent studies have shown that Proactive Chain-of-Thought, which guides the system to explicitly think through intermediate reasoning and action-planning steps toward a conversational goal before generating a response, can significantly enhance the performance of proactive dialogue systems. However, these improvements primarily focus on prompt-based control, while the potential of fine-tuning Proactive-CoT remains largely unexplored. Furthermore, fine-tuning Proactive-CoT requires manual annotation of reasoning processes and action plans, which incurs significant time and cost. In this study, we propose a novel approach for automatically annotating reasoning processes and action plans through self-learning. This method enables fully automated annotation, significantly reducing the time and cost associated with manual annotation. Experimental results show that models trained using our proposed method outperform those trained with other fine-tuning approaches. These findings highlight the potential of self-learning approaches to advance the development of more robust and efficient proactive dialogue systems.

pdf bib
TrustBoost: Balancing flexibility and compliance in conversational AI systems
David Griol | Zoraida Callejas | Manuel Gil-Martín | Ksenia Kharitonova | Juan Manuel Montero-Martínez | David Pérez Fernández | Fernando Fernández-Martínez

Conversational AI (ConvAI) systems are gaining importance as an alternative for more natural interaction with digital services. In this context, Large Language Models (LLMs) have opened new possibilities for less restricted interaction and richer natural language understanding. However, despite their advanced capabilities, LLMs can pose accuracy and reliability problems, as they sometimes generate factually incorrect or contextually inappropriate content that does not fulfill the regulations or business rules of a specific application domain. In addition, they still lack the capability to adjust to users’ needs and preferences and to show emotional awareness while concurrently adhering to the regulations and limitations of their designated domain. In this paper we present the TrustBoost project, which addresses the challenge of improving the trustworthiness of ConvAI along two dimensions: cognition (adaptability, flexibility, compliance, and performance) and affectivity (familiarity, emotional dimension, and perception). The project runs from September 2024 to December 2027.

pdf bib
ScriptBoard: Designing modern spoken dialogue systems through visual programming
Divesh Lala | Mikey Elmers | Koji Inoue | Zi Haur Pang | Keiko Ochi | Tatsuya Kawahara

Implementation of spoken dialogue systems can be time-consuming, in particular for people who are not familiar with managing dialogue states and turn-taking in real-time. A GUI-based system where the user can quickly understand the dialogue flow allows rapid prototyping of experimental and real-world systems. In this demonstration we present ScriptBoard, a tool for creating dialogue scenarios which is independent of any specific robot platform. ScriptBoard has been designed with multi-party scenarios in mind and makes use of large language models to both generate dialogue and make decisions about the dialogue flow. This program promotes both flexibility and reproducibility in spoken dialogue research and provides everyone the opportunity to design and test their own dialogue scenarios.

pdf bib
D4AC: A Tool for Developing Multimodal Dialogue Systems without Coding
Mikio Nakano | Ryuichiro Higashinaka

To enable the broader application of dialogue system technology across various fields, it is beneficial to empower individuals with limited programming experience to build dialogue systems. Domain experts in fields where dialogue system technology is highly relevant may not necessarily possess expertise in information technology. This paper presents D4AC, which works as a client for text-based dialogue servers. By combining D4AC with a no-code tool for developing text-based dialogue servers, it is possible to build multimodal dialogue systems without coding. These systems can adapt to the user’s age, gender, emotions, and engagement levels obtained from their facial images. D4AC can be installed, launched, and configured without technical knowledge. It was used in student projects at a university, which suggested its effectiveness.

pdf bib
A Multilingual Speech-Based Driver Assistant for Basque and English
Antonio Aparicio Akcharov | Asier López Zorrilla | Juan Camilo Vásquez Correa | Oscar Montserrat | José Maria Echevarría | Begoña Arrate | Joxean Zapirain | Mikel deVelasco Vázquez | Santiago Andrés Moreno-Acevedo | Ander González-Docasal | Maria Ines Torres | Aitor Álvarez

This demo paper presents a prototype of a multilingual, speech-based driver assistant designed to support both English and Basque. The inclusion of Basque, a low-resource language with limited domain-specific training data, marks a significant contribution, as publicly available AI models, including Large Language Models, often underperform for such languages compared to high-resource languages like English. Despite these challenges, our system demonstrates robust performance, successfully understanding user queries and delivering rapid responses in a demanding environment: a car simulator. Notably, the system achieves comparable performance in English and Basque, showcasing its effectiveness in addressing linguistic disparities in AI-driven applications. A demo of our prototype will be available at the workshop.

pdf bib
Intimebot – A Dialogue Agent for Timekeeping Support
Shoaib Khan | Alex Samani | Rafael Banchs

This demo paper presents intimebot, an AI-powered solution designed to assist with timekeeping. Timekeeping is a fundamental yet overwhelming and complex task in many professional services practices. Our intimebot demo demonstrates how artificial intelligence can be used to implement a more efficient timekeeping process within a firm. Based on brief work descriptions provided by the timekeeper, intimebot is able to (1) predict the relevant combination of client, matter, and phase, (2) estimate the work effort in hours, and (3) rewrite and normalize the provided work description into a compliant narrative. This can save a significant amount of time for busy professionals while ensuring compliance with terms of business and best practices.

pdf bib
A Chatbot for Providing Suicide Prevention Information in Spanish
Pablo Ascorbe | María S. Campos | César Domínguez | Jónathan Heras | Magdalena Pérez | Ana Rosa Terroba-Reinares

Suicide has been identified by the World Health Organization as one of the most serious health problems affecting people. Among the interventions proposed to help people suffering from this problem and their relatives, the dissemination of accurate information is crucial. To achieve this goal, we have developed PrevenIA, a chatbot that provides reliable information on suicide prevention. The chatbot consists of a retrieval-augmented module for answering users’ queries based on a curated list of documents. In addition, it includes several models to avoid undesirable behaviours. The system has been validated by specialists and is currently being evaluated with different populations. Thanks to this project, reliable information on suicide will be disseminated in an accessible and understandable form.

pdf bib
LAMIA: An LLM Approach for Task-Oriented Dialogue Systems in Industry 5.0
Cristina Fernandez | Izaskun Fernandez | Cristina Aceta

Human-Machine Interaction (HMI) plays an important role in Industry 5.0, improving worker well-being by automating repetitive tasks and enabling seamless collaboration between humans and intelligent systems. In this context, Task-Oriented Dialogue (TOD) systems are a commonly used approach to enable natural communication in these settings, and have traditionally been developed using rule-based approaches. However, the revolution of Large Language Models (LLMs) is changing how dialogue systems are developed, removing the need to rely on tedious, rigid handcrafted rules. Despite their popularity, their application in industrial contexts remains underexplored and requires addressing challenges such as hallucinations, lack of domain-specific data, high training costs, and limited adaptability. To explore the contribution of LLMs in the industrial field, this work presents LAMIA, a task-oriented dialogue system for industrial scenarios that leverages LLMs through prompt tuning. The system has been adapted and evaluated for a bin-picking use case using GPT-3.5 Turbo, proving to be an intuitive method for addressing new use cases in Industry 5.0.

pdf bib
Conversational Tutoring in VR Training: The Role of Game Context and State Variables
Maia Aguirre | Ariane Méndez | Aitor García-Pablos | Montse Cuadros | Arantza del Pozo | Oier Lopez de Lacalle | Ander Salaberria | Jeremy Barnes | Pablo Martínez | Muhammad Zeshan Afzal

Virtual Reality (VR) training provides safe, cost-effective engagement with lifelike scenarios but lacks intuitive communication between users and the virtual environment. This study investigates the use of Large Language Models (LLMs) as conversational tutors in VR health and safety training, examining the impact of game context and state variables on LLM-generated answers in zero- and few-shot settings. Results demonstrate that incorporating both game context and state information significantly improves answer accuracy, with human evaluations showing gains of up to 0.26 points in zero-shot and 0.18 points in few-shot settings on a 0-1 scale.

pdf bib
A Methodology for Identifying Evaluation Items for Practical Dialogue Systems Based on Business-Dialogue System Alignment Models
Mikio Nakano | Hironori Takeuchi | Kazunori Komatani

This paper proposes a methodology for identifying evaluation items for practical dialogue systems. Traditionally, user satisfaction and user experiences have been the primary metrics for evaluating dialogue systems. However, there are various other evaluation items to consider when developing and operating practical dialogue systems, and such evaluation items are expected to lead to new research topics. So far, there has been no methodology for identifying these evaluation items. We propose identifying evaluation items based on business-dialogue system alignment models, which are applications of business-IT alignment models used in the development and operation of practical IT systems. We also present a generic model that facilitates the construction of a business-dialogue system alignment model for each dialogue system.

pdf bib
Speech-Controlled Smart Speaker for Accurate, Real-Time Health and Care Record Management
Jonathan E. Carrick | Nina Dethlefs | Lisa Greaves | Venkata M. V. Gunturi | Rameez Raja Kureshi | Yongqiang Cheng

To help alleviate the pressures felt by care workers, we have begun new research into improving the efficiency of care plan management by advancing recent developments in automatic speech recognition. Our novel approach adapts off-the-shelf tools in a purpose-built application for the speech domain, addressing challenges of accent adaptation, real-time processing, and speech hallucinations. We augment the speech-recognition scope of OpenAI’s Whisper model through fine-tuning, reducing word error rates (WERs) from 16.8 to 1.0 on a range of British dialects. Addressing the speech-hallucination side effect of adapting to real-time recognition by enforcing a signal-to-noise ratio threshold and audio stream checks, we achieve a WER of 5.1, compared to 14.9 with Whisper’s original model. These ongoing research efforts tackle challenges that must be met to build the speech-control basis for a custom smart-speaker system that is both accurate and timely.
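
One way to realize the signal-to-noise ratio gate mentioned above is sketched below (a rough illustration under our own assumptions, not the authors' code): frames whose estimated SNR falls below a threshold are never sent to the recognizer, avoiding transcription of silence or noise that can trigger hallucinated text.

```python
import numpy as np

def snr_db(frame: np.ndarray, noise_floor_power: float) -> float:
    """Estimate a frame's SNR in dB against a measured noise floor."""
    signal_power = float(np.mean(frame.astype(np.float64) ** 2)) + 1e-12
    return 10.0 * np.log10(signal_power / (noise_floor_power + 1e-12))

def should_transcribe(frame: np.ndarray, noise_floor_power: float,
                      threshold_db: float = 10.0) -> bool:
    # Only frames with sufficient SNR are passed on to the ASR model.
    return snr_db(frame, noise_floor_power) >= threshold_db
```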

pdf bib
Analysis of Voice Activity Detection Errors in API-based Streaming ASR for Human-Robot Dialogue
Kenta Yamamoto | Ryu Takeda | Kazunori Komatani

In human-robot dialogue systems, streaming automatic speech recognition (ASR) services (e.g., Google ASR) are often utilized, with the microphone positioned close to the robot’s loudspeaker. Under these conditions, both the robot’s and the user’s utterances are captured, resulting in frequent failures to detect user speech. This study analyzes voice activity detection (VAD) errors by comparing results from such streaming ASR to those from standalone VAD models. Experiments conducted on three distinct dialogue datasets showed that streaming ASR tends to ignore user utterances immediately following system utterances. We discuss the underlying causes of these VAD errors and provide recommendations for improving VAD performance in human-robot dialogue.

pdf bib
A Survey of Recent Advances on Turn-taking Modeling in Spoken Dialogue Systems
Galo Castillo-López | Gael de Chalendar | Nasredine Semmar

The rapid growth of dialogue systems adoption to serve humans in daily tasks has increased the realism expected from these systems. One trait of realism is the way speaking agents take their turns. We provide here a review of recent methods on turn-taking modeling and thoroughly describe the corpora used in these studies. We observe that 72% of the reviewed works in this survey do not compare their methods with previous efforts. We argue that one of the challenges in the field is the lack of well-established benchmarks to monitor progress. This work aims to provide the community with a better understanding of the current state of research around turn-taking modeling and future directions to build more realistic spoken conversational agents.

pdf bib
Integrating Respiration into Voice Activity Projection for Enhancing Turn-taking Performance
Takao Obi | Kotaro Funakoshi

Voice Activity Projection (VAP) models predict upcoming voice activities on a continuous timescale, enabling more nuanced turn-taking behaviors in spoken dialogue systems. Although previous studies have shown robust performance with audio-based VAP, the potential of incorporating additional physiological information, such as respiration, remains relatively unexplored. In this paper, we investigate whether respiratory information can enhance VAP performance in turn-taking. To this end, we collected Japanese dialogue data with synchronized audio and respiratory waveforms, and then we integrated the respiratory information into the VAP model. Our results showed that the VAP model combining audio and respiratory information had better performance than the audio-only model. This finding underscores the potential for improving the turn-taking performance of VAP by incorporating respiration.

pdf bib
DSLCMM: A Multimodal Human-Machine Dialogue Corpus Built through Competitions
Ryuichiro Higashinaka | Tetsuro Takahashi | Shinya Iizuka | Sota Horiuchi | Michimasa Inaba | Zhiyang Qi | Yuta Sasaki | Kotaro Funakoshi | Shoji Moriya | Shiki Sato | Takashi Minato | Kurima Sakai | Tomo Funayama | Masato Komuro | Hiroyuki Nishikawa | Ryosaku Makino | Hirofumi Kikuchi | Mayumi Usami

A corpus of dialogues between multimodal systems and humans is indispensable for the development and improvement of such systems. However, there is a shortage of human-machine multimodal dialogue datasets, which hinders the widespread deployment of these systems in society. To address this issue, we construct a Japanese multimodal human-machine dialogue corpus, DSLCMM, by collecting and organizing data from the Dialogue System Live Competitions (DSLCs). This paper details the procedure for constructing the corpus and presents our analysis of the relationship between various dialogue features and evaluation scores provided by users.

pdf bib
Cutting Through Overload: Efficient Token Dropping for Speech Emotion Recognition in Multimodal Large Language Models
Jaime Bellver-Soler | Mario Rodriguez-Cantelar | Ricardo Córdoba | Luis Fernando D’Haro

Recent developments in Multimodal Large Language Models (MLLMs) have provided novel insights into Speech Emotion Recognition (SER). However, combining high-dimensional speech signals with textual tokens can lead to a rapid growth in input tokens, increasing computational costs and inference times. This “token overload” also risks overshadowing essential textual cues, affecting the reasoning capabilities of the language model and diluting emotional information crucial to accurate SER. In this paper, we explore different token-dropping methods that mitigate excessive token counts while preserving both emotional nuances and the core linguistic capabilities of the model. Specifically, we compare various pooling approaches to produce a compact representation. Our preliminary findings suggest that these techniques can reduce computational costs without decreasing SER accuracy.
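
As an illustration of the pooling idea, the sketch below mean-pools speech embeddings over fixed windows before they join the text tokens, shrinking the speech token count by the pooling factor (a generic sketch; the paper compares several such strategies):

```python
import torch

def mean_pool_speech_tokens(speech_embs: torch.Tensor, window: int = 4) -> torch.Tensor:
    """Mean-pool (batch, seq, dim) speech embeddings over non-overlapping windows."""
    batch, seq, dim = speech_embs.shape
    trimmed = seq - (seq % window)                 # drop the ragged tail
    pooled = speech_embs[:, :trimmed].reshape(batch, trimmed // window, window, dim)
    return pooled.mean(dim=2)

x = torch.randn(1, 300, 1024)                      # 300 speech frames ...
print(mean_pool_speech_tokens(x).shape)            # ... become 75 tokens: [1, 75, 1024]
```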

pdf bib
Integrating Conversational Entities and Dialogue Histories with Knowledge Graphs and Generative AI
Graham Wilcock | Kristiina Jokinen

Existing methods for storing dialogue history and for tracking mentioned entities in spoken dialogues usually handle these tasks separately. Recent advances in knowledge graphs and generative AI make it possible to integrate them in a framework with a uniform representation for dialogue management. This may help to build more natural and grounded dialogue models that can reduce misunderstanding and lead to more reliable dialogue-based interactions with AI agents. The paper describes ongoing work on this approach.

pdf bib
Enabling Trait-based Personality Simulation in Conversational LLM Agents: Case Study of Customer Assistance in French
Ahmed Njifenjou | Virgile Sucal | Bassam Jabaian | Fabrice Lefèvre

Among the numerous models developed to represent the multifaceted complexity of human personality, particularly in psychology, the Big Five (commonly referred to as ‘OCEAN’, an acronym of its five traits) stands out as a widely used framework. Although personalized chatbots have incorporated this model, existing approaches, such as focusing on individual traits or binary combinations, may not capture the full diversity of human personality. In this study, we propose a five-dimensional vector representation, where each axis corresponds to the degree of presence of an OCEAN trait on a continuous scale from 0 to 1. This representation is designed to enable greater versatility in modeling personality. Application to customer assistance scenarios in French demonstrates that, in both human-bot and bot-bot conversations, the assigned personality vectors are distinguishable by humans and by LLMs acting as judges. Their subjective evaluations also confirm the measurable impact of the assigned personality on user experience, agent efficiency, and conversation quality.
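
A minimal sketch of this five-dimensional representation, with a hypothetical verbalization into a system prompt (the paper's actual template may differ):

```python
from dataclasses import dataclass

def _level(v: float) -> str:
    return "low" if v < 0.33 else ("moderate" if v < 0.66 else "high")

@dataclass
class OceanVector:
    """Big Five traits, each on a continuous 0-1 scale."""
    openness: float
    conscientiousness: float
    extraversion: float
    agreeableness: float
    neuroticism: float

    def to_system_prompt(self) -> str:
        # Hypothetical verbalization of the trait vector.
        traits = ", ".join(f"{_level(v)} {name}" for name, v in vars(self).items())
        return f"You are a customer assistant with {traits}."

print(OceanVector(0.9, 0.7, 0.2, 0.8, 0.1).to_system_prompt())
```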

pdf bib
Developing Classifiers for Affirmative and Negative User Responses with Limited Target Domain Data for Dialogue System Development Tools
Yunosuke Kubo | Ryo Yanagimoto | Mikio Nakano | Kenta Yamamoto | Ryu Takeda | Kazunori Komatani

We aim to develop a library for classifying affirmative and negative user responses, intended for integration into a dialogue system development toolkit. Such a library is expected to perform well even with minimal annotated target-domain data, addressing the practical challenge of preparing large datasets for each target domain. This short paper compares several approaches under conditions where little or no annotated data is available in the target domain. One approach involves fine-tuning a pre-trained BERT model, while the other utilizes a GPT API for zero-shot or few-shot learning. Since these approaches differ in execution speed, development effort, and execution cost, in addition to performance, the results serve as a basis for discussing the configuration best suited to specific requirements. Additionally, we have released the training data and the fine-tuned BERT model for Japanese affirmative/negative classification.
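
The zero-shot variant can be as simple as the sketch below, where `llm` is a hypothetical callable wrapping a chat-completion API (the released fine-tuned BERT model would play this role in the library):

```python
# Illustrative zero-shot affirmative/negative classification.
PROMPT = (
    "The system asked: {question}\n"
    "The user replied: {reply}\n"
    "Is the reply affirmative or negative? Answer with one word."
)

def classify_response(llm, question: str, reply: str) -> str:
    answer = llm(PROMPT.format(question=question, reply=reply))
    return "affirmative" if "affirm" in answer.lower() else "negative"
```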

pdf bib
Why Do We Laugh? Annotation and Taxonomy Generation for Laughable Contexts in Spontaneous Text Conversation
Koji Inoue | Mikey Elmers | Divesh Lala | Tatsuya Kawahara

Laughter serves as a multifaceted communicative signal in human interaction, yet its identification within dialogue presents a significant challenge for conversational AI systems. This study addresses this challenge by annotating laughable contexts in Japanese spontaneous text conversation data and developing a taxonomy to classify the underlying reasons for such contexts. Initially, multiple annotators manually labeled laughable contexts using a binary decision (laughable or non-laughable). Subsequently, an LLM was used to generate explanations for the binary annotations of laughable contexts, which were then categorized into a taxonomy comprising ten categories, including “Empathy and Affinity” and “Humor and Surprise,” highlighting the diverse range of laughter-inducing scenarios. The study also evaluated GPT-4o’s performance in recognizing the majority labels of laughable contexts, achieving an F1 score of 43.14%. These findings contribute to the advancement of conversational AI by establishing a foundation for more nuanced recognition and generation of laughter, ultimately fostering more natural and engaging human-AI interactions.

pdf bib
Adaptive Psychological Distance in Japanese Spoken Human-Agent Dialogue: A Politeness-Based Management Model
Akira Inaba | Emmanuel Ayedoun | Masataka Tokumaru

While existing spoken dialogue systems can adapt various aspects of interaction, systematic management of psychological distance through verbal politeness remains underexplored. Current approaches typically maintain fixed levels of formality and social distance, limiting naturalness in long-term human-agent interactions. We propose a novel dialogue management model that dynamically adjusts verbal politeness levels in Japanese based on user preferences. We evaluated the model using two pseudo-users with distinct distance preferences in daily conversations. Human observers (n=20) assessed the interactions, with 70% successfully distinguishing the intended social distance variations. The results demonstrate that systematic modulation of verbal politeness can create perceptibly different levels of psychological distance in spoken dialogue, with implications for culturally appropriate human-agent interaction in Japanese contexts.

pdf bib
An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue
Koji Inoue | Divesh Lala | Mikey Elmers | Keiko Ochi | Tatsuya Kawahara

Handling multi-party dialogues represents a significant step for advancing spoken dialogue systems, necessitating the development of tasks specific to multi-party interactions. To address this challenge, we are constructing a multi-modal multi-party dialogue corpus of triadic (three-participant) discussions. This paper focuses on the task of addressee recognition, identifying who is being addressed to take the next turn, a critical component unique to multi-party dialogue systems. A subset of the corpus was annotated with addressee information, revealing that explicit addressees are indicated in approximately 20% of conversational turns. To evaluate the task’s complexity, we benchmarked the performance of a large language model (GPT-4o) on addressee recognition. The results showed that GPT-4o achieved an accuracy only marginally above chance, underscoring the challenges of addressee recognition in multi-party dialogue. These findings highlight the need for further research to enhance the capabilities of large language models in understanding and navigating the intricacies of multi-party conversational dynamics.
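
A query of this kind could be posed to the model roughly as follows (assuming the OpenAI Python client; the prompt wording is our own illustration, not the paper's benchmark prompt):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def predict_addressee(transcript: str, speakers: list[str]) -> str:
    prompt = (
        "Below is a three-party conversation.\n"
        f"{transcript}\n"
        f"Who is addressed by the last utterance? Answer with one of: {speakers}."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```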

pdf bib
Will AI shape the way we speak? The emerging sociolinguistic influence of synthetic voices
Eva Szekely | Jura Miniota | Míša (Michaela) Hejná

The growing prevalence of conversational voice interfaces, powered by developments in both speech and language technologies, raises important questions about their influence on human communication. While written communication can signal identity through lexical and stylistic choices, voice-based interactions inherently amplify socioindexical elements – such as accent, intonation, and speech style – which more prominently convey social identity and group affiliation. There is evidence that even a passive medium such as television is likely to influence the audience’s linguistic patterns. Unlike passive media, conversational AI is interactive, creating a more immersive and reciprocal dynamic that holds greater potential to impact how individuals speak in everyday interactions. Such heightened influence can be expected to arise from phenomena such as acoustic-prosodic entrainment and linguistic accommodation, which occur naturally during interaction and lead users to adapt their speech patterns in response to the system. While this phenomenon is still emerging, its potential societal impact could provide organisations, movements, and brands with a subtle yet powerful avenue for shaping and controlling public perception and social identity. We argue that the socioindexical influence of AI-generated speech warrants attention and should become a focus of interdisciplinary research, leveraging new and existing methodologies and technologies to better understand its implications.