Task-oriented dialogue systems employ natural language understanding (NLU) modules to manage the intricate and continually evolving business requirements of production systems. Although the advent of Large Language Models (LLMs) has introduced extraordinary chitchat capabilities, integrating LLMs into such systems brings new difficulties. One of the main challenges is the lack of dedicated datasets for training and evaluating systems that offer both capabilities: chitchat and task handling. Because NLU modules are designed to handle complex task requests while LLMs are employed specifically to answer chitchat interactions, the system must correctly identify the user's functional intent in order to invoke the appropriate module. This paper presents CTFusion, a multi-turn dialogue generation framework designed to support the training and evaluation of production systems that offer both capabilities. Using the framework, we generate a multi-turn dialogue dataset for an in-vehicle speech recognition system, comprising 41,211 dialogues spanning 240 real-world in-vehicle intents, and train the In-vehicle Context Sensor (ICS), a lightweight model that successfully identifies the driver's functional intent. ICS outperforms all baseline models across various experimental settings, demonstrating that CTFusion can help generate relevant datasets with complex business logic, which can in turn help production systems leverage LLMs for their chitchat capabilities.
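The dispatch pattern described above (a lightweight router deciding between a task NLU module and a chitchat LLM) can be illustrated with a minimal sketch. The snippet below is not the paper's ICS model; it assumes a toy two-way classifier built with scikit-learn, and the example utterances and the stubbed module handlers are all hypothetical.

```python
# Minimal, hypothetical sketch of functional-intent routing between a
# task-oriented NLU module and a chitchat LLM. This is NOT the ICS model
# from the paper; it only illustrates the dispatch pattern described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (invented examples): 1 = task request, 0 = chitchat.
utterances = [
    "turn on the air conditioning",          # task
    "navigate to the nearest gas station",   # task
    "open the sunroof",                      # task
    "I had a rough day at work",             # chitchat
    "tell me something funny",               # chitchat
    "what do you think about jazz?",         # chitchat
]
labels = [1, 1, 1, 0, 0, 0]

router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(utterances, labels)

def dispatch(utterance: str) -> str:
    """Route the utterance to the task NLU or the chitchat LLM (both stubbed)."""
    if router.predict([utterance])[0] == 1:
        return f"[NLU module] parsing task request: {utterance!r}"
    return f"[chitchat LLM] generating open-ended reply to: {utterance!r}"

print(dispatch("set the cabin temperature to 21 degrees"))
print(dispatch("I'm feeling a bit tired today"))
```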
Conversational agents have traditionally been developed for either task-oriented dialogue (TOD) or open-ended chitchat, with limited progress in unifying the two. Yet real-world conversations naturally involve fluid transitions between these modes. To address this gap, we introduce TACT (TOD-And-Chitchat Transition), a dataset designed for transition-aware dialogue modeling that incorporates structurally diverse and integrated mode flows. TACT supports both user- and agent-driven mode switches, enabling robust modeling of complex conversational dynamics. To evaluate an agent's ability to initiate and recover from mode transitions, we propose two new metrics, Switch and Recovery. Models trained on TACT outperform baselines in both intent detection and mode transition handling. Moreover, applying Direct Preference Optimization (DPO) to TACT-trained models yields additional gains, achieving 75.74% joint mode-intent accuracy and a 70.1% win rate against GPT-4o in human evaluation. These results demonstrate that pairing structurally diverse data with DPO enhances response quality and transition control, paving the way for more proactive and transition-aware conversational agents.
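The abstract names the Switch and Recovery metrics without defining them, so the sketch below is only one plausible turn-level reading, assumed purely for illustration: Switch is read as the fraction of gold mode-transition turns at which the model also switches mode, and Recovery as the fraction of post-transition turns at which the model predicts the gold mode again. The function names, formulas, and toy dialogue are assumptions, not the paper's definitions.

```python
# Hypothetical sketch of transition-aware evaluation. The exact definitions of
# the paper's Switch and Recovery metrics are not given in the abstract; the
# formulas below are illustrative assumptions only.
from typing import List

def switch_rate(gold_modes: List[str], pred_modes: List[str]) -> float:
    """Among turns where the gold dialogue switches mode (e.g. chitchat -> tod),
    how often does the model also switch its mode at that turn? (assumed reading)"""
    hits, total = 0, 0
    for t in range(1, len(gold_modes)):
        if gold_modes[t] != gold_modes[t - 1]:       # gold-side mode transition
            total += 1
            if pred_modes[t] != pred_modes[t - 1]:   # model also transitions
                hits += 1
    return hits / total if total else 0.0

def recovery_rate(gold_modes: List[str], pred_modes: List[str]) -> float:
    """On the turn immediately following a gold transition, how often does the
    model predict the correct mode again? (assumed reading)"""
    hits, total = 0, 0
    for t in range(1, len(gold_modes) - 1):
        if gold_modes[t] != gold_modes[t - 1]:
            total += 1
            if pred_modes[t + 1] == gold_modes[t + 1]:
                hits += 1
    return hits / total if total else 0.0

# Toy dialogue: per-turn modes, gold vs. model prediction (invented data).
gold = ["chitchat", "chitchat", "tod", "tod", "chitchat"]
pred = ["chitchat", "chitchat", "tod", "chitchat", "chitchat"]
print(switch_rate(gold, pred), recovery_rate(gold, pred))
```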
To enhance the explainability of meeting summarization, we construct a new dataset called “ExplainMeetSum,” an augmented version of QMSum, by newly annotating evidence sentences that faithfully “explain” a summary. Using ExplainMeetSum, we propose Multi-DYLE, a novel multiple-extractor-guided summarization method that extensively generalizes DYLE to enable the use of a supervised extractor based on human-aligned extractive oracles. We further present an explainability-aware task, named “Explainable Evidence Extraction” (E3), which aims to automatically detect all evidence sentences that support a given summary. Experimental results on the QMSum dataset show that the proposed Multi-DYLE outperforms DYLE with gains of up to 3.13 in ROUGE-1 score. We also present initial results on the E3 task under settings that use separate and joint evaluation metrics.
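One natural way to score the evidence-detection task described above is sentence-level set matching between predicted and annotated evidence sentences. The sketch below assumes exactly that (precision/recall/F1 over sentence indices); it is an illustrative assumption, not the paper's actual E3 evaluation protocol, and the toy indices are invented.

```python
# Hypothetical sketch of evaluating evidence extraction (E3-style) as
# sentence-level set matching against annotated evidence sentences.
# This is an illustrative assumption, not the paper's exact metric.
from typing import Set, Tuple

def evidence_prf(gold_ids: Set[int], pred_ids: Set[int]) -> Tuple[float, float, float]:
    """Precision/recall/F1 over predicted vs. gold evidence sentence indices."""
    tp = len(gold_ids & pred_ids)
    precision = tp / len(pred_ids) if pred_ids else 0.0
    recall = tp / len(gold_ids) if gold_ids else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: sentence indices in a meeting transcript (invented data).
gold = {3, 17, 42}   # human-annotated evidence sentences for one summary sentence
pred = {3, 42, 55}   # sentences selected by an extractor
print(evidence_prf(gold, pred))   # -> approximately (0.667, 0.667, 0.667)
```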