2025
Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model
Emre Can Acikgoz | Jeremiah Greer | Akul Datta | Ze Yang | William Zeng | Oussama Elachqar | Emmanouil Koukoumidis | Dilek Hakkani-Tür | Gokhan Tur
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) with API-calling capabilities have enabled the building of effective Language Agents (LAs), while also revolutionizing the conventional task-oriented dialogue (TOD) paradigm. However, current approaches face a critical dilemma: TOD systems are often trained on a limited set of target APIs and require new data to maintain quality when interfacing with new services, while LAs are not trained to maintain user intent over multi-turn conversations. Because both robust multi-turn management and advanced function calling are crucial for effective conversational agents, we evaluate these skills on three popular benchmarks: MultiWOZ 2.4 (TOD), BFCL V3 (LA), and API-Bank (LA). Our analyses reveal that specialized approaches excel in one domain but underperform in the other. To bridge this gap, we introduce **CoALM** (**C**onversational **A**gentic **L**anguage **M**odel), a unified approach that integrates both conversational and agentic capabilities. We also create **CoALM-IT**, a carefully constructed multi-task dataset that interleaves multi-turn ReAct reasoning with complex API usage. Using CoALM-IT, we train three models, **CoALM 8B**, **CoALM 70B**, and **CoALM 405B**, which outperform top domain-specific models, including GPT-4o, across all three benchmarks. This demonstrates the feasibility of a single-model approach for both TOD and LA, setting a new standard for conversational agents.
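The multi-turn ReAct reasoning that CoALM-IT interleaves with API usage alternates Thought/Action/Observation steps until the agent can respond. A minimal, self-contained sketch of one such turn is below; the tool registry (`find_hotel`) and the scripted policy are illustrative assumptions, not the paper's implementation:

```python
# Toy sketch of a ReAct-style turn: the "policy" alternates Action and
# Observation steps until it emits a final response. In a real system
# the policy would be an LLM generating these steps.

def find_hotel(area):
    # Stand-in for a real API call (e.g., a MultiWOZ-style booking backend).
    return {"name": "Grand Hotel", "area": area}

TOOLS = {"find_hotel": find_hotel}

def react_turn(user_utterance, policy):
    """Run Action/Observation steps until the policy responds."""
    trace, observation = [], None
    while True:
        step = policy(user_utterance, observation)
        trace.append(step)
        if step["type"] == "respond":
            return step["text"], trace
        if step["type"] == "action":
            fn = TOOLS[step["tool"]]
            observation = fn(**step["args"])
            trace.append({"type": "observation", "value": observation})

def scripted_policy(utterance, observation):
    # Hypothetical two-step script standing in for LLM generation.
    if observation is None:
        return {"type": "action", "tool": "find_hotel",
                "args": {"area": "north"}}
    return {"type": "respond",
            "text": f"I found {observation['name']} in the north."}

reply, trace = react_turn("Find me a hotel in the north.", scripted_policy)
```

The trace (action, observation, response) is the kind of interleaved supervision a dataset like CoALM-IT can provide.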
Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling
Suvodip Dey | Yi-Jyun Sun | Gokhan Tur | Dilek Hakkani-Tür
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent LLMs have enabled significant advances for conversational agents. However, they are also well known to hallucinate, producing responses that seem plausible but are factually incorrect. At the same time, users tend to over-rely on LLM-based AI agents, accepting the AI's suggestions even when they are wrong. Adding positive friction, such as explanations or user confirmations, has been proposed as a mitigation in AI-supported decision-making systems. In this paper, we propose an accountability model for LLM-based task-oriented dialogue agents that addresses user overreliance via friction turns in cases of model uncertainty and errors associated with dialogue state tracking (DST). The accountability model is an augmented LLM with an additional accountability head that functions as a binary classifier to predict the relevant slots of the dialogue state mentioned in the conversation. We perform our experiments with multiple backbone LLMs on two established benchmarks (MultiWOZ and Snips). Our empirical findings demonstrate that the proposed approach not only enables reliable estimation of AI agent errors but also guides the decoder toward more accurate actions. We observe an absolute improvement of around 3% in the joint goal accuracy (JGA) of DST output when accountability heads are incorporated into modern LLMs. Self-correcting the detected errors further increases JGA from 67.13 to 70.51, achieving state-of-the-art DST performance. Finally, we show that error correction through user confirmations (friction turns) achieves a similar performance gain, highlighting its potential to reduce user overreliance.
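The accountability head described above is, at its core, a per-slot binary classifier on top of an LLM hidden state. A minimal sketch under toy assumptions (random weights, a made-up slot inventory, and a plain vector standing in for the LLM's pooled hidden state):

```python
import numpy as np

# Sketch of an "accountability head": one logistic classifier per slot,
# applied to a pooled hidden state. Slots and weights here are toy values.
SLOTS = ["hotel-area", "hotel-price", "restaurant-food"]

rng = np.random.default_rng(0)
W = rng.normal(size=(len(SLOTS), 8))   # one weight row per slot
b = np.zeros(len(SLOTS))

def accountability_head(hidden, threshold=0.5):
    """Predict which dialogue-state slots the conversation has mentioned."""
    logits = W @ hidden + b
    probs = 1.0 / (1.0 + np.exp(-logits))        # per-slot sigmoid
    return {s: bool(p > threshold) for s, p in zip(SLOTS, probs)}, probs

hidden = rng.normal(size=8)            # stand-in for an LLM hidden state
mentioned, probs = accountability_head(hidden)
# Disagreement between these predictions and the decoded dialogue state
# can trigger a friction turn, e.g., asking the user to confirm a slot.
```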
SMART: Self-Aware Agent for Tool Overuse Mitigation
Cheng Qian | Emre Can Acikgoz | Hongru Wang | Xiusi Chen | Avirup Sil | Dilek Hakkani-Tür | Gokhan Tur | Heng Ji
Findings of the Association for Computational Linguistics: ACL 2025
Current Large Language Model (LLM) agents demonstrate strong reasoning and tool-use capabilities but often lack self-awareness, failing to balance these approaches effectively. This imbalance leads to **Tool Overuse**, where models unnecessarily rely on external tools for tasks solvable with parametric knowledge, increasing computational overhead. Inspired by human metacognition, we introduce **SMART** (Strategic Model-Aware Reasoning with Tools), a paradigm that enhances an agent's self-awareness to optimize task handling and reduce tool overuse. To support this paradigm, we introduce **SMART-ER**, a dataset spanning three domains, where reasoning alternates between parametric knowledge and tool-dependent steps, with each step enriched by rationales explaining when tools are necessary. Through supervised training, we develop **SMARTAgent**, a family of models that dynamically balance parametric knowledge and tool use. Evaluations show that SMARTAgent reduces tool use by 24% while improving performance by over 37%, enabling 7B-scale models to match their 70B counterparts and GPT-4. SMARTAgent also generalizes to out-of-distribution test data such as GSM8K and MINTQA, maintaining accuracy with just one-fifth of the tool calls. These results highlight the potential of strategic tool use to enhance reasoning, mitigate overuse, and bridge the gap between model size and performance, advancing intelligent and resource-efficient agent designs.
ReSpAct: Harmonizing Reasoning, Speaking, and Acting Towards Building Large Language Model-Based Conversational AI Agents
Vardhan Dongre | Xiaocheng Yang | Emre Can Acikgoz | Suvodip Dey | Gokhan Tur | Dilek Hakkani-Tur
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology
Large language model (LLM)-based agents are increasingly used to interact with external environments (e.g., games, APIs) and solve tasks. However, current frameworks do not enable these agents to work with users, aligning on task details and reaching user-defined goals; instead, in ambiguous situations, these agents may make decisions based on assumptions. This work introduces ReSpAct (Reason, Speak, and Act), a novel framework that synergistically combines the essential skills for building task-oriented “conversational” agents. Expanding on the ReAct approach, ReSpAct enables agents to interpret user instructions, reason about complex tasks, execute appropriate actions, and engage in dynamic dialogue to seek guidance, clarify ambiguities, understand user preferences, resolve problems, and use intermediate user feedback and responses to update their plans. We evaluated ReSpAct with GPT-4 in environments supporting user interaction, such as task-oriented dialogue (MultiWOZ) and interactive decision-making (AlfWorld, WebShop). ReSpAct is flexible enough to incorporate dynamic user feedback and addresses prevalent issues like error propagation and agents getting stuck in reasoning loops, yielding more interpretable, human-like task-solving trajectories than baselines that rely solely on reasoning traces. In the two interactive decision-making benchmarks, AlfWorld and WebShop, ReSpAct outperforms the strong reasoning-only method ReAct by absolute success rates of 6% and 4%, respectively. In the task-oriented dialogue benchmark MultiWOZ, ReSpAct improves Inform and Success scores by 5.5% and 3%, respectively.
2024
SG-RAG: Multi-Hop Question Answering With Large Language Models Through Knowledge Graphs
Ahmmad O. M. Saleh | Gokhan Tur | Yucel Saygin
Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024)
Dialog Flow Induction for Constrainable LLM-Based Chatbots
Stuti Agrawal | Pranav Pillai | Nishi Uppuluri | Revanth Gangi Reddy | Sha Li | Gokhan Tur | Dilek Hakkani-Tur | Heng Ji
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
LLM-driven dialog systems are used in a diverse set of applications, ranging from healthcare to customer service. However, given their generalization capability, it is difficult to ensure that these chatbots stay within the boundaries of their specialized domains, potentially resulting in inaccurate information and irrelevant responses. This paper introduces an unsupervised approach for automatically inducing domain-specific dialog flows that can be used to constrain LLM-based chatbots. We introduce two variants of dialog flow, based on the availability of in-domain conversation instances. Through human and automatic evaluation over 24 dialog domains, we demonstrate that our high-quality, data-guided dialog flows achieve better domain coverage, thereby overcoming the need for extensive manual crafting of such flows.
2023
MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages
Jack FitzGerald | Christopher Hench | Charith Peris | Scott Mackie | Kay Rottmann | Ana Sanchez | Aaron Nash | Liam Urbach | Vishesh Kakarala | Richa Singh | Swetha Ranganath | Laurie Crist | Misha Britan | Wouter Leeuwis | Gokhan Tur | Prem Natarajan
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We present the MASSIVE dataset: Multilingual Amazon SLU Resource Package (SLURP) for Slot-filling, Intent classification, and Virtual-assistant Evaluation. MASSIVE contains 1M realistic, parallel, labeled virtual-assistant utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. It was created by tasking professional translators to localize the English-only SLURP dataset into 50 typologically diverse languages from 29 genera. We also present modeling results on XLM-R and mT5, including exact-match accuracy, intent-classification accuracy, and slot-filling F1 score. Our dataset, modeling code, and models are publicly available.
2022
CGF: Constrained Generation Framework for Query Rewriting in Conversational AI
Jie Hao | Yang Liu | Xing Fan | Saurabh Gupta | Saleh Soltan | Rakesh Chada | Pradeep Natarajan | Chenlei Guo | Gokhan Tur
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
In conversational AI agents, Query Rewriting (QR) plays a crucial role in reducing user friction and satisfying users' daily demands. User friction arises for various reasons, such as errors in the conversational AI system, users' accents, or abridged language. In this work, we present a novel Constrained Generation Framework (CGF) for query rewriting at both global and personalized levels. It is based on an encoder-decoder framework, where the encoder takes the query and its previous dialogue turns as input to form a context-enhanced representation, and the decoder uses constrained decoding to generate rewrites within a pre-defined global or personalized constrained decoding space. Extensive offline and online A/B experiments show that the proposed CGF significantly boosts query-rewriting performance.
VISITRON: Visual Semantics-Aligned Interactively Trained Object-Navigator
Ayush Shrivastava | Karthik Gopalakrishnan | Yang Liu | Robinson Piramuthu | Gokhan Tur | Devi Parikh | Dilek Hakkani-Tur
Findings of the Association for Computational Linguistics: ACL 2022
Interactive robots navigating photo-realistic environments need to be trained to effectively leverage and handle the dynamic nature of dialogue in addition to the challenges underlying vision-and-language navigation (VLN). In this paper, we present VISITRON, a multi-modal Transformer-based navigator better suited to the interactive regime inherent to Cooperative Vision-and-Dialog Navigation (CVDN). VISITRON is trained to: i) identify and associate object-level concepts and semantics between the environment and dialogue history, ii) identify when to interact vs. navigate via imitation learning of a binary classification head. We perform extensive pre-training and fine-tuning ablations with VISITRON to gain empirical insights and improve performance on CVDN. VISITRON’s ability to identify when to interact leads to a natural generalization of the game-play mode introduced by Roman et al. (2020) for enabling the use of such models in different environments. VISITRON is competitive with models on the static CVDN leaderboard and attains state-of-the-art performance on the Success weighted by Path Length (SPL) metric.
2021
Generative Conversational Networks
Alexandros Papangelis | Karthik Gopalakrishnan | Aishwarya Padmakumar | Seokhwan Kim | Gokhan Tur | Dilek Hakkani-Tur
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Inspired by recent work in meta-learning and generative teaching networks, we propose a framework called Generative Conversational Networks, in which conversational agents learn to generate their own labelled training data (given some seed data) and then train themselves from that data to perform a given task. We use reinforcement learning to optimize the data generation process where the reward signal is the agent’s performance on the task. The task can be any language-related task, from intent detection to full task-oriented conversations. In this work, we show that our approach is able to generalise from seed data and performs well in limited data and limited computation settings, with significant gains for intent detection and slot tagging across multiple datasets: ATIS, TOD, SNIPS, and Restaurants8k. We show an average improvement of 35% in intent detection and 21% in slot tagging over a baseline model trained from the seed data. We also conduct an analysis of the novelty of the generated data and provide generated examples for intent detection, slot tagging, and non-goal oriented conversations.
2020
Controllable Text Generation with Focused Variation
Lei Shu | Alexandros Papangelis | Yi-Chia Wang | Gokhan Tur | Hu Xu | Zhaleh Feizollahi | Bing Liu | Piero Molino
Findings of the Association for Computational Linguistics: EMNLP 2020
This work introduces Focused-Variation Network (FVN), a novel model to control language generation. The main problems in previous controlled language generation models range from the difficulty of generating text according to the given attributes, to the lack of diversity of the generated texts. FVN addresses these issues by learning disjoint discrete latent spaces for each attribute inside codebooks, which allows for both controllability and diversity, while at the same time generating fluent text. We evaluate FVN on two text generation datasets with annotated content and style, and show state-of-the-art performance as assessed by automatic and human evaluations.
2019
Collaborative Multi-Agent Dialogue Model Training Via Reinforcement Learning
Alexandros Papangelis | Yi-Chia Wang | Piero Molino | Gokhan Tur
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue
Some of the major challenges in training conversational agents include the lack of large-scale data of real-world complexity, defining appropriate evaluation measures, and managing meaningful conversations across many topics over long periods of time. Moreover, most works tend to assume that the conversational agent’s environment is stationary, a somewhat strong assumption. To remove this assumption and overcome the lack of data, we take a step away from the traditional training pipeline and model the conversation as a stochastic collaborative game. Each agent (player) has a role (“assistant”, “tourist”, “eater”, etc.) and their own objectives, and can only interact via language they generate. Each agent, therefore, needs to learn to operate optimally in an environment with multiple sources of uncertainty (its own LU and LG, the other agent’s LU, Policy, and LG). In this work, we present the first complete attempt at concurrently training conversational agents that communicate only via self-generated language and show that they outperform supervised and deep learning baselines.
Flexibly-Structured Model for Task-Oriented Dialogues
Lei Shu | Piero Molino | Mahdi Namazifar | Hu Xu | Bing Liu | Huaixiu Zheng | Gokhan Tur
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue
This paper proposes a novel end-to-end architecture for task-oriented dialogue systems. It is based on a simple and practical yet very effective sequence-to-sequence approach, where language understanding and state tracking tasks are modeled jointly with a structured copy-augmented sequential decoder and a multi-label decoder for each slot. The policy engine and language generation tasks are modeled jointly following that. The copy-augmented sequential decoder deals with new or unknown values in the conversation, while the multi-label decoder combined with the sequential decoder ensures the explicit assignment of values to slots. On the generation part, slot binary classifiers are used to improve performance. This architecture is scalable to real-world scenarios and is shown through an empirical evaluation to achieve state-of-the-art performance on both the Cambridge Restaurant dataset and the Stanford in-car assistant dataset.
2018
Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems
Bing Liu | Gokhan Tür | Dilek Hakkani-Tür | Pararth Shah | Larry Heck
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
In this work, we present a hybrid learning method for training task-oriented dialogue systems through online user interactions. Popular methods for learning task-oriented dialogues include applying reinforcement learning with user feedback on top of supervised pre-training. The efficiency of such methods may suffer from a mismatch in dialogue state distribution between the offline training and online interactive learning stages. To address this challenge, we propose a hybrid imitation and reinforcement learning method with which a dialogue agent can learn effectively from its interactions with users, drawing on both human teaching and feedback. We design a neural-network-based task-oriented dialogue agent that can be optimized end-to-end with the proposed method. Experimental results show that our end-to-end dialogue agent can learn effectively from the mistakes it makes via imitation learning from user teaching. Applying reinforcement learning with user feedback after the imitation learning stage further improves the agent's ability to complete tasks successfully.
Bootstrapping a Neural Conversational Agent with Dialogue Self-Play, Crowdsourcing and On-Line Reinforcement Learning
Pararth Shah | Dilek Hakkani-Tür | Bing Liu | Gokhan Tür
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)
End-to-end neural models show great promise towards building conversational agents that are trained from data and on-line experience using supervised and reinforcement learning. However, these models require a large corpus of dialogues to learn effectively. For goal-oriented dialogues, such datasets are expensive to collect and annotate, since each task involves a separate schema and database of entities. Further, the Wizard-of-Oz approach commonly used for dialogue collection does not provide sufficient coverage of salient dialogue flows, which is critical for guaranteeing an acceptable task completion rate in consumer-facing conversational agents. In this paper, we study a recently proposed approach for building an agent for arbitrary tasks by combining dialogue self-play and crowd-sourcing to generate fully-annotated dialogues with diverse and natural utterances. We discuss the advantages of this approach for industry applications of conversational agents, wherein an agent can be rapidly bootstrapped to deploy in front of users and further optimized via interactive learning from actual users of the system.
2017
Sequential Dialogue Context Modeling for Spoken Language Understanding
Ankur Bapna | Gokhan Tür | Dilek Hakkani-Tür | Larry Heck
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue
Spoken Language Understanding (SLU) is a key component of goal-oriented dialogue systems, parsing user utterances into semantic frame representations. Traditionally, SLU does not utilize dialogue history beyond the previous system turn, and contextual ambiguities are resolved by downstream components. In this paper, we explore novel approaches for modeling dialogue context in a recurrent neural network (RNN)-based language understanding system. We propose the Sequential Dialogue Encoder Network, which encodes context from the dialogue history in chronological order. We compare the performance of our proposed architecture with two context models: one that uses only the previous turn's context, and another that encodes dialogue context in a memory network but loses the order of utterances in the dialogue history. Experiments on a multi-domain dialogue dataset demonstrate that the proposed architecture reduces semantic frame error rates.
2013
Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression
Asli Celikyilmaz | Dilek Hakkani-Tur | Gokhan Tur | Ruhi Sarikaya
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
2012
Mining Search Query Logs for Spoken Language Understanding
Dilek Hakkani-Tür | Gokhan Tür | Asli Celikyilmaz
NAACL-HLT Workshop on Future directions and needs in the Spoken Dialog Community: Tools and Data (SDCTD 2012)
2010
NAACL HLT 2010 Tutorial Abstracts
Jason Baldwin | Peter Clark | Gokhan Tur
NAACL HLT 2010 Tutorial Abstracts
LDA Based Similarity Modeling for Question Answering
Asli Celikyilmaz | Dilek Hakkani-Tur | Gokhan Tur
Proceedings of the NAACL HLT 2010 Workshop on Semantic Search
2009
Anchored Speech Recognition for Question Answering
Sibel Yaman | Gokhan Tur | Dimitra Vergyri | Dilek Hakkani-Tur | Mary Harper | Wen Wang
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers
Who, What, When, Where, Why? Comparing Multiple Approaches to the Cross-Lingual 5W Task
Kristen Parton | Kathleen R. McKeown | Bob Coyne | Mona T. Diab | Ralph Grishman | Dilek Hakkani-Tür | Mary Harper | Heng Ji | Wei Yun Ma | Adam Meyers | Sara Stolbach | Ang Sun | Gokhan Tur | Wei Xu | Sibel Yaman
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP
2007
Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies
Fuliang Weng | Ye-Yi Wang | Gokhan Tur
Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies
2005
Using Semantic and Syntactic Graphs for Call Classification
Dilek Hakkani-Tür | Gokhan Tur | Ananlada Chotimongkol
Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing
2004
Bootstrapping Spoken Dialog Systems with Data Reuse
Giuseppe Di Fabbrizio | Gokhan Tur | Dilek Hakkani-Tür
Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004
2001
Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation
G. Tur | D. Hakkani-Tur | A. Stolcke | E. Shriberg
Computational Linguistics, Volume 27, Number 1, March 2001
2000
Statistical Morphological Disambiguation for Agglutinative Languages
Dilek Z. Hakkani-Tür | Kemal Oflazer | Gökhan Tür
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics
1998
Tagging English by Path Voting Constraints
Gokhan Tur | Kemal Oflazer
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics
Implementing Voting Constraints with Finite State Transducers
Kemal Oflazer | Gokhan Tur
Finite State Methods in Natural Language Processing
1997
Morphological Disambiguation by Voting Constraints
Kemal Oflazer | Gokhan Tur
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics
1996
Combining Hand-crafted Rules and Unsupervised Learning in Constraint-based Morphological Disambiguation
Kemal Oflazer | Gokhan Tur
Conference on Empirical Methods in Natural Language Processing