Hagen Soltau


2023

AnyTOD: A Programmable Task-Oriented Dialog System
Jeffrey Zhao | Yuan Cao | Raghav Gupta | Harrison Lee | Abhinav Rastogi | Mingqiu Wang | Hagen Soltau | Izhak Shafran | Yonghui Wu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

We propose AnyTOD, an end-to-end, zero-shot task-oriented dialog (TOD) system capable of zero-shot adaptation onto unseen tasks or domains. We view TOD as a program executed by a language model (LM), where program logic and ontology are provided by a designer as a schema. To enable generalization to unseen schemas and programs without prior training, AnyTOD adopts a neuro-symbolic approach. A neural LM keeps track of events that occur during a conversation, and a symbolic program implementing the dialog policy is executed to recommend actions AnyTOD should take. This approach drastically reduces data annotation and model training requirements, addressing the enduring challenge of rapidly adapting a TOD system to unseen tasks and domains. We demonstrate state-of-the-art results on the STAR, ABCD and SGD benchmarks. We also demonstrate strong zero-shot transfer ability in low-resource settings, such as zero-shot transfer onto MultiWOZ. In addition, we release STARv2, an updated version of the STAR dataset with richer annotations, for benchmarking zero-shot task transfer for end-to-end TOD models.
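
The abstract above describes a neuro-symbolic loop: a neural LM tracks what has happened in the conversation, while a designer-supplied symbolic program encodes the dialog policy and recommends the next actions. The sketch below only illustrates that division of labor under assumed interfaces; `lm`, `schema`, `policy_program`, and the prompt formats are hypothetical and not the authors' implementation.

```python
# Illustrative sketch of the neuro-symbolic turn loop described in the
# AnyTOD abstract. All objects and prompt formats here are hypothetical.

def anytod_turn(lm, schema, history, user_utterance):
    """One dialog turn: neural event tracking, then symbolic policy execution."""
    # Neural step: the LM reads the schema and conversation so far and
    # emits the events observed this turn (e.g., slot values, user intents).
    tracking_prompt = (
        f"{schema.description}\n{history}\nUser: {user_utterance}\nEvents:"
    )
    events = lm.generate(tracking_prompt)

    # Symbolic step: a designer-written policy program inspects the tracked
    # events and recommends which actions the system should take next.
    recommended_actions = schema.policy_program(events)

    # The LM then verbalizes a system response conditioned on the
    # recommended actions.
    response = lm.generate(
        f"{history}\nEvents: {events}\nActions: {recommended_actions}\nSystem:"
    )
    return events, recommended_actions, response
```

Because the policy lives in an ordinary program rather than in model weights, swapping in a new schema and policy requires no retraining, which is what enables the zero-shot transfer the abstract reports.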

DSTC-11: Speech Aware Task-Oriented Dialog Modeling Track
Hagen Soltau | Izhak Shafran | Mingqiu Wang | Abhinav Rastogi | Wei Han | Yuan Cao
Proceedings of The Eleventh Dialog System Technology Challenge

Most research on task-oriented dialog modeling is based on written text input. However, users often interact with practical dialog systems using speech as input. Typically, systems convert speech into text using an Automatic Speech Recognition (ASR) system, introducing errors. Furthermore, these systems do not address the differences between written and spoken language. Research on this topic is stymied by the lack of a public corpus. Motivated by these considerations, our goal in hosting the speech-aware dialog state tracking challenge was to create a public corpus or task which can be used to investigate the performance gap between the written and spoken forms of input, develop models that could alleviate this gap, and establish whether Text-to-Speech-based (TTS) systems are a reasonable surrogate for the more labor-intensive human data collection. We created three spoken versions of the popular written-domain MultiWoz task: (a) TTS-Verbatim, in which written user inputs were converted into speech waveforms using a TTS system; (b) Human-Verbatim, in which humans spoke the user inputs verbatim; and (c) Human-Paraphrased, in which humans paraphrased the user inputs. Additionally, we provided different forms of ASR output to encourage wider participation from teams that may not have access to state-of-the-art ASR systems. These included ASR transcripts, word time stamps, and latent representations of the audio (audio encoder outputs). In this paper, we describe the corpus, report results from participating teams, provide preliminary analyses of their results, and summarize the current state of the art in this domain.

2022

Knowledge-grounded Dialog State Tracking
Dian Yu | Mingqiu Wang | Yuan Cao | Laurent El Shafey | Izhak Shafran | Hagen Soltau
Findings of the Association for Computational Linguistics: EMNLP 2022

Knowledge (including structured knowledge such as schemas and ontologies, and unstructured knowledge such as web corpora) is a critical part of dialog understanding, especially for unseen tasks and domains. Traditionally, such domain-specific knowledge is encoded implicitly into model parameters for the execution of downstream tasks, which makes training inefficient. In addition, such models are not easily transferable to new tasks with different schemas. In this work, we propose to perform dialog state tracking grounded on knowledge encoded externally. We query relevant knowledge of various forms based on the dialog context, and this information grounds the prediction of dialog states. We demonstrate superior performance of our proposed method over strong baselines, especially in the few-shot learning setting.
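
As a rough illustration of the grounding idea in this abstract, the sketch below retrieves knowledge conditioned on the dialog context and lets the state prediction condition on it, rather than relying only on knowledge stored in model parameters. The `retriever` and `lm` interfaces are assumed for illustration and are not the paper's code.

```python
# Minimal sketch of knowledge-grounded dialog state tracking, assuming a
# retriever over external knowledge (schemas, ontologies, web text) and a
# text-to-text LM. Both interfaces are hypothetical.

def track_state(lm, retriever, dialog_context, top_k=5):
    # Query knowledge relevant to the current dialog context.
    knowledge_snippets = retriever.search(dialog_context, top_k=top_k)

    # Ground the state prediction on the retrieved knowledge by placing it
    # in the model input alongside the dialog.
    prompt = (
        "Knowledge:\n" + "\n".join(knowledge_snippets)
        + "\nDialog:\n" + dialog_context
        + "\nDialog state:"
    )
    return lm.generate(prompt)
```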

Unsupervised Slot Schema Induction for Task-oriented Dialog
Dian Yu | Mingqiu Wang | Yuan Cao | Izhak Shafran | Laurent Shafey | Hagen Soltau
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Carefully designed schemas describing how to collect and annotate dialog corpora are a prerequisite for building task-oriented dialog systems. In practical applications, manually designing schemas can be error-prone, laborious, iterative, and slow, especially when the schema is complicated. To alleviate this expensive and time-consuming process, we propose an unsupervised approach for slot schema induction from unlabeled dialog corpora. Leveraging in-domain language models and unsupervised parsing structures, our data-driven approach extracts candidate slots without constraints, followed by coarse-to-fine clustering to induce slot types. We compare our method against several strong supervised baselines and show significant performance improvement in slot schema induction on the MultiWoz and SGD datasets. We also demonstrate the effectiveness of induced schemas on downstream applications including dialog state tracking and response generation.
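
The pipeline sketched below is a hypothetical rendering of the coarse-to-fine clustering step mentioned in the abstract: candidate value spans (already extracted, e.g., by parsers or language-model signals) are embedded and grouped first into coarse clusters and then into finer slot types. The span-extraction step, the embedding function, and the clustering hyperparameters are assumptions for illustration only, not the authors' pipeline.

```python
# Hypothetical coarse-to-fine clustering of candidate slot-value spans into
# induced slot types.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def induce_slot_types(candidate_spans, embed, coarse_k=10, fine_threshold=0.5):
    """candidate_spans: extracted value strings; embed: str -> 1-D vector."""
    vectors = np.stack([embed(span) for span in candidate_spans])

    # Coarse pass: group spans into a small number of broad clusters.
    coarse_labels = AgglomerativeClustering(n_clusters=coarse_k).fit_predict(vectors)

    # Fine pass: within each coarse cluster, split further into slot types
    # using a distance threshold instead of a fixed cluster count.
    slot_types = {}
    for c in set(coarse_labels):
        idx = [i for i, lab in enumerate(coarse_labels) if lab == c]
        if len(idx) == 1:
            slot_types[(c, 0)] = [candidate_spans[idx[0]]]
            continue
        fine_labels = AgglomerativeClustering(
            n_clusters=None, distance_threshold=fine_threshold
        ).fit_predict(vectors[idx])
        for i, f in zip(idx, fine_labels):
            slot_types.setdefault((c, f), []).append(candidate_spans[i])
    return slot_types  # keys are induced slot types; values are their spans
```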

2020

The Medical Scribe: Corpus Development and Model Performance Analyses
Izhak Shafran | Nan Du | Linh Tran | Amanda Perry | Lauren Keyes | Mark Knichel | Ashley Domin | Lei Huang | Yu-hui Chen | Gang Li | Mingqiu Wang | Laurent El Shafey | Hagen Soltau | Justin Stuart Paul
Proceedings of the Twelfth Language Resources and Evaluation Conference

There is growing interest in creating tools to assist in clinical note generation using the audio of provider-patient encounters. Motivated by this goal and with the help of providers and medical scribes, we developed an annotation scheme to extract relevant clinical concepts. We used this annotation scheme to label a corpus of about 6k clinical encounters, which was then used to train a state-of-the-art tagging model. We report ontologies, labeling results, model performances, and detailed analyses of the results. Our results show that entities related to medications can be extracted with a relatively high accuracy of 0.90 F-score, followed by symptoms at 0.72 F-score and conditions at 0.57 F-score. In our task, we not only identify where the symptoms are mentioned but also map them to canonical forms as they appear in the clinical notes. Of the different types of errors, in about 19-38% of the cases we find that the model output was correct, and about 17-32% of the errors do not impact the clinical note. Taken together, the models developed in this work are more useful than the F-scores reflect, making this a promising approach for practical applications.

2001

Advances in meeting recognition
Alex Waibel | Hua Yu | Tanja Schultz | Yue Pan | Michael Bett | Martin Westphal | Hagen Soltau | Thomas Schaaf | Florian Metze
Proceedings of the First International Conference on Human Language Technology Research