Kallirroi Georgila

2024

Comparing Pre-Trained Embeddings and Domain-Independent Features for Regression-Based Evaluation of Task-Oriented Dialogue Systems
Kallirroi Georgila
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

We use Gaussian Process Regression to predict different types of ratings provided by users after interacting with various task-oriented dialogue systems. We compare the performance of domain-independent dialogue features (e.g., duration, number of filled slots, number of confirmed slots, word error rate) with pre-trained dialogue embeddings. These pre-trained dialogue embeddings are computed by averaging over sentence embeddings in a dialogue. Sentence embeddings are created using various models based on sentence transformers (appearing on the Hugging Face Massive Text Embedding Benchmark leaderboard) or by averaging over BERT word embeddings (varying the BERT layers used). We also compare pre-trained embeddings extracted from human transcriptions with pre-trained embeddings extracted from speech recognition outputs, to determine the robustness of these models to errors. Our results show that overall, for most types of user satisfaction ratings and advanced/recent (or sometimes less advanced/recent) pre-trained embedding models, using only pre-trained embeddings outperforms using only domain-independent features. However, this pattern varies depending on the type of rating and the embedding model used. Also, pre-trained embeddings are found to be robust to speech recognition errors, more advanced/recent embedding models do not always perform better than less advanced/recent ones, and larger models do not necessarily outperform smaller ones. The best prediction performance is achieved by combining pre-trained embeddings with domain-independent features.
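The pooling step described in the abstract above (a dialogue embedding computed by averaging over the sentence embeddings in a dialogue) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the toy vectors, and the toy feature values are hypothetical, and combining embeddings with domain-independent features by simple concatenation is an assumption, since the abstract does not say how the two are combined.

```python
from typing import List, Sequence


def dialogue_embedding(sentence_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-sentence vectors into one dialogue-level vector."""
    if not sentence_embeddings:
        raise ValueError("a dialogue needs at least one sentence embedding")
    dim = len(sentence_embeddings[0])
    if any(len(vec) != dim for vec in sentence_embeddings):
        raise ValueError("all sentence embeddings must share one dimensionality")
    n = len(sentence_embeddings)
    # Element-wise mean over the sentences of the dialogue.
    return [sum(vec[i] for vec in sentence_embeddings) / n for i in range(dim)]


def combined_features(embedding: Sequence[float],
                      domain_features: Sequence[float]) -> List[float]:
    """Concatenate the embedding with hand-crafted features (assumed scheme)."""
    return list(embedding) + list(domain_features)


# Toy 3-dimensional "sentence embeddings" for a two-sentence dialogue
# (made-up numbers; real sentence embeddings have hundreds of dimensions).
toy_sentences = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
# Made-up domain-independent features, e.g. duration and number of filled slots.
toy_domain = [42.5, 3.0]

dialogue_vec = dialogue_embedding(toy_sentences)
regressor_input = combined_features(dialogue_vec, toy_domain)
print(dialogue_vec)      # [2.0, 1.0, 1.0]
print(regressor_input)   # [2.0, 1.0, 1.0, 42.5, 3.0]
```

A vector such as `regressor_input` would then be the input to a regression model (Gaussian Process Regression in the paper) that predicts a user rating.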

2022

Strategy-level Entrainment of Dialogue System Users in a Creative Visual Reference Resolution Task
Deepthi Karkada | Ramesh Manuvinakurike | Maike Paetzel-Prüsmann | Kallirroi Georgila
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this work, we study entrainment of users playing a creative reference resolution game with an autonomous dialogue system. The language understanding module in our dialogue system leverages annotated human-wizard conversational data, openly available knowledge graphs, and crowd-augmented data. Unlike previous entrainment work, our dialogue system does not attempt to make the human conversation partner adopt lexical items in their dialogue, but rather to adapt their descriptive strategy to one that is simpler to parse for our natural language understanding unit. By deploying this dialogue system through a crowd-sourced study, we show that users indeed entrain on a “strategy-level” without the change of strategy impinging on their creativity. Our work thus presents a promising future research direction for developing dialogue management systems that can strategically influence people’s descriptive strategy to ease the system’s language understanding in creative tasks.

Evaluation of Off-the-shelf Speech Recognizers on Different Accents in a Dialogue Domain
Divya Tadimeti | Kallirroi Georgila | David Traum
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We evaluate several publicly available off-the-shelf (commercial and research) automatic speech recognition (ASR) systems on dialogue agent-directed English speech from speakers with General American vs. non-American accents. Our results show that the performance of the ASR systems for non-American accents is considerably worse than for General American accents. Depending on the recognizer, the absolute difference in performance between General American accents and all non-American accents combined can vary approximately from 2% to 12%, with relative differences varying approximately between 16% and 49%. This drop in performance becomes even larger when we consider specific categories of non-American accents, indicating a need for more diligent collection of and training on non-native English speaker data in order to narrow this performance gap. There are performance differences across ASR systems, and while the same general pattern holds, with more errors for non-American accents, there are some accents for which the best recognizer differs from the best recognizer overall. We expect these results to be useful for dialogue system designers in developing more robust, inclusive dialogue systems, and for ASR providers in taking into account performance requirements for different accents.
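The abstract above reports both absolute and relative performance differences. The distinction can be illustrated with a short sketch. Everything here is hypothetical: the function name, the `relative_to` parameter, and the error rates are made up for illustration and are not results from the paper, and the abstract does not state which group's error rate serves as the base for the relative difference, so both options are shown.

```python
from typing import Tuple


def accent_gap(baseline_error: float, other_error: float,
               relative_to: str = "other") -> Tuple[float, float]:
    """Return (absolute gap in percentage points, relative gap in percent).

    Both inputs are error rates in percent, e.g. word error rates.
    `relative_to` selects the denominator of the relative gap and is an
    assumption of this sketch: "other" or "baseline".
    """
    if relative_to not in ("other", "baseline"):
        raise ValueError("relative_to must be 'other' or 'baseline'")
    absolute = other_error - baseline_error
    base = other_error if relative_to == "other" else baseline_error
    if base == 0:
        raise ValueError("cannot compute a relative gap against a zero error rate")
    return absolute, 100.0 * absolute / base


# Made-up error rates: 12% for the baseline accent group, 16% for the other group.
absolute_gap, relative_gap = accent_gap(12.0, 16.0)
print(absolute_gap, relative_gap)   # 4.0 25.0

# The same 4-point absolute gap is a larger relative gap against the smaller base.
_, relative_gap_vs_baseline = accent_gap(12.0, 16.0, relative_to="baseline")
print(round(relative_gap_vs_baseline, 1))   # 33.3
```

The same absolute gap yields different relative gaps depending on the base, which is why the two figures are reported separately.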

2020

Predicting Ratings of Real Dialogue Participants from Artificial Data and Ratings of Human Dialogue Observers
Kallirroi Georgila | Carla Gordon | Volodymyr Yanov | David Traum
Proceedings of the Twelfth Language Resources and Evaluation Conference

We collected a corpus of dialogues in a Wizard of Oz (WOz) setting in the Internet of Things (IoT) domain. We asked users participating in these dialogues to rate the system on a number of aspects, namely, intelligence, naturalness, personality, friendliness, their enjoyment, overall quality, and whether they would recommend the system to others. Then we asked dialogue observers, i.e., Amazon Mechanical Turkers (MTurkers), to rate these dialogues on the same aspects. We also generated simulated dialogues between dialogue policies and simulated users and asked MTurkers to rate them again on the same aspects. Using linear regression, we developed dialogue evaluation functions based on features from the simulated dialogues and the MTurkers’ ratings, the WOz dialogues and the MTurkers’ ratings, and the WOz dialogues and the WOz participants’ ratings. We applied all these dialogue evaluation functions to a held-out portion of our WOz dialogues, and we report results on the predictive power of these different types of dialogue evaluation functions. Our results suggest that for three conversational aspects (intelligence, naturalness, overall quality) just training evaluation functions on simulated data could be sufficient.

Evaluation of Off-the-shelf Speech Recognizers Across Diverse Dialogue Domains
Kallirroi Georgila | Anton Leuski | Volodymyr Yanov | David Traum
Proceedings of the Twelfth Language Resources and Evaluation Conference

We evaluate several publicly available off-the-shelf (commercial and research) automatic speech recognition (ASR) systems across diverse dialogue domains (in US-English). Our evaluation is aimed at non-experts with limited experience in speech recognition. Our goal is not only to compare a variety of ASR systems on several diverse data sets but also to measure how much ASR technology has advanced since our previous large-scale evaluations on the same data sets. Our results show that the performance of each speech recognizer can vary significantly depending on the domain. Furthermore, despite major recent progress in ASR technology, current state-of-the-art speech recognizers perform poorly in domains that require special vocabulary and language models, and under noisy conditions. We expect that our evaluation will prove useful to ASR consumers and dialogue system designers.

2018

Edit me: A Corpus and a Framework for Understanding Natural Language Image Editing
Ramesh Manuvinakurike | Jacqueline Brixey | Trung Bui | Walter Chang | Doo Soon Kim | Ron Artstein | Kallirroi Georgila
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

DialEdit: Annotations for Spoken Conversational Image Editing
Ramesh Manuvinakurike | Jacqueline Brixey | Trung Bui | Walter Chang | Ron Artstein | Kallirroi Georgila
Proceedings of the 14th Joint ACL-ISO Workshop on Interoperable Semantic Annotation

A Dialogue Annotation Scheme for Weight Management Chat using the Trans-Theoretical Model of Health Behavior Change
Ramesh Manuvinakurike | Sumanth Bharawadj | Kallirroi Georgila
Proceedings of the 14th Joint ACL-ISO Workshop on Interoperable Semantic Annotation

Towards Understanding End-of-trip Instructions in a Taxi Ride Scenario
Deepthi Karkada | Ramesh Manuvinakurike | Kallirroi Georgila
Proceedings of the 14th Joint ACL-ISO Workshop on Interoperable Semantic Annotation

Conversational Image Editing: Incremental Intent Identification in a New Dialogue Task
Ramesh Manuvinakurike | Trung Bui | Walter Chang | Kallirroi Georgila
Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue

We present “conversational image editing”, a novel real-world application domain combining dialogue, visual information, and the use of computer vision. We discuss the importance of dialogue incrementality in this task, and build various models for incremental intent identification based on deep learning and traditional classification algorithms. We show how our model based on convolutional neural networks outperforms models based on random forests, long short-term memory networks, and conditional random fields. By training embeddings based on image-related dialogue corpora, we outperform pre-trained out-of-the-box embeddings for intent identification tasks. Our experiments also provide evidence that incremental intent processing may be more efficient for the user and could save time in accomplishing tasks.

2017

Using Reinforcement Learning to Model Incrementality in a Fast-Paced Dialogue Game
Ramesh Manuvinakurike | David DeVault | Kallirroi Georgila
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

We apply Reinforcement Learning (RL) to the problem of incremental dialogue policy learning in the context of a fast-paced dialogue game. We compare the policy learned by RL with a high-performance baseline policy which has been shown to perform very efficiently (nearly as well as humans) in this dialogue game. The RL policy outperforms the baseline policy in offline simulations (based on real user data). We provide a detailed comparison of the RL policy and the baseline policy, including information about how much effort and time it took to develop each one of them. We also highlight the cases where the RL policy performs better, and show that understanding the RL policy can provide valuable insights which can inform the creation of an even better rule-based policy.

The current practice in virtual human dialogue systems is to use professional human recordings or limited-domain speech synthesis. Both approaches lead to good performance but at a high cost. To determine the best trade-off between performance and cost, we perform a systematic evaluation of human and synthesized voices with regard to naturalness, conversational aspect, and likability. We vary the type (in-domain vs. out-of-domain), length, and content of utterances, and take into account the age and native language of raters as well as their familiarity with speech synthesis. We present detailed results from two studies, a pilot one and one run on Amazon's Mechanical Turk. Our results suggest that a professional human voice can supersede both an amateur human voice and synthesized voices. Also, a high-quality general-purpose voice or a good limited-domain voice can perform better than amateur human recordings. We do not find any significant differences between the performance of a high-quality general-purpose voice and a limited-domain voice, both trained with speech recorded by actors. As expected, the high-quality general-purpose voice is rated higher than the limited-domain voice for out-of-domain sentences and lower for in-domain sentences. There is also a trend for long or negative-content utterances to receive lower ratings.

Reinforcement Learning of Question-Answering Dialogue Policies for Virtual Museum Guides
Teruhisa Misu | Kallirroi Georgila | Anton Leuski | David Traum
Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2011

2010

Practical Evaluation of Speech Recognizers for Virtual Human Dialogue Systems
Xuchen Yao | Pravin Bhutada | Kallirroi Georgila | Kenji Sagae | Ron Artstein | David Traum
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We perform a large-scale evaluation of multiple off-the-shelf speech recognizers across diverse domains for virtual human dialogue systems. Our evaluation is aimed at speech recognition consumers and potential consumers with limited experience with readily available recognizers. We focus on practical factors to determine what levels of performance can be expected from different available recognizers in various projects featuring different types of conversational utterances. Our results show that there is no single recognizer that outperforms all other recognizers in all domains. The performance of each recognizer may vary significantly depending on the domain, the size and perplexity of the corpus, the out-of-vocabulary rate, and whether acoustic and language model adaptation has been used or not. We expect that our evaluation will prove useful to other speech recognition consumers, especially in the dialogue community, and will shed some light on the key problem in spoken dialogue systems of selecting the most suitable available speech recognition system for a particular application, and what impact training will have.

Learning Dialogue Strategies from Older and Younger Simulated Users
Kallirroi Georgila | Maria Wolters | Johanna Moore
Proceedings of the SIGDIAL 2010 Conference

Cross-Domain Speech Disfluency Detection
Kallirroi Georgila | Ning Wang | Jonathan Gratch
Proceedings of the SIGDIAL 2010 Conference

2009

Using Integer Linear Programming for Detecting Speech Disfluencies
Kallirroi Georgila
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

Evaluating the Effectiveness of Information Presentation in a Full End-To-End Dialogue System
Taghi Paksima | Kallirroi Georgila | Johanna Moore
Proceedings of the SIGDIAL 2009 Conference

2008

Hybrid Reinforcement/Supervised Learning of Dialogue Policies from Fixed Data Sets
James Henderson | Oliver Lemon | Kallirroi Georgila
Computational Linguistics, Volume 34, Number 4, December 2008

A Fully Annotated Corpus for Studying the Effect of Cognitive Ageing on Users’ Interactions with Spoken Dialogue Systems
Kallirroi Georgila | Maria Wolters | Vasilis Karaiskos | Melissa Kronenthal | Robert Logie | Neil Mayo | Johanna Moore | Matt Watson
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present a corpus of interactions of older and younger users with nine different dialogue systems. The corpus has been fully transcribed and annotated with dialogue acts and Information State Update (ISU) representations of dialogue context. Users not only underwent a comprehensive battery of cognitive assessments, but they also rated the usability of each dialogue system on a standardised questionnaire. In this paper, we discuss the corpus collection and outline the semi-automatic methods we used for discourse-level annotations. We expect that the corpus will provide a key resource for modelling older people’s interaction with spoken dialogue systems.

Simulating the Behaviour of Older versus Younger Users when Interacting with Spoken Dialogue Systems
Kallirroi Georgila | Maria Wolters | Johanna Moore
Proceedings of ACL-08: HLT, Short Papers