Alexander Schmitt


Could Speaker, Gender or Age Awareness be beneficial in Speech-based Emotion Recognition?
Maxim Sidorov | Alexander Schmitt | Eugene Semenkin | Wolfgang Minker
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Emotion Recognition (ER) is an important part of dialogue analysis which can be used in order to improve the quality of Spoken Dialogue Systems (SDSs). The emotional hypothesis of the current response of an end-user might be utilised by the dialogue manager component in order to change the SDS strategy which could result in a quality enhancement. In this study additional speaker-related information is used to improve the performance of the speech-based ER process. The analysed information is the speaker identity, gender and age of a user. Two schemes are described here, namely, using additional information as an independent variable within the feature vector and creating separate emotional models for each speaker, gender or age-cluster independently. The performances of the proposed approaches were compared against the baseline ER system, where no additional information has been used, on a number of emotional speech corpora of German, English, Japanese and Russian. The study revealed that for some of the corpora the proposed approach significantly outperforms the baseline methods with a relative difference of up to 11.9%.
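
The two schemes named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `augment` and `group_by_attribute`, the numeric gender coding, and the toy feature values are all assumptions made for illustration only.

```python
from collections import defaultdict


def augment(features, attribute_code):
    """Scheme 1: append the speaker attribute as one extra feature."""
    return list(features) + [float(attribute_code)]


def group_by_attribute(samples):
    """Scheme 2: partition the data so one model can be trained per group."""
    groups = defaultdict(list)
    for features, attribute, label in samples:
        groups[attribute].append((features, label))
    return dict(groups)


# Toy data: (acoustic features, gender code, emotion label).
# All values are invented for illustration.
samples = [
    ([0.2, 1.1], 0, "neutral"),
    ([0.9, 0.3], 1, "angry"),
    ([0.4, 1.0], 1, "neutral"),
]

# Scheme 1: a single classifier would be trained on these extended vectors.
augmented = [(augment(f, g), y) for f, g, y in samples]

# Scheme 2: a separate classifier would be trained on each partition.
per_group = group_by_attribute(samples)
```

Either output can then be passed to any standard classifier; the sketch deliberately stops before training because the abstract does not name one.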


Quality-adaptive Spoken Dialogue Initiative Selection And Implications On Reward Modelling
Stefan Ultes | Matthias Kraus | Alexander Schmitt | Wolfgang Minker
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue


Comparison of Gender- and Speaker-adaptive Emotion Recognition
Maxim Sidorov | Stefan Ultes | Alexander Schmitt
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Deriving the emotion of a human speaker is a hard task, especially if only the audio stream is taken into account. While state-of-the-art approaches already provide good results, adaptive methods have been proposed in order to further improve the recognition accuracy. A recent approach is to add characteristics of the speaker, e.g., the gender of the speaker. In this contribution, we argue that adding information unique to each speaker, i.e., by using speaker identification techniques, improves emotion recognition simply by adding this additional information to the feature vector of the statistical classification algorithm. Moreover, we compare this approach to emotion recognition that adds only the speaker gender, a non-unique speaker attribute. We justify this by performing adaptive emotion recognition using both gender and speaker information on four different corpora of different languages containing acted and non-acted speech. The final results show that adding speaker information significantly outperforms both adding gender information and solely using a generic speaker-independent approach.


On Quality Ratings for Spoken Dialogue Systems – Experts vs. Users
Stefan Ultes | Alexander Schmitt | Wolfgang Minker
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies


A Parameterized and Annotated Spoken Dialog Corpus of the CMU Let’s Go Bus Information System
Alexander Schmitt | Stefan Ultes | Wolfgang Minker
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Standardized corpora are the foundation for spoken language research. In this work, we introduce an annotated and standardized corpus in the Spoken Dialog Systems (SDS) domain. Data from the Let's Go Bus Information System of Carnegie Mellon University in Pittsburgh, comprising 347 dialogs with 9,083 system-user exchanges, has been formatted, parameterized and annotated with quality, emotion, and task success labels. A total of 46 parameters have been derived automatically and semi-automatically from Automatic Speech Recognition (ASR), Spoken Language Understanding (SLU) and Dialog Manager (DM) properties. Each spoken user utterance has been assigned an emotion label from the set {garbage, non-angry, slightly angry, very angry}. In addition, a manual annotation of Interaction Quality (IQ) on the exchange level has been performed with three raters achieving a Kappa value of 0.54. The IQ score expresses the quality of the interaction up to each system-user exchange on a scale from 1 to 5. The presented corpus is intended as a standardized basis for classification and evaluation tasks regarding task success prediction, dialog quality estimation or emotion recognition to foster comparability between different approaches in these fields.
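
For readers unfamiliar with agreement statistics, the following sketch computes Cohen's kappa for two raters. The abstract does not say which kappa variant was used for its three raters, so this is an assumption chosen for illustration; the function name `cohen_kappa` and the rating lists are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items.

    Undefined (division by zero) when chance agreement equals 1,
    i.e. when both raters always use the same single label.
    """
    n = len(ratings_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical IQ scores (1 to 5) from two raters on six exchanges.
rater_1 = [1, 2, 3, 4, 5, 5]
rater_2 = [1, 2, 3, 4, 5, 4]
kappa = cohen_kappa(rater_1, rater_2)
```

One common way to extend this to three raters is to average the pairwise values, though other variants such as Fleiss' kappa exist.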

Towards Quality-Adaptive Spoken Dialogue Management
Stefan Ultes | Alexander Schmitt | Wolfgang Minker
NAACL-HLT Workshop on Future directions and needs in the Spoken Dialog Community: Tools and Data (SDCTD 2012)


Modeling and Predicting Quality in Spoken Human-Computer Interaction
Alexander Schmitt | Benjamin Schatz | Wolfgang Minker
Proceedings of the SIGDIAL 2011 Conference


Advances in the Witchcraft Workbench Project
Alexander Schmitt | Wolfgang Minker | Nada Sharaf
Proceedings of the SIGDIAL 2010 Conference

WITcHCRafT: A Workbench for Intelligent exploraTion of Human ComputeR conversaTions
Alexander Schmitt | Gregor Bertrand | Tobias Heinroth | Wolfgang Minker | Jackson Liscombe
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present Witchcraft, an open-source framework for the evaluation of prediction models for spoken dialogue systems based on interaction logs and audio recordings. The use of Witchcraft is twofold: first, it provides an adaptable user interface to easily manage and browse thousands of logged dialogues (e.g. calls). Second, with the help of the underlying models and the connected machine learning framework RapidMiner, the workbench is able to display at each dialogue turn the probability of the task being completed based on the dialogue history. It also estimates the emotional state, gender and age of the user. While browsing through a logged conversation, the user can directly observe the prediction result of the models at each dialogue step. In this way, Witchcraft allows for spotting problematic dialogue situations and demonstrates where the current system and the prediction models have design flaws. Witchcraft will be made publicly available to the community and will be deployed as an open-source project.

The Influence of the Utterance Length on the Recognition of Aged Voices
Alexander Schmitt | Tim Polzehl | Wolfgang Minker | Jackson Liscombe
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper addresses the recognition of elderly callers based on short and narrow-band utterances, which are typical for Interactive Voice Response (IVR) systems. Our study is based on 2308 short utterances from a deployed IVR application. We show that features such as speaking rate, jitter and shimmer, which are considered the most meaningful for identifying elderly users, underperform in the IVR context, while pitch and intensity features seem to gain importance. We further demonstrate the influence of the utterance length on the classifier’s performance: for both humans and the classifier, the distinction between aged and non-aged voices becomes increasingly difficult the shorter the utterances get. Our setup based on a Support Vector Machine (SVM) with a linear kernel reaches a comparatively poor performance of 58% accuracy, which can be attributed to an average utterance length of only 1.6 seconds. The automatic distinction between aged and non-aged utterances drops to chance level when the utterance length falls below 1.2 seconds.

Efficient Spoken Dialogue Domain Representation and Interpretation
Tobias Heinroth | Dan Denich | Alexander Schmitt | Wolfgang Minker
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We provide a detailed look at the functioning of the OwlSpeak Spoken Dialogue Manager, which is part of the EU-funded project ATRACO. OwlSpeak interprets Spoken Dialogue Ontologies and on this basis generates VoiceXML dialogue snippets. The dialogue snippets can be interpreted by all speech servers that provide VoiceXML support and therefore make the dialogue management independent of the hosting systems providing speech recognition and synthesis. Ontologies are used within the framework of our prototype to represent specific spoken dialogue domains that can dynamically be broadened or tightened during an ongoing dialogue. We provide an exemplary dialogue encoded as an OWL model and explain how this model is interpreted by the dialogue manager. The combination of a unified model for dialogue domains and the strict model-view-controller architecture that underlies the dialogue manager leads to an efficient system that allows for a new way of spoken dialogue system development and can be used for further research on adaptive spoken dialogue strategies.


On NoMatchs, NoInputs and BargeIns: Do Non-Acoustic Features Support Anger Detection?
Alexander Schmitt | Tobias Heinroth | Jackson Liscombe
Proceedings of the SIGDIAL 2009 Conference