Stefan Scherer


Analysis of Behavior Classification in Motivational Interviewing
Leili Tavabi | Trang Tran | Kalin Stefanov | Brian Borsari | Joshua Woolley | Stefan Scherer | Mohammad Soleymani
Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access

Analysis of client and therapist behavior in counseling sessions can provide helpful insights for assessing the quality of the session and consequently, the client’s behavioral outcome. In this paper, we study the automatic classification of standardized behavior codes (annotations) used for assessment of psychotherapy sessions in Motivational Interviewing (MI). We develop models and examine the classification of client behaviors throughout MI sessions, comparing the performance of models trained on large pretrained embeddings (RoBERTa) versus interpretable and expert-selected features (LIWC). Our best performing model using the pretrained RoBERTa embeddings beats the baseline model, achieving an F1 score of 0.66 in the subject-independent 3-class classification. Through statistical analysis on the classification results, we identify prominent LIWC features that may not have been captured by the model using pretrained embeddings. Although classification using LIWC features underperforms RoBERTa, our findings motivate the future direction of incorporating auxiliary tasks in the classification of MI codes.


Unfolding the External Behavior and Inner Affective State of Teammates through Ensemble Learning: Experimental Evidence from a Dyadic Team Corpus
Aggeliki Vlachostergiou | Mark Dennison | Catherine Neubauer | Stefan Scherer | Peter Khooshabeh | Andre Harrison
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

What type of happiness are you looking for? - A closer look at detecting mental health from language
Alina Arseniev-Koehler | Sharon Mozgai | Stefan Scherer
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

Computational models to detect mental illnesses from text and speech could enhance our understanding of mental health while offering opportunities for early detection and intervention. However, these models are often disconnected from the lived experience of depression and the larger diagnostic debates in mental health. This article investigates these disconnects, primarily focusing on the labels used to diagnose depression, how these labels are computationally represented, and the performance metrics used to evaluate computational models. We also consider how medical instruments used to measure depression, such as the Patient Health Questionnaire (PHQ), contribute to these disconnects. To illustrate our points, we incorporate mixed-methods analyses of 698 interviews on emotional health, which are coupled with self-report PHQ screens for depression. We propose possible strategies to bridge these gaps between modern psychiatric understandings of depression, lay experience of depression, and computational representation.

A Linguistically-Informed Fusion Approach for Multimodal Depression Detection
Michelle Morales | Stefan Scherer | Rivka Levitan
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

Automated depression detection is inherently a multimodal problem. Therefore, it is critical that researchers investigate fusion techniques for multimodal design. This paper presents the first-ever comprehensive study of fusion techniques for depression detection. In addition, we present novel linguistically-motivated fusion techniques, which we find outperform existing approaches.

Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML)
Amir Zadeh | Paul Pu Liang | Louis-Philippe Morency | Soujanya Poria | Erik Cambria | Stefan Scherer
Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML)

Modeling Temporality of Human Intentions by Domain Adaptation
Xiaolei Huang | Lixing Liu | Kate Carey | Joshua Woolley | Stefan Scherer | Brian Borsari
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Categorizing patients’ intentions in conversational assessment can help decision making in clinical treatments. Many conversation corpora span a series of time stages. However, it is not clear how theme shifts in the conversation affect the performance of human intention categorization (e.g., patients might show different behaviors at the beginning versus the end). This paper proposes a method that models the temporal factor by using domain adaptation on clinical dialogue corpora from Motivational Interviewing (MI). We jointly deploy a Bi-LSTM and a topic model to learn language usage change across different time sessions. We conduct experiments on the MI corpora to show the promising improvement after considering temporality in the classification task.


A Cross-modal Review of Indicators for Depression Detection Systems
Michelle Morales | Stefan Scherer | Rivka Levitan
Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology — From Linguistic Signal to Clinical Reality

Automatic detection of depression has attracted increasing attention from researchers in psychology, computer science, linguistics, and related disciplines. As a result, promising depression detection systems have been reported. This paper surveys these efforts by presenting the first cross-modal review of depression detection systems and discusses best practices and most promising approaches to this task.

Affect-LM: A Neural Language Model for Customizable Affective Text Generation
Sayan Ghosh | Mathieu Chollet | Eugene Laksana | Louis-Philippe Morency | Stefan Scherer
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Human verbal communication includes affective messages which are conveyed through use of emotionally colored words. There has been a lot of research effort in this direction but the problem of integrating state-of-the-art neural language models with affective information remains an area ripe for exploration. In this paper, we propose an extension to an LSTM (Long Short-Term Memory) language model for generation of conversational text, conditioned on affect categories. Our proposed model, Affect-LM, enables us to customize the degree of emotional content in generated sentences through an additional design parameter. Perception studies conducted using Amazon Mechanical Turk show that Affect-LM can generate natural-looking emotional sentences without sacrificing grammatical correctness. Affect-LM also learns affect-discriminative word representations, and perplexity experiments show that additional affective information in conversational text can improve language model prediction.


A Multimodal Corpus for the Assessment of Public Speaking Ability and Anxiety
Mathieu Chollet | Torsten Wörtwein | Louis-Philippe Morency | Stefan Scherer
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The ability to efficiently speak in public is an essential asset for many professions and is used in everyday life. As such, tools enabling the improvement of public speaking performance and the assessment and mitigation of anxiety related to public speaking would be very useful. Multimodal interaction technologies, such as computer vision and embodied conversational agents, have recently been investigated for the training and assessment of interpersonal skills. One central requirement for these technologies is multimodal corpora for training machine learning models. This paper addresses the need of these technologies by presenting and sharing a multimodal corpus of public speaking presentations. These presentations were collected in an experimental study investigating the potential of interactive virtual audiences for public speaking training. This corpus includes audio-visual data and automatically extracted features, measures of public speaking anxiety and personality, annotations of participants’ behaviors and expert ratings of behavioral aspects and overall performance of the presenters. We hope this corpus will help other research teams in developing tools for supporting public speaking training.


The Distress Analysis Interview Corpus of human and computer interviews
Jonathan Gratch | Ron Artstein | Gale Lucas | Giota Stratou | Stefan Scherer | Angela Nazarian | Rachel Wood | Jill Boberg | David DeVault | Stacy Marsella | David Traum | Skip Rizzo | Louis-Philippe Morency
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The Distress Analysis Interview Corpus (DAIC) contains clinical interviews designed to support the diagnosis of psychological distress conditions such as anxiety, depression, and post traumatic stress disorder. The interviews are conducted by humans, human controlled agents and autonomous agents, and the participants include both distressed and non-distressed individuals. Data collected include audio and video recordings and extensive questionnaire responses; parts of the corpus have been transcribed and annotated for a variety of verbal and non-verbal features. The corpus has been used to support the creation of an automated interviewer agent, and for research on the automatic identification of psychological distress.


Verbal indicators of psychological distress in interactive dialogue with a virtual human
David DeVault | Kallirroi Georgila | Ron Artstein | Fabrizio Morbini | David Traum | Stefan Scherer | Albert Skip Rizzo | Louis-Philippe Morency
Proceedings of the SIGDIAL 2013 Conference


Vers un mesure automatique de l’adaptation prosodique en interaction conversationnelle (Automatic measurement of prosodic accommodation in conversational interaction) [in French]
Céline De Looze | Stefan Scherer | Brian Vaughan | Nick Campbell
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP

An audiovisual political speech analysis incorporating eye-tracking and perception data
Stefan Scherer | Georg Layher | John Kane | Heiko Neumann | Nick Campbell
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We investigate the influence of audiovisual features on the perception of speaking style and performance of politicians, utilizing a large publicly available dataset of German parliament recordings. We conduct a human perception experiment involving eye-tracker data to evaluate human ratings as well as behavior in two separate conditions, i.e. audiovisual and video only. The ratings are evaluated on a five dimensional scale comprising measures of insecurity, monotony, expressiveness, persuasiveness, and overall performance. Further, they are statistically analyzed and put into context in a multimodal feature analysis, involving measures of prosody, voice quality and motion energy. The analysis reveals several statistically significant features, such as pause timing, voice quality measures and motion energy, that highly positively or negatively correlate with certain human ratings of speaking style. Additionally, we compare the gaze behavior of the human subjects to evaluate saliency regions in the multimodal and visual only conditions. The eye-tracking analysis reveals significant changes in the gaze behavior of the human subjects; participants reduce their focus of attention in the audiovisual condition mainly to the region of the face of the politician and scan the upper body, including hands and arms, in the video only condition.


Developing an Expressive Speech Labeling Tool Incorporating the Temporal Characteristics of Emotion
Stefan Scherer | Ingo Siegert | Lutz Bigalke | Sascha Meudt
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

A lot of research effort has been spent on the development of emotion theories and modeling; however, their suitability and applicability to expressions in human computer interaction has not exhaustively been evaluated. Furthermore, proof of the ability of annotators to map certain expressions onto the developed emotion models is lacking. The proposed annotation tool, which incorporates the standard Geneva Emotional Wheel developed by Klaus Scherer and a novel temporal characteristic description feature, aims to enable the annotator to label expressions recorded in human computer interaction scenarios on an utterance level. Further, it respects key features of realistic and natural emotional expressions, such as their sequentiality, temporal characteristics, their mixed occurrences, and their expressivity or clarity of perception. Additionally, first steps towards evaluating the proposed tool, by analyzing utterance annotations taken from two expressive speech corpora, are undertaken and some future goals including the open source accessibility of the tool are given.

An Open Source Process Engine Framework for Realtime Pattern Recognition and Information Fusion Tasks
Volker Fritzsch | Stefan Scherer | Friedhelm Schwenker
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The process engine for pattern recognition and information fusion tasks, the pepr framework, aims to empower the researcher to develop novel solutions in the field of pattern recognition and information fusion tasks in a timely manner, by supporting reuse and combination of well tested and established components in an environment that eases the wiring of distinct algorithms and description of the control flow through graphical tooling. The framework, not only consisting of the runtime environment, comes with several highly useful components that can be leveraged as a starting point in creating new solutions, as well as a graphical process builder that allows for easy development of pattern recognition processes in a graphical, modeled manner. Additionally, considerable work has been invested in order to keep the entry barrier with regards to extending the framework as low as possible, enabling developers to add additional functionality to the framework in as little time as possible.

Evaluation of the PIT Corpus Or What a Difference a Face Makes?
Petra-Maria Strauß | Stefan Scherer | Georg Layher | Holger Hoffmann
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents the evaluation of the PIT Corpus of multi-party dialogues recorded in a Wizard-of-Oz environment. An evaluation has been performed with two different foci: First, a usability evaluation was used to take a look at the overall ratings of the system. A shortened version of the SASSI questionnaire, namely the SASSISV, and the well established AttrakDiff questionnaire assessing the hedonistic and pragmatic dimension of computer systems have been analysed. In a second evaluation, the user's gaze direction was analysed in order to assess the difference in the user's (gazing) behaviour if interacting with the computer versus the other dialogue partner. Recordings have been performed in different setups of the system, e.g. with and without avatar. Thus, the presented evaluation further focuses on the difference in the interaction caused by deploying an avatar. The quantitative analysis of the gazing behaviour has resulted in several encouraging significant differences. As a possible interpretation it could be argued that users are more attentive towards systems with an avatar - the difference a face makes.


The PIT Corpus of German Multi-Party Dialogues
Petra-Maria Strauß | Holger Hoffmann | Wolfgang Minker | Heiko Neumann | Günther Palm | Stefan Scherer | Harald Traue | Ulrich Weidenbacher
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The PIT corpus is a German multi-media corpus of multi-party dialogues recorded in a Wizard-of-Oz environment at the University of Ulm. The scenario involves two human dialogue partners interacting with a multi-modal dialogue system in the domain of restaurant selection. In this paper we present the characteristics of the data which was recorded in three sessions resulting in a total of 75 dialogues and about 14 hours of audio and video data. The corpus is available at

Emotion Recognition from Speech: Stress Experiment
Stefan Scherer | Hansjörg Hofmann | Malte Lampmann | Martin Pfeil | Steffen Rhinow | Friedhelm Schwenker | Günther Palm
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The goal of this work is to introduce an architecture to automatically detect the amount of stress in the speech signal close to real time. For this an experimental setup to record speech rich in vocabulary and containing different stress levels is presented. Additionally, an experiment explaining the labeling process with a thorough analysis of the labeled data is presented. Fifteen subjects were asked to play an air controller simulation that gradually induced more stress by becoming more difficult to control. During this game the subjects were asked to answer questions, which were then labeled by a different set of subjects in order to receive a subjective target value for each of the answers. A recurrent neural network was used to measure the amount of stress contained in the utterances after training. The neural network estimated the amount of stress at a frequency of 25 Hz and outperformed the human baseline.

A Flexible Wizard of Oz Environment for Rapid Prototyping
Stefan Scherer | Petra-Maria Strauß
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents a freely-available, and flexible Wizard of Oz environment for rapid prototyping. The system is designed to investigate the required features of a dialog system using the commonly used Wizard of Oz approach. The idea is that the time consuming design of such a tool can be avoided by using the provided architecture. The developers can easily adapt the database and extend the tool to the individual needs of the targeted dialog system. The tool is designed as a client-server architecture and provides efficient input features and versatile output types including voice, or an avatar as visual output. Furthermore, a scenario, namely restaurant selection, is introduced in order to give an example application for a dialog system.


Wizard-of-Oz Data Collection for Perception and Interaction in Multi-User Environments
Petra-Maria Strauß | Holger Hoffman | Wolfgang Minker | Heiko Neumann | Günther Palm | Stefan Scherer | Friedhelm Schwenker | Harald Traue | Welf Walter | Ulrich Weidenbacher
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we present the setup of an extensive Wizard-of-Oz environment used for the data collection and the development of a dialogue system. The envisioned Perception and Interaction Assistant will act as an independent dialogue partner. Passively observing the dialogue between the two human users with respect to a limited domain, the system should take the initiative and get meaningfully involved in the communication process when required by the conversational situation. The data collection described here involves audio and video data. We aim at building a rich multi-media data corpus to be used as a basis for our research which includes, inter alia, speech and gaze direction recognition, dialogue modelling and proactivity of the system. We further aspire to obtain data with emotional content to perform research on emotion recognition, psychophysiological and usability analysis.