Most systems that help to provide structured information and support opinion building interact with users without considering their individual interests. The scarce existing research on user interest in dialogue systems relies on explicit user feedback. Such systems require user responses that are not content-related and thus tend to disturb the dialogue flow. In this paper, we present a novel model for implicitly estimating user interest during argumentative dialogues based on semantically clustered data. To this end, an online user study was conducted to acquire training data, which was used to train a binary neural network classifier to predict whether or not users are still interested in the content of the ongoing dialogue. We achieved a classification accuracy of 74.9% and furthermore investigated with different Artificial Neural Networks (ANN) which new argument would fit the user's interest best.
Speech interfaces for argumentative dialogue systems (ADS) are rather scarce. The complex task they pursue hinders the application of common natural language understanding (NLU) approaches in this domain. To address this issue, we integrate an adaptation of a recently introduced NLU framework tailored to argumentative tasks into a complete ADS. We evaluate the likeability of the new system and users' motivation to interact with it in a user study, comparing it to a solid baseline utilizing a drop-down menu. The results indicate that the integration of a flexible NLU framework enables a far more natural and satisfying interaction with human users in real time. Even though the drop-down menu is convincing in terms of robustness, the willingness to use the new system is significantly higher. Hence, the featured NLU framework provides a sound basis for building an intuitive interface that can be extended to adapt its behavior to the individual user.
The growing popularity of various forms of Spoken Dialogue Systems (SDS) raises the demand for their capability of implicitly assessing the speaker's sentiment from speech alone. Mapping the latter onto user preferences enables the system to adapt to the user and individualize the requested information while increasing user satisfaction. In this paper, we explore the integration of rank-consistent ordinal regression into a speech-only sentiment prediction task performed by ResNet-like systems. Furthermore, we use speaker verification extractors trained on larger datasets as low-level feature extractors. We show a performance improvement by fusing sentiment and pre-extracted speaker embeddings, which reduces the speaker bias of sentiment predictions. Numerous experiments on the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) database show that we beat the baselines of state-of-the-art unimodal approaches. Using speech as the only modality, combined with optimizing an order-sensitive objective function, brings us significantly closer to the sentiment analysis results of state-of-the-art multimodal systems.
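As a hedged illustration of the rank-consistent ordinal regression mentioned above, the following sketch shows a CORAL-style output head that could sit on top of a ResNet-like speech encoder. It is not the authors' implementation; the embedding dimension, number of sentiment levels, and all names are assumptions.

```python
# Not the authors' implementation: a CORAL-style rank-consistent ordinal
# regression head for a ResNet-like speech encoder. Embedding dimension and
# number of sentiment levels are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoralHead(nn.Module):
    """K ordinal sentiment levels modelled as K-1 shared-weight binary tasks."""
    def __init__(self, embed_dim: int, num_levels: int):
        super().__init__()
        # One shared weight vector with K-1 independent biases guarantees
        # monotonically ordered (rank-consistent) probabilities.
        self.fc = nn.Linear(embed_dim, 1, bias=False)
        self.biases = nn.Parameter(torch.zeros(num_levels - 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, embed_dim)
        return self.fc(x) + self.biases                    # (batch, K-1) logits

def coral_loss(logits: torch.Tensor, extended_labels: torch.Tensor) -> torch.Tensor:
    # extended_labels[i, k] = 1 if the true level of example i exceeds threshold k.
    return F.binary_cross_entropy_with_logits(logits, extended_labels)

def predict_level(logits: torch.Tensor) -> torch.Tensor:
    # Predicted rank = number of thresholds whose probability exceeds 0.5.
    return (torch.sigmoid(logits) > 0.5).sum(dim=1)
```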
Robots will eventually enter our daily lives and assist with a variety of tasks. Especially in the household domain, robots may become indispensable helpers by taking over tedious tasks, e.g. keeping the place tidy. Their effectiveness and efficiency, however, depend on their ability to adapt to our needs, routines, and personal characteristics. Otherwise, they may not be accepted and trusted in our private domain. To enable adaptation, the interaction between a human and a robot needs to be personalized, which requires the robot to collect personal information from the user. However, it is unclear how such sensitive data can be collected in an understandable way without losing the user's trust in the system. In this paper, we present a conversational approach for explicitly collecting personal user information through natural dialogue. To create a sound interactive personalization, we have developed an empathy-augmented dialogue strategy. In an online study, the empathy-augmented strategy was compared to a baseline dialogue strategy for interactive personalization. We found the empathy-augmented strategy to be perceived as notably friendlier. Overall, using dialogue for interactive personalization was generally received positively by users.
To build a well-founded opinion, it is natural for humans to gather and exchange new arguments. Especially when confronted with an overwhelming amount of information, people tend to focus only on the part of the available information that fits their current beliefs or convenient opinions. To overcome this “self-imposed filter bubble” (SFB) in the information-seeking process, it is crucial to identify influential indicators for it. In this paper, we propose and investigate indicators for the user's SFB, mainly their Reflective User Engagement (RUE), their Personal Relevance (PR) ranking of content-related subtopics, as well as their False (FK) and True Knowledge (TK) on the topic. To this end, we analysed the answers of 202 participants in an online user study who interacted with our argumentative dialogue system BEA (“Building Engaging Argumentation”). Moreover, the influence of different input/output modalities (speech/speech and drop-down menu/text) on the interaction with regard to the suggested indicators was also investigated.
This work combines information about the dialogue history encoded by a pre-trained model with a meaning representation of the current system utterance to realise contextual language generation in task-oriented dialogues. We utilise the pre-trained multi-context ConveRT model for context representation in a model trained from scratch, and leverage the immediately preceding user utterance for contextual generation in a model adapted from the pre-trained GPT-2. Both experiments on the MultiWOZ dataset show that contextual information encoded by a pre-trained model improves the performance of response generation in both automatic metrics and human evaluation. Our contextual generator enables a higher variety of generated responses that fit better to the ongoing dialogue. Analysing the context size shows that a longer context does not automatically lead to better performance; rather, the immediately preceding user utterance plays an essential role for contextual generation. In addition, we propose a re-ranker for the GPT-based generation model. The experiments show that the responses selected by the re-ranker yield a significant improvement in automatic metrics.
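The following is a minimal sketch, not the trained model from the paper, of how GPT-2 generation can be conditioned on the immediately preceding user utterance together with a meaning representation of the system turn. The prompt format and separator strings are assumptions.

```python
# A hedged sketch of context-conditioned generation with GPT-2; the prompt
# layout and separators are assumptions, not the paper's fine-tuned setup.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def generate_response(user_utterance: str, meaning_repr: str) -> str:
    prompt = f"User: {user_utterance} [MR] {meaning_repr} [SYS]"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    output = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                            pad_token_id=tokenizer.eos_token_id)
    # Return only the newly generated system response.
    return tokenizer.decode(output[0][ids.shape[1]:], skip_special_tokens=True)
```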
This paper presents an automatic method to evaluate the naturalness of natural language generation in dialogue systems. While this task has previously been carried out through expensive and time-consuming human labour, we present the novel task of automatic naturalness evaluation of generated language. By fine-tuning the BERT model, our proposed naturalness evaluation method shows robust results and outperforms the baselines: support vector machines, bi-directional LSTMs, and BLEURT. In addition, the training speed and evaluation performance of the naturalness model are improved by transfer learning from linguistic knowledge about quality and informativeness.
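A hedged sketch of the core idea, fine-tuning BERT as a binary naturalness classifier over generated responses, is given below; the checkpoint name, label convention, and hyperparameters are assumptions, not the paper's exact setup.

```python
# A hedged sketch, not the paper's exact setup: fine-tuning BERT as a binary
# naturalness classifier. Checkpoint, labels, and learning rate are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)          # 0 = unnatural, 1 = natural
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(responses: list[str], labels: list[int]) -> float:
    """One fine-tuning step on a batch of generated responses."""
    batch = tokenizer(responses, padding=True, truncation=True,
                      return_tensors="pt")
    out = model(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```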
Despite the remarkable progress in the field of computational argumentation, dialogue systems concerned with argumentative tasks often rely on structured knowledge about arguments and their relations. Since the manual acquisition of these argument structures is highly time-consuming, the corresponding systems are inflexible regarding the topics they can discuss. To address this issue, we propose combining argumentative dialogue systems with argument search technology, which enables a system to discuss any topic on which the search engine is able to find suitable arguments. Our approach utilizes supervised learning-based relation classification to map the retrieved arguments into a general tree structure for use in dialogue systems. We evaluate the approach with a state-of-the-art search engine and a recently introduced dialogue model in an extensive user study with respect to dialogue coherence. The results vary between the investigated topics (and hence depend on the quality of the underlying data) but are in some instances surprisingly close to the results achieved with a manually annotated argument structure.
Recommendation systems aim at facilitating information retrieval for users by taking their preferences into account. Based on previous user behaviour, such a system suggests items or provides information that a user might like or find useful. Nonetheless, how to provide such suggestions is still an open question. The way a recommendation is communicated influences the user's perception of the system. This paper presents an empirical study on the effects of proactive dialogue strategies on user acceptance. To this end, an explicit strategy based on user preferences provided directly by the user and an implicit proactive strategy using autonomously gathered information are compared. The results show that proactive dialogue systems significantly affect the perception of human-computer interaction. Although no significant differences are found between implicit and explicit strategies, proactivity significantly influences the user experience compared to reactive system behaviour. The study contributes new insights to human-agent interaction and voice user interface design. Furthermore, we identify interesting tendencies that motivate future work.
Nowadays, Personal Assistants (PAs) are available in multiple environments and are becoming increasingly popular to use via voice. We therefore aim to provide proactive PA suggestions to car drivers via speech. These suggestions should be neither obtrusive nor increase the drivers' cognitive load, while enhancing user experience. To assess these factors, we conducted a usability study in which 42 participants perceived proactive voice output in a Wizard-of-Oz setup in a driving simulator. Traffic density was varied during a highway drive, which included six in-car-specific use cases. The latter were presented by a proactive voice assistant and in a non-proactive control condition. We assessed the users' subjective cognitive load and their satisfaction with different questionnaires during the interaction with both PA variants. Furthermore, we analyzed the user reactions, both regarding their content and the elapsed response times to PA actions. The results show that proactive assistant behaviour is rated as positively as non-proactive behaviour. Furthermore, the participants agreed to 73.8% of proactive suggestions. In line with previous research, driving-relevant use cases received the best ratings, reaching 82.5% acceptance here. Finally, the users reacted significantly faster to proactive PA actions, which we interpret as indicating lower cognitive load compared to non-proactive behaviour.
We present an approach to evaluating argument search techniques in view of their use in argumentative dialogue systems by assessing quality aspects of the retrieved arguments. To this end, we introduce a dialogue system that presents arguments to users by means of a virtual avatar and synthetic speech and allows them to rate the presented content in four different categories (Interesting, Convincing, Comprehensible, Relation). The approach is applied in a user study in order to compare two state-of-the-art argument search engines to each other and to a system based on traditional web search. The results show a significant advantage of the two search engines over the baseline. Moreover, the two search engines show significant advantages over each other in different categories, thereby reflecting the strengths and weaknesses of the different underlying techniques.
We present a neural network approach to estimate the communication style of spoken interaction, namely the stylistic variations elaborateness and directness, and investigate which types of input features are necessary for the estimator to achieve good performance. First, we describe our annotated corpus of recordings in the health care domain and analyse the corpus statistics in terms of agreement, correlation, and reliability of the ratings. We use this corpus to estimate the elaborateness and the directness of each utterance. We test different feature sets consisting of dialogue act features, grammatical features, and linguistic features as input for our classifier and perform classification into two and three classes. Our classifiers use only features that can be derived automatically during an ongoing interaction in any spoken dialogue system without any prior annotation. Our results show that elaborateness can be classified using only the dialogue act and the number of words contained in the corresponding utterance. Directness is a more difficult classification task, and additional linguistic features in the form of word embeddings improve the classification results. Afterwards, we run a comparison with a support vector machine and a recurrent neural network classifier.
Paraphrasing is an important aspect of natural-language generation that can produce more variety in the way specific content is presented. Traditionally, paraphrasing has focused on finding different words that convey the same meaning. However, in human-human interaction, we regularly express our intention with phrases that are vastly different regarding both word content and syntactic structure. Instead of exchanging only individual words, the complete surface realisation of a sentence is altered while still preserving its meaning and function in a conversation. This kind of contextual paraphrasing has not yet received much attention from the scientific community, despite its potential for the creation of more varied dialogues. In this work, we evaluate several existing approaches to sentence encoding with regard to their ability to capture such context-dependent paraphrasing. To this end, we define a paraphrase classification task that incorporates contextual paraphrases, perform dialogue act clustering, and determine the performance of the sentence embeddings in a sentence swapping task.
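One way to probe sentence encodings for paraphrase detection is to score candidate pairs by embedding cosine similarity, as sketched below under the assumption of an off-the-shelf sentence-transformers model and an arbitrary threshold; this is an illustration, not the evaluation protocol of the paper.

```python
# An illustration, not the paper's protocol: scoring candidate paraphrase
# pairs by the cosine similarity of their sentence embeddings. The embedding
# model name and the threshold are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def is_paraphrase(sentence_a: str, sentence_b: str, threshold: float = 0.7) -> bool:
    emb = model.encode([sentence_a, sentence_b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

print(is_paraphrase("Could you book me a table for two?",
                    "I would like a reservation for two people."))
```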
Acoustic addressee detection (AD) is a modern paralinguistic and dialogue challenge that especially arises in voice assistants. In the present study, we distinguish addressees in two settings (a conversation between several people and a spoken dialogue system, and a conversation between several adults and a child) and introduce the first competitive baseline (unweighted average recall of 0.891) for the Voice Assistant Conversation Corpus, which models the first setting. We jointly solve both classification problems using three models: a linear support vector machine dealing with acoustic functionals and two neural networks utilising raw waveforms alongside acoustic low-level descriptors. We investigate how different corpora influence each other, applying the mixup approach to data augmentation. We also study the influence of various acoustic context lengths on AD. Two-second speech fragments turn out to be sufficient for reliable AD. Mixup is shown to be beneficial for merging acoustic data (extracted features but not raw waveforms) from different domains, which allows us to reach a higher classification performance on human-machine AD, and also for training a multipurpose neural network that is capable of solving both the human-machine and adult-child AD problems.
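The mixup augmentation referred to above can be sketched as follows for pre-extracted acoustic feature vectors, the setting in which it is reported to help; the interpolation parameter and data layout are assumptions.

```python
# A hedged sketch of mixup applied to pre-extracted acoustic feature vectors
# and one-hot labels. Array shapes and the alpha parameter are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def mixup(features: np.ndarray, labels: np.ndarray, alpha: float = 0.2):
    """Convex-combine each example with a randomly chosen partner.

    features: (n, d) feature matrix; labels: (n, k) one-hot label matrix.
    """
    lam = rng.beta(alpha, alpha)
    idx = rng.permutation(len(features))
    mixed_x = lam * features + (1.0 - lam) * features[idx]
    mixed_y = lam * labels + (1.0 - lam) * labels[idx]
    return mixed_x, mixed_y
```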
For estimating the Interaction Quality (IQ) in Spoken Dialogue Systems (SDS), the dialogue history is of significant importance. Previous works included this information manually, in the form of precomputed temporal features, in the classification process. Here, we employ a deep learning architecture based on Long Short-Term Memories (LSTM) to extract this information automatically from the data, thus estimating IQ solely from current exchange features. We show that it is thereby possible to achieve results competitive with a scenario in which manually optimized temporal features have been included.
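A minimal sketch of such an LSTM-based IQ estimator is shown below; the feature dimensionality, hidden size, and five-class output are assumptions rather than the paper's configuration.

```python
# A minimal sketch of an LSTM-based IQ estimator; feature dimensionality,
# hidden size, and the five-class output are assumptions.
import torch
import torch.nn as nn

class IQEstimator(nn.Module):
    """Consumes per-exchange feature vectors; the recurrent state replaces
    hand-crafted temporal features."""
    def __init__(self, feat_dim: int = 46, hidden: int = 64, num_classes: int = 5):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, exchanges: torch.Tensor) -> torch.Tensor:
        # exchanges: (batch, n_exchanges_so_far, feat_dim)
        h, _ = self.lstm(exchanges)
        return self.out(h[:, -1])        # IQ class logits for the current exchange
```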
In a dialogue system, the dialogue manager selects one of several system actions and thereby determines the system's behaviour. Defining all possible system actions in a dialogue system by hand is tedious work. While efforts have been made to automatically generate such system actions, those approaches mostly focus on providing functional system behaviour. Adapting the system behaviour to the user becomes a difficult task due to the limited number of system actions available. We aim to increase the adaptability of a dialogue system by automatically generating variants of system actions. In this work, we introduce an approach to automatically generate action variants for elaborateness and indirectness. Our proposed algorithm extracts RDF triples from a knowledge base and rates their relevance to the original system action to find suitable content. We show that the results of our algorithm are mostly perceived similarly to human-generated elaborateness and indirectness and can be used to adapt a conversation to the current user and situation. We also discuss where the results of our algorithm are still lacking and how this could be improved: taking into account the conversation topic as well as the culture of the user is likely to have a beneficial effect on the user's perception.
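As a rough, hedged sketch of the general idea rather than the proposed algorithm, the snippet below pulls RDF triples for an entity from a knowledge base with rdflib and rates them against the original system action with a toy token-overlap score; the file name, entity handling, and scoring function are placeholders.

```python
# A hedged sketch of triple retrieval and relevance rating; not the paper's
# algorithm. The knowledge base file, entity URI, and score are placeholders.
from rdflib import Graph, URIRef

graph = Graph()
graph.parse("knowledge_base.ttl")        # placeholder knowledge base file

def candidate_triples(entity_uri: str):
    """All triples that have the given entity as their subject."""
    return list(graph.triples((URIRef(entity_uri), None, None)))

def rate_relevance(triple, system_action: str) -> float:
    """Toy relevance score: token overlap between the triple and the action."""
    triple_tokens = set(" ".join(str(part) for part in triple).lower().split())
    action_tokens = set(system_action.lower().split())
    return len(triple_tokens & action_tokens) / max(len(action_tokens), 1)
```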
Emotion Recognition (ER) is an important part of dialogue analysis that can be used to improve the quality of Spoken Dialogue Systems (SDSs). The emotional hypothesis for the current response of an end-user might be utilised by the dialogue manager component to change the SDS strategy, which could result in a quality enhancement. In this study, additional speaker-related information is used to improve the performance of the speech-based ER process. The analysed information comprises the speaker identity, gender, and age of a user. Two schemes are described here, namely using the additional information as an independent variable within the feature vector and creating separate emotional models for each speaker, gender, or age cluster independently. The performance of the proposed approaches was compared against the baseline ER system, where no additional information has been used, on a number of emotional speech corpora of German, English, Japanese, and Russian. The study revealed that for some of the corpora the proposed approach significantly outperforms the baseline methods, with a relative difference of up to 11.9%.
The paper describes a comparative study of existing and novel text preprocessing and classification techniques for domain detection of user utterances. Two corpora are considered. The first one contains customer calls to a call centre for further call routing; the second one contains answers of call centre employees with different kinds of customer orientation behaviour. Seven different unsupervised and supervised term weighting methods were applied. The collective use of term weighting methods is proposed to improve classification effectiveness. Four different dimensionality reduction methods were applied: stop-word filtering with stemming, feature selection based on term weights, feature transformation based on term clustering, and a novel feature transformation method based on terms belonging to classes. As classification algorithms we used k-NN and an SVM-based algorithm. The numerical experiments have shown that the simultaneous use of the proposed approaches (collectives of term weighting methods and the novel feature transformation method) allows reaching high classification performance with a very small number of features.
While Spoken Dialogue Systems have gained importance in recent years, most systems deployed in the real world are still static and error-prone. To overcome this, the user is put into the focus of dialogue management. Hence, this work presents an approach for adapting the course of the dialogue to Interaction Quality (IQ), an objective variant of user satisfaction. In general, rendering the dialogue adaptive to user satisfaction enables the dialogue system to improve the course of the dialogue and to handle problematic situations better. In this contribution, we present a pilot study of quality-adaptive dialogue. By selecting the confirmation strategy based on the current IQ value, the course of the dialogue is adapted in order to improve the overall user experience. In a user experiment comparing three different confirmation strategies in a train booking domain, the adaptive strategy performs successfully and is among the two best-rated strategies in terms of overall user experience.
Automated emotion recognition has a number of applications in Interactive Voice Response systems, call centres, etc. While employing existing feature sets and methods for automated emotion recognition has already achieved reasonable results, there is still considerable room for improvement. Meanwhile, the optimal feature set for representing speech signals in speech-based emotion recognition is still an open question. In our research, we tried to identify the most essential features using a self-adaptive multi-objective genetic algorithm as a feature selection technique and a probabilistic neural network as a classifier. The proposed approach was evaluated using a number of multi-language databases (English, German), which were represented by 37- and 384-dimensional feature sets. According to the obtained results, the developed technique allows increasing the emotion recognition performance by up to 26.08% relative improvement in accuracy. Moreover, the emotion recognition performance scores for all applied databases are improved.
In this paper, we present three approaches towards adaptive speech understanding. The target system is a model-based Adaptive Spoken Dialogue Manager, the OwlSpeak ASDM. We enhanced this system in order to react properly to non-understandings in real-life situations where intuitive communication is required. OwlSpeak provides a model-based spoken interface to an Intelligent Environment, depending on and adapting to the current context. It utilises a set of ontologies used as dialogue models that can be combined dynamically during runtime. Besides the benefits the system showed in practice, real-life evaluations also revealed some limitations of the model-based approach. Since it is unfeasible to model all variations of the communication between the user and the system beforehand, various situations in which the system did not correctly understand the user input have been observed. Thus, we present three enhancements towards a more sophisticated use of the ontology-based dialogue models and show how grammars may be adapted dynamically in order to understand intuitive user utterances. The evaluation of our approaches revealed the incorporation of a lexical-semantic knowledge base into the recognition process to be the most promising approach.
In this work, we show that there is a need for using multimodal resources during human-computer interaction (HCI) in intelligent systems. We propose that not only creating multimodal output for the user is important, but also taking multimodal input resources into account for the decision of when and how to interact. Especially the use of multimodal input resources for deciding when and how to provide assistance in HCI is important. The use of assistive functionalities, such as providing adaptive explanations to keep the user motivated and cooperative, is more than a side effect and demands a closer look. In this paper, we introduce our approach to using multimodal input resources in an adaptive and generic explanation pipeline. We do not only concentrate on using explanations as a way to manage user knowledge, but also to maintain the cooperativeness, trust, and motivation of the user to continue a healthy and well-structured HCI.
Standardized corpora are the foundation for spoken language research. In this work, we introduce an annotated and standardized corpus in the Spoken Dialog Systems (SDS) domain. Data from the Let's Go Bus Information System of Carnegie Mellon University in Pittsburgh has been formatted, parameterized, and annotated with quality, emotion, and task success labels; the corpus contains 347 dialogs with 9,083 system-user exchanges. A total of 46 parameters have been derived automatically and semi-automatically from Automatic Speech Recognition (ASR), Spoken Language Understanding (SLU), and Dialog Manager (DM) properties. To each spoken user utterance an emotion label from the set {garbage, non-angry, slightly angry, very angry} has been assigned. In addition, a manual annotation of Interaction Quality (IQ) on the exchange level has been performed with three raters, achieving a Kappa value of 0.54. The IQ score expresses the quality of the interaction up to each system-user exchange on a scale from 1 to 5. The presented corpus is intended as a standardized basis for classification and evaluation tasks regarding task success prediction, dialog quality estimation, or emotion recognition, to foster comparability between different approaches in these fields.
In this paper, we investigated differences in the language use of speakers with different verbal intelligence when they describe the same event. The work is based on a corpus containing descriptions of a short film and the verbal intelligence scores of the speakers. For analysing the monologues and the film transcript, the number of reused words, lemmas, n-grams, cosine similarity, and other features were calculated and compared across different verbal intelligence groups. The results showed that the similarity of monologues of higher verbal intelligence speakers was greater than that of lower and average verbal intelligence participants. A possible explanation of this phenomenon is that candidates with higher verbal intelligence have a good short-term memory. In this paper, we also tested the hypothesis that differences in the vocabulary of speakers with different verbal intelligence are sufficient for good classification results. To test this hypothesis, a Nearest Neighbour classifier was trained using TF-IDF vocabulary measures. The maximum achieved accuracy was 92.86%.
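The described classification setup can be sketched, under assumed placeholder data, as a TF-IDF representation of the transcribed monologues fed into a nearest-neighbour classifier:

```python
# A sketch with placeholder data: TF-IDF features of transcribed monologues
# fed to a nearest-neighbour classifier separating verbal intelligence groups.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

texts = ["... transcribed monologue of speaker 1 ...",
         "... transcribed monologue of speaker 2 ..."]
groups = ["higher", "lower"]             # verbal intelligence group labels

classifier = make_pipeline(TfidfVectorizer(),
                           KNeighborsClassifier(n_neighbors=1))
classifier.fit(texts, groups)
print(classifier.predict(["... transcribed monologue of a new speaker ..."]))
```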
In this work, we investigated whether there is a relationship between the dominant behaviour of dialogue participants and their verbal intelligence. The analysis is based on a corpus containing 56 dialogues and the verbal intelligence scores of the test persons. All dialogues were divided into three groups: H-H is the group of dialogues between higher verbal intelligence participants, L-L is the group of dialogues between lower verbal intelligence participants, and L-H is the group of all other dialogues. The dominance scores of the dialogue partners from each group were analysed. The analysis showed that for the L-L group, differences in dominance scores and verbal intelligence coefficients were positively correlated. The verbal intelligence scores of the test persons were compared to other features that may reflect dominant behaviour. The analysis showed that the number of interruptions, long utterances, times the floor was grabbed, the influence diffusion model, the number of agreements, and several acoustic features may be related to verbal intelligence. These features were used for the automatic classification of the dialogue partners into two groups (lower and higher verbal intelligence participants); the achieved accuracy was 89.36%.
A syllable-based language model reduces the lexicon size by hundreds of times. It is especially beneficial in the case of highly inflective languages like Russian due to the abundance of word forms arising from various grammatical categories. However, the main challenge that arises is the concatenation of recognised syllables into the originally spoken sentence or phrase, particularly in the presence of syllable recognition mistakes. Natural fluent speech does not usually carry clear information about the outer borders of the spoken words. In this paper, a method for syllable concatenation and error correction is suggested and tested. It is based on a co-evolutionary asymptotic probabilistic genetic algorithm designed to determine the most likely sentence corresponding to the recognised chain of syllables within an acceptable time frame. The advantage of this genetic algorithm modification is the minimal number of settings that have to be adjusted manually compared to the standard algorithm. The data used for acoustic and language modelling are also described. A special issue is the preprocessing of the textual data, particularly the handling of abbreviations and of Arabic and Roman numerals, since their inflection mostly depends on the context and grammar.
We present Witchcraft, an open-source framework for the evaluation of prediction models for spoken dialogue systems based on interaction logs and audio recordings. The use of Witchcraft is twofold: first, it provides an adaptable user interface to easily manage and browse thousands of logged dialogues (e.g. calls). Second, with the help of the underlying models and the connected machine learning framework RapidMiner, the workbench is able to display at each dialogue turn the probability of the task being completed based on the dialogue history. It also estimates the emotional state, gender, and age of the user. While browsing through a logged conversation, the user can directly observe the prediction results of the models at each dialogue step. Thereby, Witchcraft allows for spotting problematic dialogue situations and demonstrates where the current system and the prediction models have design flaws. Witchcraft will be made publicly available to the community and will be deployed as an open-source project.
This paper addresses the recognition of elderly callers based on short and narrow-band utterances, which are typical for Interactive Voice Response (IVR) systems. Our study is based on 2,308 short utterances from a deployed IVR application. We show that features such as speaking rate, jitter, and shimmer, which are considered the most meaningful ones for determining elderly users, underperform when used in the IVR context, while pitch and intensity features seem to gain importance. We further demonstrate the influence of the utterance length on the classifier's performance: for both humans and the classifier, the distinction between aged and non-aged voices becomes increasingly difficult the shorter the utterances get. Our setup, based on a Support Vector Machine (SVM) with a linear kernel, reaches a comparably poor performance of 58% accuracy, which can be attributed to an average utterance length of only 1.6 seconds. The automatic distinction between aged and non-aged utterances drops to chance level when the utterance length falls below 1.2 seconds.
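A minimal sketch of this classification setup, with random placeholder data standing in for the per-utterance acoustic features (e.g. pitch and intensity statistics), might look as follows; shapes and labels are assumptions.

```python
# A minimal sketch with placeholder data: a linear-kernel SVM separating aged
# from non-aged voices based on per-utterance acoustic feature vectors.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 20))                # per-utterance acoustic feature vectors
y = rng.integers(0, 2, 200)              # 1 = aged voice, 0 = non-aged

classifier = make_pipeline(StandardScaler(), SVC(kernel="linear"))
classifier.fit(X, y)
print(classifier.score(X, y))            # training accuracy on the toy data
```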
We provide a detailed look at the functioning of the OwlSpeak Spoken Dialogue Manager, which is part of the EU-funded project ATRACO. OwlSpeak interprets Spoken Dialogue Ontologies and on this basis generates VoiceXML dialogue snippets. These dialogue snippets can be interpreted by any speech server that provides VoiceXML support, which makes the dialogue management independent of the hosting systems providing speech recognition and synthesis. Within our prototype, ontologies are used to represent specific spoken dialogue domains that can be dynamically broadened or tightened during an ongoing dialogue. We provide an exemplary dialogue encoded as an OWL model and explain how this model is interpreted by the dialogue manager. The combination of a unified model for dialogue domains and the strict model-view-controller architecture underlying the dialogue manager leads to an efficient system that allows for a new way of spoken dialogue system development and can be used for further research on adaptive spoken dialogue strategies.
We describe an experimental Wizard-of-Oz setup for the integration of emotional strategies into spoken dialogue management. With this setup, we seek to evaluate different approaches to emotional dialogue strategies in human-computer interaction with a spoken dialogue system. The study aims to analyse what kinds of emotional strategies work best in spoken dialogue management, especially in the face of the problem that users may not be honest about their emotions. Therefore, both direct evidence (the user is asked about their state) and indirect evidence (measurements of psychophysiological features) are considered for the evaluation of our strategies.
The goal of our research is the development of algorithms for the automatic estimation of a person's verbal intelligence based on the analysis of transcribed spoken utterances. In this paper, we present a corpus of German native speakers' monologues and dialogues about the same topics, collected at the University of Ulm, Germany. The monologues were descriptions of two short films; the dialogues were discussions about problems of German education. The corpus contains the verbal intelligence quotient of each speaker, measured with the Hamburg Wechsler Intelligence Test for Adults. We describe our corpus, why we decided to create it, and how it was collected. We also describe some approaches that can be applied to the transcribed spoken utterances to extract different features that could correlate with a person's verbal intelligence. The data corpus consists of 71 monologues and 30 dialogues (about 10 hours of audio data).
The PIT corpus is a German multi-media corpus of multi-party dialogues recorded in a Wizard-of-Oz environment at the University of Ulm. The scenario involves two human dialogue partners interacting with a multi-modal dialogue system in the domain of restaurant selection. In this paper we present the characteristics of the data which was recorded in three sessions resulting in a total of 75 dialogues and about 14 hours of audio and video data. The corpus is available at http://www.uni-ulm.de/in/pit.
A stochastic parsing component has been applied to a French spoken language dialogue corpus recorded in the framework of the MEDIA evaluation campaign. Realized as an ergodic HMM using Viterbi decoding, the parser outputs the most likely semantic representation given a transcribed utterance as input. The semantic sequences used for training and testing have been derived from the semantic representations of the MEDIA corpus. The HMM parameters have been estimated from the word sequences along with their semantic representations. The performance score of the stochastic parser has been determined automatically using the mediaval tool applied to a held-out reference corpus. Evaluation results are presented in the paper.
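As an illustration of the decoding step only, the compact Viterbi implementation below recovers the most likely sequence of semantic concepts (hidden states) for a word sequence (observations); the probability tables are assumed inputs, not estimates from the MEDIA corpus.

```python
# A compact Viterbi decoder as a generic illustration of HMM decoding:
# states are semantic concepts, observations are words. Probability tables
# (start_p, trans_p, emit_p) are assumed inputs, not MEDIA estimates.
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the given observations."""
    V = [{s: start_p[s] * emit_p[s].get(observations[0], 1e-9) for s in states}]
    paths = {s: [s] for s in states}
    for t in range(1, len(observations)):
        V.append({})
        new_paths = {}
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s].get(observations[t], 1e-9), p)
                for p in states)
            V[t][s] = prob
            new_paths[s] = paths[prev] + [s]
        paths = new_paths
    best_state = max(states, key=lambda s: V[-1][s])
    return paths[best_state]
```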
In this paper, we present the setup of an extensive Wizard-of-Oz environment used for data collection and the development of a dialogue system. The envisioned Perception and Interaction Assistant will act as an independent dialogue partner. Passively observing the dialogue between the two human users with respect to a limited domain, the system should take the initiative and get meaningfully involved in the communication process when required by the conversational situation. The data collection described here involves audio and video data. We aim at building a rich multi-media data corpus to be used as a basis for our research, which includes, inter alia, speech and gaze direction recognition, dialogue modelling, and proactivity of the system. We further aspire to obtain data with emotional content to perform research on emotion recognition as well as psychophysiological and usability analysis.