Elisabeth Andre

Also published as: Elisabeth André


2026

We introduce MUDiC, a novel dataset of task-based multi-user interactions with chatbots. Unlike most traditional dialogue corpora, which focus on one-to-one human–chatbot exchanges, this dataset captures conversations in which two human participants engage with a single system. The data include diverse conversational contexts such as shared group tasks, user intents, and mechanisms for dealing with off-topic talk. MUDiC consists of 1,689 dialogue exchanges between 20 groups and the chatbot. Each session is annotated with user IDs, interaction turns, intents, and dialogue acts, enabling an analysis of group conversational dynamics. The dataset thus aims to support tasks such as multi-user dialogue modelling, intent disambiguation, and moderation behaviour, all of which are relevant factors in the design of socially aware chatbots.
Teacher-parent conversations are critical for student success, yet teachers often lack structured training in counseling communication skills. We present the first annotated corpus of teacher-parent counseling conversations consisting of 59 German dialogues (approximately 6k sentences, 21k annotations) simulated by prospective elementary school teachers, peers, and professional actors. The corpus features theory-grounded annotations for conversational phases (Beginning, Informational, Argumentative, Decision-Making, Concluding) and communication techniques (Paraphrasing, Verbalizing, Structuring). We provide detailed annotation guidelines operationalizing established counseling pedagogy frameworks for computational analysis. Inter-annotator agreement analysis reveals substantial agreement (Fleiss’ κ = 0.669 to 0.724, Krippendorff’s α = 0.666 to 0.735). Our analysis reveals confusion patterns, providing insights into counseling discourse structure. Baseline experiments with BERT-based models and open-source LLMs achieve F1 scores of up to 71% depending on task and model. The corpus, guidelines, and baseline code are publicly available under a CC BY-NC-SA 4.0 license, enabling research on automated dialogue analysis and AI-based training tools for teacher education.
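Agreement coefficients like the Fleiss’ κ values reported above can be computed directly from per-item category counts. The following is an illustrative, self-contained sketch of Fleiss’ κ; the toy ratings are invented for demonstration and are not drawn from the corpus:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item rating dicts {category: count}.

    Every item must be rated by the same number of annotators n.
    """
    n_items = len(ratings)
    n = sum(ratings[0].values())  # annotators per item
    # Mean observed agreement across items
    p_bar = sum(
        (sum(c * c for c in item.values()) - n) / (n * (n - 1))
        for item in ratings
    ) / n_items
    # Expected agreement from marginal category proportions
    totals = Counter()
    for item in ratings:
        totals.update(item)
    p_e = sum((t / (n_items * n)) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement among 3 annotators on 2 items yields kappa = 1.0
print(fleiss_kappa([{"A": 3}, {"B": 3}]))  # → 1.0
```

Krippendorff’s α generalizes this idea to missing ratings and other distance metrics, which is why corpora often report both.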
The increasing application of Large Language Models (LLMs) in everyday tasks and at work highlights the crucial importance of trust in human-AI collaboration, particularly when an AI system fails. This paper investigates the effectiveness of failure communication strategies for trust repair in collaborative physical tasks involving a chat-based AI assistant. A controlled experiment in which participants built LEGO cars guided by an LLM-based AI assistant was used to evaluate whether findings on trust repair from virtual environments, such as chatbots, translate to an environment comprising tangible tasks, and whether the timing of trust repair influences the outcome. Results indicate that actively communicating mistakes significantly improves trust compared to a no-repair strategy, and that early repair tends to be more effective, although failure communication is important for an appropriate calibration of trust regardless of its timing.

2025

In today’s assistant landscape, personalisation enhances interactions, fosters long-term relationships, and deepens engagement. However, many systems struggle with retaining user preferences, leading to repetitive user requests and disengagement. Furthermore, the unregulated and opaque extraction of user preferences in industry applications raises significant concerns about privacy and trust, especially in regions with stringent regulations such as Europe. In response to these challenges, we propose a long-term memory system for voice assistants, structured around predefined categories. This approach leverages Large Language Models to efficiently extract, store, and retrieve preferences within these categories, ensuring both personalisation and transparency. We also introduce a synthetic multi-turn, multi-session conversation dataset (CarMem), grounded in real industry data and tailored to an in-car voice assistant setting. Benchmarked on this dataset, our system achieves an F1-score of .78 to .95 in preference extraction, depending on category granularity. Our maintenance strategy reduces redundant preferences by 95% and contradictory ones by 92%, while optimal retrieval reaches an accuracy of .87. Collectively, the results demonstrate the system’s suitability for industrial applications.
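Extraction F1 of the kind reported above is commonly computed by comparing the set of extracted preference tuples against a gold set. The following is a minimal illustrative sketch (the tuple format and example preferences are hypothetical, not the CarMem schema):

```python
def extraction_f1(predicted, gold):
    """Set-based precision, recall, and F1 over extracted preference tuples."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)  # tuples both extracted and in the gold set
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One of two extracted preferences matches the gold set: P = R = F1 = 0.5
pred = [("music", "jazz"), ("temperature", "21C")]
gold = [("music", "jazz"), ("seat", "heated")]
print(extraction_f1(pred, gold))  # → (0.5, 0.5, 0.5)
```

Coarser category granularity makes matches easier, which is consistent with the reported F1 range varying with granularity.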
Dialogue agents become more engaging through recipient design, which requires user-specific information. However, a user’s identification with marginalized communities, such as a migration or disability background, can elicit biased language. This study compares LLM responses to neurodivergent user personas with disclosed vs. masked neurodivergent identities. A dataset built from public Instagram comments was used to evaluate four open-source models on story generation, dialogue generation, and retrieval-augmented question answering. Our analyses show biases in users’ identity construction across all models and tasks. Binary classifiers trained on each model can distinguish between language generated for prompts with or without self-disclosures, with stronger biases linked to more explicit disclosures. Some models’ safety mechanisms result in denial-of-service behaviors. LLMs’ recipient design for neurodivergent identities relies on stereotypes tied to neurodivergence.

2021

This paper presents an overview of AVASAG, an ongoing applied-research project developing a text-to-sign-language translation system for public services. We describe the scientific innovation points (geometry-based SL description, 3D animation and video corpus, simplified annotation scheme, motion-capture strategy) and the overall translation pipeline.
Human-AI collaboration, a long-standing goal in AI, refers to a partnership in which a human and an artificial intelligence work together towards a shared goal. Collaborative dialog allows human-AI teams to communicate and leverage strengths from both partners. To design collaborative dialog systems, it is important to understand what mental models users form about their AI dialog partners; however, how users perceive these systems is not fully understood. In this study, we designed a novel, collaborative, communication-based puzzle game and explanatory dialog system. We created a public corpus from 117 conversations and post-surveys and used it to analyze what mental models users formed. Key takeaways include: even when users were not engaged in the game, they perceived the AI dialog partner as intelligent and likeable, implying they saw it as a partner separate from the game. This was further supported by users often overestimating the system’s abilities and projecting human-like attributes onto it, which led to miscommunications. We conclude that creating shared mental models between users and AI systems is important for achieving successful dialogs. We propose that our insights on mental models and miscommunication, the game, and our corpus provide useful tools for designing collaborative dialog systems.

2018

Humor is an important aspect of human interaction: it regulates conversations and increases interpersonal attraction and trust. For social robots, humor is one way to make interactions more natural and enjoyable and to increase credibility and acceptance. In combination with appropriate non-verbal behavior, natural language generation offers the ability to create content on the fly. This work outlines the building blocks for providing an individual, multimodal interaction experience by shaping the robot’s humor with the help of natural language generation and reinforcement learning based on human social signals.

2006

Feature extraction is still a disputed issue for the recognition of emotions from speech. Differences in features for male and female speakers are a well-known problem, and it is established that gender-dependent emotion recognizers perform better than gender-independent ones. We propose a way to improve the discriminative quality of gender-dependent features: the emotion recognition system is preceded by an automatic gender detection that decides which of two gender-dependent emotion classifiers is used to classify an utterance. This framework was tested on two different databases, one with emotional speech produced by actors and one with spontaneous emotional speech from a Wizard-of-Oz setting. Gender detection achieved an accuracy of about 90% and the combined gender and emotion recognition system improved the overall recognition rate of a gender-independent emotion recognition system by 2–4%.
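The cascaded architecture described above can be sketched as a gating step followed by two specialized classifiers. In this minimal illustration, the pitch threshold and the rule-based emotion classifiers are hypothetical stand-ins for the paper’s trained models, chosen only to show how routing by detected gender changes the decision boundary:

```python
def detect_gender(features):
    # Crude illustrative cue: mean pitch above ~165 Hz is taken as female.
    return "female" if features["mean_pitch_hz"] > 165 else "male"

def classify_emotion_male(features):
    # Hypothetical gender-dependent decision threshold on energy.
    return "angry" if features["energy"] > 0.7 else "neutral"

def classify_emotion_female(features):
    # The female model uses a different (hypothetical) threshold.
    return "angry" if features["energy"] > 0.8 else "neutral"

def recognize(features):
    """Route the utterance to the classifier matching the detected gender."""
    classifiers = {"male": classify_emotion_male,
                   "female": classify_emotion_female}
    return classifiers[detect_gender(features)](features)

# Identical energy, different detected gender, different label:
print(recognize({"mean_pitch_hz": 120, "energy": 0.75}))  # → angry
print(recognize({"mean_pitch_hz": 200, "energy": 0.75}))  # → neutral
```

The design point is that each downstream classifier only ever sees utterances from one gender, so its features and thresholds can be tuned for that subpopulation.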

1997

1994

1991