Naomi Harte

2025

pdf bib abs
Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction
Sam O’Connor Russell | Naomi Harte
Findings of the Association for Computational Linguistics: ACL 2025

Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate natural- istic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only in videoconfer- encing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work which aggregates all holds and shifts, we group by duration of silence between turns. This reveals that through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio- only turn-taking model across all durations of speaker transitions. We conduct a detailed abla- tion study, which reveals that facial expression features contribute the most to model perfor- mance. Thus, our working hypothesis is that when interlocutors can see one another, visual cues are vital for turn-taking and must therefore be included for accurate turn-taking prediction. We additionally validate the suitability of au- tomatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.

2022

pdf bib abs
RoomReader: A Multimodal Corpus of Online Multiparty Conversational Interactions
Justine Reverdy | Sam O’Connor Russell | Louise Duquenne | Diego Garaialde | Benjamin R. Cowan | Naomi Harte
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present RoomReader, a corpus of multimodal, multiparty conversational interactions in which participants followed a collaborative student-tutor scenario designed to elicit spontaneous speech. The corpus was developed within the wider RoomReader Project to explore multimodal cues of conversational engagement and behavioural aspects of collaborative interaction in online environments. However, the corpus can be used to study a wide range of phenomena in online multimodal interaction. The publicly-shared corpus consists of over 8 hours of video and audio recordings from 118 participants in 30 gender-balanced sessions, in the “in-the-wild” online environment of Zoom. The recordings have been edited, synchronised, and fully transcribed. Student participants have been continuously annotated for engagement with a novel continuous scale. We provide questionnaires measuring engagement and group cohesion collected from the annotators, tutors and participants themselves. We also make a range of accompanying data available such as personality tests and behavioural assessments. The dataset and accompanying psychometrics present a rich resource enabling the exploration of a range of downstream tasks across diverse fields including linguistics and artificial intelligence. This could include the automatic detection of student engagement, analysis of group interaction and collaboration in online conversation, and the analysis of conversational behaviours in an online setting.

2020

pdf bib abs
Neural Generation of Dialogue Response Timings
Matthew Roddy | Naomi Harte
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The timings of spoken response offsets in human dialogue have been shown to vary based on contextual elements of the dialogue. We propose neural models that simulate the distributions of these response offsets, taking into account the response turn as well as the preceding turn. The models are designed to be integrated into the pipeline of an incremental spoken dialogue system (SDS). We evaluate our models using offline experiments as well as human listening tests. We show that human listeners consider certain response timings to be more natural based on the dialogue context. The introduction of these models into SDS pipelines could increase the perceived naturalness of interactions.

Co-authors

Matthew Roddy 1

Venues

Fix author