Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction

Sam O’Connor Russell, Naomi Harte


Abstract
Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work, which aggregates all holds and shifts, we group by the duration of silence between turns. This reveals that, through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Thus, our working hypothesis is that when interlocutors can see one another, visual cues are vital for turn-taking and must therefore be included for accurate turn-taking prediction. We additionally validate the suitability of automatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.
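To illustrate the evaluation idea described in the abstract, the sketch below (not the authors' code; event fields and bucket edges are illustrative assumptions) shows how hold/shift prediction accuracy can be scored per silence-duration bucket rather than as a single aggregate, which is what reveals performance differences across transition gaps.

```python
# Minimal sketch: bucket speaker transitions by the duration of silence
# between turns and compute hold/shift accuracy per bucket.
# The TransitionEvent fields and bucket edges are hypothetical, chosen
# only to illustrate the grouped evaluation described in the abstract.
from dataclasses import dataclass

@dataclass
class TransitionEvent:
    gap_s: float      # silence between turns, in seconds
    label: str        # ground truth: "hold" or "shift"
    prediction: str   # model output: "hold" or "shift"

def accuracy_by_gap(events, edges=(0.0, 0.25, 0.5, 1.0, float("inf"))):
    """Return hold/shift accuracy for each silence-duration bucket."""
    buckets = {(lo, hi): [] for lo, hi in zip(edges, edges[1:])}
    for ev in events:
        for lo, hi in buckets:
            if lo <= ev.gap_s < hi:
                buckets[(lo, hi)].append(ev.label == ev.prediction)
                break
    return {
        f"{lo:g}-{hi:g}s": sum(hits) / len(hits)
        for (lo, hi), hits in buckets.items()
        if hits  # skip empty buckets
    }

# A model can look strong in aggregate yet lag on long inter-turn gaps:
events = [
    TransitionEvent(0.10, "shift", "shift"),
    TransitionEvent(0.40, "hold", "hold"),
    TransitionEvent(1.20, "shift", "hold"),
]
print(accuracy_by_gap(events))
```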
Anthology ID:
2025.findings-acl.12
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
209–221
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.12/
Cite (ACL):
Sam O’Connor Russell and Naomi Harte. 2025. Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction. In Findings of the Association for Computational Linguistics: ACL 2025, pages 209–221, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction (Russell & Harte, Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.12.pdf