Integrating Respiration into Voice Activity Projection for Enhancing Turn-taking Performance

Takao Obi, Kotaro Funakoshi


Abstract
Voice Activity Projection (VAP) models predict upcoming voice activity on a continuous timescale, enabling more nuanced turn-taking behavior in spoken dialogue systems. Although previous studies have shown robust performance with audio-based VAP, the potential of incorporating additional physiological information, such as respiration, remains relatively unexplored. In this paper, we investigate whether respiratory information can enhance VAP performance in turn-taking. To this end, we collected Japanese dialogue data with synchronized audio and respiratory waveforms and integrated the respiratory information into the VAP model. Our results showed that the VAP model combining audio and respiratory information outperformed the audio-only model. This finding underscores the potential of respiration for improving the turn-taking performance of VAP.
Anthology ID:
2025.iwsds-1.28
Volume:
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology
Month:
May
Year:
2025
Address:
Bilbao, Spain
Editors:
Maria Ines Torres, Yuki Matsuda, Zoraida Callejas, Arantza del Pozo, Luis Fernando D'Haro
Venues:
IWSDS | WS
Publisher:
Association for Computational Linguistics
Pages:
272–276
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.iwsds-1.28/
Cite (ACL):
Takao Obi and Kotaro Funakoshi. 2025. Integrating Respiration into Voice Activity Projection for Enhancing Turn-taking Performance. In Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology, pages 272–276, Bilbao, Spain. Association for Computational Linguistics.
Cite (Informal):
Integrating Respiration into Voice Activity Projection for Enhancing Turn-taking Performance (Obi & Funakoshi, IWSDS 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.iwsds-1.28.pdf