Analysis of Voice Activity Detection Errors in API-based Streaming ASR for Human-Robot Dialogue

Kenta Yamamoto, Ryu Takeda, Kazunori Komatani


Abstract
In human-robot dialogue systems, streaming automatic speech recognition (ASR) services (e.g., Google ASR) are often used with the microphone positioned close to the robot’s loudspeaker. Under these conditions, the microphone captures both the robot’s and the user’s utterances, and the ASR service frequently fails to detect user speech. This study analyzes voice activity detection (VAD) errors by comparing results from such streaming ASR with those from standalone VAD models. Experiments on three distinct dialogue datasets showed that streaming ASR tends to ignore user utterances that immediately follow system utterances. We discuss the underlying causes of these VAD errors and provide recommendations for improving VAD performance in human-robot dialogue.
Anthology ID:
2025.iwsds-1.26
Volume:
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology
Month:
May
Year:
2025
Address:
Bilbao, Spain
Editors:
Maria Ines Torres, Yuki Matsuda, Zoraida Callejas, Arantza del Pozo, Luis Fernando D'Haro
Venues:
IWSDS | WS
Publisher:
Association for Computational Linguistics
Pages:
245–253
URL:
https://preview.aclanthology.org/landing_page/2025.iwsds-1.26/
Cite (ACL):
Kenta Yamamoto, Ryu Takeda, and Kazunori Komatani. 2025. Analysis of Voice Activity Detection Errors in API-based Streaming ASR for Human-Robot Dialogue. In Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology, pages 245–253, Bilbao, Spain. Association for Computational Linguistics.
Cite (Informal):
Analysis of Voice Activity Detection Errors in API-based Streaming ASR for Human-Robot Dialogue (Yamamoto et al., IWSDS 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.iwsds-1.26.pdf