Audio Segmentation for Robust Real-Time Speech Recognition Based on Neural Networks

Micha Wetzel; Matthias Sperber; Alex Waibel

Abstract

Speech that contains multimedia content can pose a serious challenge for real-time automatic speech recognition (ASR) for two reasons: (1) the ASR produces meaningless output, hurting the readability of the transcript; (2) the search space of the ASR blows up when multimedia content is encountered, resulting in large delays that compromise real-time requirements. This paper introduces a segmenter that aims to remove these problems by detecting music and noise segments in real time and replacing them with silence. We propose a two-step approach consisting of frame classification and smoothing. First, a classifier detects speech and multimedia at the frame level. In the second step, the smoothing algorithm considers the temporal context to prevent rapid class fluctuations. We investigate frame classification and smoothing settings to obtain an appealing accuracy-latency trade-off. The proposed segmenter increases the transcript quality of an ASR system by removing on average 39% of the errors caused by non-speech in the audio stream, while maintaining a delay of 270 milliseconds, which is suitable for real-time use.
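The two-step idea described in the abstract can be illustrated with a minimal sketch. It assumes that per-frame class labels are already available from some classifier, and it uses a causal majority vote over the most recent frames as a stand-in smoothing rule. The abstract does not specify the smoothing algorithm, so the function names, the label strings, and the `window` parameter below are hypothetical choices for illustration only.

```python
from collections import Counter, deque

# Hypothetical class labels; the abstract only names speech, music and noise.
SPEECH, MUSIC, NOISE = "speech", "music", "noise"


def smooth_labels(frame_labels, window=5):
    """Causal majority-vote smoothing (an assumed rule, not the paper's).

    The output class changes only when one class holds a strict majority
    among the last `window` frame labels, which suppresses rapid
    fluctuations at the cost of a delayed switch.
    """
    recent = deque(maxlen=window)  # keeps only the newest `window` labels
    smoothed = []
    current = None
    for label in frame_labels:
        recent.append(label)
        winner, count = Counter(recent).most_common(1)[0]
        # Switch on a strict majority; otherwise keep the previous decision.
        if current is None or count * 2 > len(recent):
            current = winner
        smoothed.append(current)
    return smoothed


def silence_non_speech(frames, labels):
    """Replace the samples of every non-speech frame with zeros."""
    return [
        frame if label == SPEECH else [0.0] * len(frame)
        for frame, label in zip(frames, labels)
    ]


raw = [SPEECH, SPEECH, SPEECH, MUSIC, SPEECH,
       SPEECH, MUSIC, MUSIC, MUSIC, MUSIC]
print(smooth_labels(raw, window=5))
```

In this sketch the isolated music label at the fourth frame is absorbed, and the switch to music happens only once music holds the majority of the window. A longer window gives more stable labels but a later switch, which is one simple way to picture the accuracy-latency trade-off the abstract mentions.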

Anthology ID: 2016.iwslt-1.4
Volume: Proceedings of the 13th International Conference on Spoken Language Translation
Month: December 8-9
Year: 2016
Address: Seattle, Washington D.C
Editors: Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Rolando Cattoni, Marcello Federico
Venue: IWSLT
SIG: SIGSLT
Publisher: International Workshop on Spoken Language Translation
URL: https://aclanthology.org/2016.iwslt-1.4
Cite (ACL): Micha Wetzel, Matthias Sperber, and Alexander Waibel. 2016. Audio Segmentation for Robust Real-Time Speech Recognition Based on Neural Networks. In Proceedings of the 13th International Conference on Spoken Language Translation, Seattle, Washington D.C. International Workshop on Spoken Language Translation.
Cite (Informal): Audio Segmentation for Robust Real-Time Speech Recognition Based on Neural Networks (Wetzel et al., IWSLT 2016)
PDF: https://preview.aclanthology.org/emnlp-22-attachments/2016.iwslt-1.4.pdf
Data: MUSAN
