Building Accurate Low Latency ASR for Streaming Voice Search in E-commerce

Abhinav Goyal, Nikesh Garera


Abstract
Automatic Speech Recognition (ASR) is essential for any voice-based application. The streaming capability of ASR becomes necessary to provide immediate feedback to the user in applications like Voice Search. LSTM/RNN- and CTC-based ASR systems are very simple to train and deploy for low-latency streaming applications but have lower accuracy than state-of-the-art models. In this work, we build accurate LSTM-, attention-, and CTC-based streaming ASR models for large-scale Hinglish (a blend of Hindi and English) Voice Search. We evaluate how various modifications to vanilla LSTM training improve the system’s accuracy while preserving its streaming capabilities. We also discuss a simple integration of end-of-speech (EOS) detection with CTC models, which helps reduce the overall search latency. Our model achieves a word error rate (WER) of 3.69% without EOS and 4.78% with EOS, with a ~1300 ms (~46.64%) reduction in latency.
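The WER figures in the abstract presumably follow the standard definition: the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that definition, assuming whitespace tokenization and no text normalization. The function name and the example queries are illustrative assumptions and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical voice-search queries, for illustration only.
exact = word_error_rate("show me red shoes", "show me red shoes")
noisy = word_error_rate("show me red shoes", "show red shows")
print(exact, noisy)
```

In the second example, one reference word is deleted and one is substituted, so two errors over four reference words give a WER of 0.5. Corpus-level WER is usually computed by summing errors and reference words over all utterances rather than averaging per-utterance rates.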
Anthology ID:
2023.acl-industry.26
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
Month:
July
Year:
2023
Address:
Toronto, Canada
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
276–283
URL:
https://aclanthology.org/2023.acl-industry.26
Cite (ACL):
Abhinav Goyal and Nikesh Garera. 2023. Building Accurate Low Latency ASR for Streaming Voice Search in E-commerce. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 276–283, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Building Accurate Low Latency ASR for Streaming Voice Search in E-commerce (Goyal & Garera, ACL 2023)
PDF:
https://preview.aclanthology.org/starsem-semeval-split/2023.acl-industry.26.pdf