Abstract
Automatic Speech Recognition (ASR) is essential for any voice-based application. The streaming capability of ASR becomes necessary to provide immediate feedback to the user in applications like Voice Search. LSTM/RNN and CTC based ASR systems are very simple to train and deploy for low latency streaming applications but have lower accuracy when compared to the state-of-the-art models. In this work, we build accurate LSTM, attention and CTC based streaming ASR models for large-scale Hinglish (blend of Hindi and English) Voice Search. We evaluate how various modifications in vanilla LSTM training improve the system’s accuracy while preserving the streaming capabilities. We also discuss a simple integration of end-of-speech (EOS) detection with CTC models, which helps reduce the overall search latency. Our model achieves a word error rate (WER) of 3.69% without EOS and 4.78% with EOS, with ~1300 ms (~46.64%) reduction in latency.
- Anthology ID:
- 2023.acl-industry.26
- Volume:
- Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 276–283
- URL:
- https://aclanthology.org/2023.acl-industry.26
- Cite (ACL):
- Abhinav Goyal and Nikesh Garera. 2023. Building Accurate Low Latency ASR for Streaming Voice Search in E-commerce. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 276–283, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- Building Accurate Low Latency ASR for Streaming Voice Search in E-commerce (Goyal & Garera, ACL 2023)
- PDF:
- https://preview.aclanthology.org/starsem-semeval-split/2023.acl-industry.26.pdf