Abstract
We propose to generalize language models for conversational speech recognition to allow them to operate across utterance boundaries and speaker changes, thereby capturing conversation-level phenomena such as adjacency pairs, lexical entrainment, and topical coherence. The model consists of a long short-term memory (LSTM) recurrent network that reads the entire word-level history of a conversation, as well as information about turn taking and speaker overlap, in order to predict each next word. The model is applied in a rescoring framework, where the word history prior to the current utterance is approximated with preliminary recognition results. In experiments in the conversational telephone speech domain (Switchboard) we find that such a model gives substantial perplexity reductions over a standard LSTM-LM with utterance scope, as well as improvements in word error rate.
- Anthology ID:
- D18-1296
- Volume:
- Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
- Month:
- October-November
- Year:
- 2018
- Address:
- Brussels, Belgium
- Editors:
- Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
- Venue:
- EMNLP
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 2764–2768
- Language:
- URL:
- https://aclanthology.org/D18-1296
- DOI:
- 10.18653/v1/D18-1296
- Cite (ACL):
- Wayne Xiong, Lingfeng Wu, Jun Zhang, and Andreas Stolcke. 2018. Session-level Language Modeling for Conversational Speech. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2764–2768, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal):
- Session-level Language Modeling for Conversational Speech (Xiong et al., EMNLP 2018)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/D18-1296.pdf
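The core idea in the abstract — feeding an LSTM-LM the whole conversation rather than isolated utterances, annotated with turn-taking information — can be sketched in data-preparation terms as concatenating all turns of a session into one token stream with explicit speaker-change markers. This is an illustrative sketch only, not the authors' code: the `<ts>` speaker-change token and the exact preprocessing are assumptions for illustration.

```python
# Illustrative sketch (assumed, not from the paper): build a single
# session-level token stream from a two-party conversation so that an
# LSTM-LM can condition each next-word prediction on the full
# cross-utterance history. The "<ts>" token marking speaker changes is
# a hypothetical choice for this example.

def session_stream(turns):
    """turns: list of (speaker, word_list) tuples in time order.

    Returns one flat token sequence for the whole session, with a
    "<ts>" token inserted wherever the speaker changes.
    """
    tokens = []
    prev_speaker = None
    for speaker, words in turns:
        if prev_speaker is not None and speaker != prev_speaker:
            tokens.append("<ts>")  # mark a speaker change
        tokens.extend(words)
        prev_speaker = speaker
    return tokens

conversation = [
    ("A", ["hello", "how", "are", "you"]),
    ("B", ["fine", "thanks"]),
    ("B", ["and", "you"]),      # same speaker continues: no "<ts>"
    ("A", ["good"]),
]
print(session_stream(conversation))
# → ['hello', 'how', 'are', 'you', '<ts>', 'fine', 'thanks', 'and', 'you', '<ts>', 'good']
```

In a rescoring setup, as the abstract describes, the history tokens preceding the current utterance would come from preliminary recognition hypotheses rather than reference transcripts.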