Abstract
We propose to generalize language models for conversational speech recognition to allow them to operate across utterance boundaries and speaker changes, thereby capturing conversation-level phenomena such as adjacency pairs, lexical entrainment, and topical coherence. The model consists of a long short-term memory (LSTM) recurrent network that reads the entire word-level history of a conversation, as well as information about turn taking and speaker overlap, in order to predict each next word. The model is applied in a rescoring framework, where the word history prior to the current utterance is approximated with preliminary recognition results. In experiments in the conversational telephone speech domain (Switchboard), we find that such a model gives substantial perplexity reductions over a standard LSTM-LM with utterance scope, as well as improvements in word error rate.
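As a rough illustration of the idea (this is not the authors' code), the sketch below implements a session-level LSTM LM in PyTorch: the conversation is treated as a single token stream across utterances, with hypothetical special tokens such as `<sc>` (speaker change) and `<ov>` (overlap) interleaved with the words to convey the turn-taking signals the abstract mentions. The LSTM state is carried across utterance boundaries, and a small helper scores an n-best hypothesis given the state accumulated from preliminary recognition of earlier utterances. All names and dimensions are assumptions, not taken from the paper.

```python
# Minimal sketch of a session-level LSTM LM, assuming PyTorch.
# The token stream spans the whole conversation; hypothetical special
# tokens like <sc> (speaker change) and <ov> (overlap) are assumed to
# be interleaved with the words before encoding to ids.
import torch
import torch.nn as nn

class SessionLSTMLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) token ids; `state` is the (h, c) pair
        # carried across utterance boundaries, so each prediction is
        # conditioned on the whole session history, not one utterance.
        out, state = self.lstm(self.embed(tokens), state)
        return self.proj(out), state

def score_hypothesis(model, state, hyp_ids):
    # Rescoring sketch: log-probability of one n-best hypothesis given
    # the session state built from preliminary (1-best) results for the
    # preceding utterances. hyp_ids: 1-D LongTensor of token ids.
    with torch.no_grad():
        logits, _ = model(hyp_ids.unsqueeze(0), state)
        logp = torch.log_softmax(logits[0, :-1], dim=-1)
        # Sum the log-probabilities of each next token in the hypothesis.
        return logp.gather(1, hyp_ids[1:].unsqueeze(1)).sum().item()
```

After rescoring an utterance's n-best list, one would advance `state` by running the selected (or 1-best) hypothesis through the model, which mirrors the paper's approximation of the true word history with preliminary recognition results.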
- Anthology ID: D18-1296
- Volume: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
- Month: October-November
- Year: 2018
- Address: Brussels, Belgium
- Editors: Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
- Venue: EMNLP
- SIG: SIGDAT
- Publisher: Association for Computational Linguistics
- Pages: 2764–2768
- URL: https://aclanthology.org/D18-1296
- DOI: 10.18653/v1/D18-1296
- Cite (ACL): Wayne Xiong, Lingfeng Wu, Jun Zhang, and Andreas Stolcke. 2018. Session-level Language Modeling for Conversational Speech. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2764–2768, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal): Session-level Language Modeling for Conversational Speech (Xiong et al., EMNLP 2018)
- PDF: https://preview.aclanthology.org/ingest-acl-2023-videos/D18-1296.pdf