Abstract
Recent dialogue systems typically operate through turn-based spoken interactions between users and agents. These systems heavily depend on accurate Automatic Speech Recognition (ASR), as transcription errors can significantly degrade performance in downstream dialogue tasks. To alleviate this challenge, robust ASR is required, and one effective method is to utilize the dialogue context from user and agent interactions for transcribing the subsequent user utterance. This method incorporates the transcription of the user’s speech and the agent’s response as model input, using the accumulated context generated by each turn. However, this context is susceptible to ASR errors because the ASR model generates it auto-regressively. Such noisy context can further degrade the benefits of context input, resulting in suboptimal ASR performance. In this paper, we introduce context noise representation learning to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy. To maximize the advantage of context awareness, our approach involves decoder pre-training with text-based dialogue data and noise representation learning for a context encoder. Evaluated on DSTC11 (MultiWoZ 2.1 audio dialogues), it achieves a 24% relative reduction in Word Error Rate (WER) compared to wav2vec2.0 baselines and a 13% reduction compared to Whisper-large-v2. Notably, in noisy environments where user speech is barely audible, our method proves its effectiveness by utilizing contextual information for accurate transcription. Tested on audio data with strong noise level (Signal Noise Ratio of 0dB), our approach shows up to a 31% relative WER reduction compared to the wav2vec2.0 baseline, providing a reassuring solution for real-world noisy scenarios.