wav2vec-S: Adapting Pre-trained Speech Models for Streaming

Biao Fu, Kai Fan, Minpeng Liao, Yidong Chen, Xiaodong Shi, Zhongqiang Huang


Abstract
Pre-trained speech models, such as wav2vec 2.0, have significantly advanced speech-related tasks, including speech recognition and translation. However, their applicability in streaming scenarios is limited because these models are trained on complete utterances, leading to a mismatch with incremental streaming inputs. This paper identifies three critical design aspects within the architecture of wav2vec 2.0 and proposes a novel model, wav2vec-S, which incorporates simple modifications to ensure consistent speech representations during both training and inference phases for streaming speech inputs. Furthermore, we demonstrate that wav2vec-S models can be efficiently adapted from pre-trained wav2vec 2.0 models through continued pre-training and effectively finetuned to meet various latency requirements in downstream applications. Experiments on speech recognition and translation tasks show that wav2vec-S outperforms strong baseline models and achieves a superior balance between quality and latency.
Anthology ID:
2024.findings-acl.681
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11465–11480
URL:
https://aclanthology.org/2024.findings-acl.681
DOI:
10.18653/v1/2024.findings-acl.681
Cite (ACL):
Biao Fu, Kai Fan, Minpeng Liao, Yidong Chen, Xiaodong Shi, and Zhongqiang Huang. 2024. wav2vec-S: Adapting Pre-trained Speech Models for Streaming. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11465–11480, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
wav2vec-S: Adapting Pre-trained Speech Models for Streaming (Fu et al., Findings 2024)
PDF:
https://preview.aclanthology.org/autopr/2024.findings-acl.681.pdf