wav2vec-S: Adapting Pre-trained Speech Models for Streaming
Biao Fu, Kai Fan, Minpeng Liao, Yidong Chen, Xiaodong Shi, Zhongqiang Huang
Abstract
Pre-trained speech models, such as wav2vec 2.0, have significantly advanced speech-related tasks, including speech recognition and translation. However, their applicability in streaming scenarios is limited because these models are trained on complete utterances, leading to a mismatch with incremental streaming inputs. This paper identifies three critical design aspects within the architecture of wav2vec 2.0 and proposes a novel model, wav2vec-S, which incorporates simple modifications to ensure consistent speech representations during both training and inference phases for streaming speech inputs. Furthermore, we demonstrate that wav2vec-S models can be efficiently adapted from pre-trained wav2vec 2.0 models through continued pre-training and effectively fine-tuned to meet various latency requirements in downstream applications. Experiments on speech recognition and translation tasks show that wav2vec-S outperforms strong baseline models and achieves a superior balance between quality and latency.
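The abstract does not spell out the three design aspects, but the central requirement, that chunked streaming inputs yield the same representations the model would produce on the full utterance, is commonly met with a block-causal (chunked) attention mask. The sketch below is an illustrative assumption, not the paper's implementation: `block_causal_mask` and the chunk size are hypothetical, and wav2vec 2.0's convolutional feature extractor and positional convolution would need analogous streaming-safe treatment.

```python
import torch
import torch.nn.functional as F

def block_causal_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Boolean mask (True = may attend): each frame sees its own chunk and
    all earlier chunks, never future chunks."""
    chunk_ids = torch.arange(seq_len) // chunk_size          # chunk index per frame
    return chunk_ids.unsqueeze(1) >= chunk_ids.unsqueeze(0)  # query chunk >= key chunk

# Toy check of train/inference consistency: with the mask, the first 8 frames
# of a 12-frame utterance get the same representation whether we process the
# streaming prefix alone or the whole utterance at once.
torch.manual_seed(0)
x = torch.randn(1, 12, 16)  # (batch, frames, dim)
full = F.scaled_dot_product_attention(
    x, x, x, attn_mask=block_causal_mask(12, chunk_size=4))
prefix = F.scaled_dot_product_attention(
    x[:, :8], x[:, :8], x[:, :8], attn_mask=block_causal_mask(8, chunk_size=4))
print(torch.allclose(full[:, :8], prefix, atol=1e-6))  # expected: True
```

Varying the chunk size at fine-tuning time is one plausible way to trade quality against latency, in the spirit of the abstract's claim that wav2vec-S can be fine-tuned to meet different latency requirements.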
- Anthology ID: 2024.findings-acl.681
- Volume: Findings of the Association for Computational Linguistics ACL 2024
- Month: August
- Year: 2024
- Address: Bangkok, Thailand and virtual meeting
- Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 11465–11480
- URL: https://aclanthology.org/2024.findings-acl.681
- DOI: 10.18653/v1/2024.findings-acl.681
- Cite (ACL): Biao Fu, Kai Fan, Minpeng Liao, Yidong Chen, Xiaodong Shi, and Zhongqiang Huang. 2024. wav2vec-S: Adapting Pre-trained Speech Models for Streaming. In Findings of the Association for Computational Linguistics ACL 2024, pages 11465–11480, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
- Cite (Informal): wav2vec-S: Adapting Pre-trained Speech Models for Streaming (Fu et al., Findings 2024)
- PDF: https://preview.aclanthology.org/ingest-2024-clasp/2024.findings-acl.681.pdf