wav2vec-S: Adapting Pre-trained Speech Models for Streaming
Biao Fu, Kai Fan, Minpeng Liao, Yidong Chen, Xiaodong Shi, Zhongqiang Huang
Abstract
Pre-trained speech models, such as wav2vec 2.0, have significantly advanced speech-related tasks, including speech recognition and translation. However, their applicability in streaming scenarios is limited because these models are trained on complete utterances, leading to a mismatch with incremental streaming inputs. This paper identifies three critical design aspects within the architecture of wav2vec 2.0 and proposes a novel model, wav2vec-S, which incorporates simple modifications to ensure consistent speech representations during both training and inference phases for streaming speech inputs. Furthermore, we demonstrate that wav2vec-S models can be efficiently adapted from pre-trained wav2vec 2.0 models through continued pre-training and effectively fine-tuned to meet various latency requirements in downstream applications. Experiments on speech recognition and translation tasks show that wav2vec-S outperforms strong baseline models and achieves a superior balance between quality and latency.
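The abstract does not spell out the three design aspects, but the central requirement, that chunked streaming inputs yield the same representations the model would produce on the full utterance, is commonly met with a block-causal (chunked) attention mask. The sketch below is an illustrative assumption, not the paper's implementation: `block_causal_mask` and the chunk size are hypothetical, and wav2vec 2.0's convolutional feature extractor and positional convolution would need analogous streaming-safe treatment.

```python
import torch
import torch.nn.functional as F

def block_causal_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Boolean mask (True = may attend): each frame sees its own chunk and
    all earlier chunks, never future chunks."""
    chunk_ids = torch.arange(seq_len) // chunk_size          # chunk index per frame
    return chunk_ids.unsqueeze(1) >= chunk_ids.unsqueeze(0)  # query chunk >= key chunk

# Toy check of train/inference consistency: with the mask, the first 8 frames
# of a 12-frame utterance get the same representation whether we process the
# streaming prefix alone or the whole utterance at once.
torch.manual_seed(0)
x = torch.randn(1, 12, 16)  # (batch, frames, dim)
full = F.scaled_dot_product_attention(
    x, x, x, attn_mask=block_causal_mask(12, chunk_size=4))
prefix = F.scaled_dot_product_attention(
    x[:, :8], x[:, :8], x[:, :8], attn_mask=block_causal_mask(8, chunk_size=4))
print(torch.allclose(full[:, :8], prefix, atol=1e-6))  # expected: True
```

Varying the chunk size at fine-tuning time is one plausible way to trade quality against latency, in the spirit of the abstract's claim that wav2vec-S can be fine-tuned to meet different latency requirements.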
- Anthology ID: 2024.findings-acl.681
- Volume: Findings of the Association for Computational Linguistics ACL 2024
- Month: August
- Year: 2024
- Address: Bangkok, Thailand and virtual meeting
- Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 11465–11480
- URL: https://aclanthology.org/2024.findings-acl.681
- DOI: 10.18653/v1/2024.findings-acl.681
- Cite (ACL): Biao Fu, Kai Fan, Minpeng Liao, Yidong Chen, Xiaodong Shi, and Zhongqiang Huang. 2024. wav2vec-S: Adapting Pre-trained Speech Models for Streaming. In Findings of the Association for Computational Linguistics ACL 2024, pages 11465–11480, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
- Cite (Informal): wav2vec-S: Adapting Pre-trained Speech Models for Streaming (Fu et al., Findings 2024)
- PDF: https://preview.aclanthology.org/ingest-2024-clasp/2024.findings-acl.681.pdf