@inproceedings{kin-lam-etal-2024-compact,
title = "Compact Speech Translation Models via Discrete Speech Units Pretraining",
author = "Lam, Tsz Kin and
Birch, Alexandra and
Haddow, Barry",
editor = "Salesky, Elizabeth and
Federico, Marcello and
Carpuat, Marine",
booktitle = "Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)",
month = aug,
year = "2024",
address = "Bangkok, Thailand (in-person and online)",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/jlcl-multiple-ingestion/2024.iwslt-1.16/",
doi = "10.18653/v1/2024.iwslt-1.16",
pages = "114--124",
abstract = "We propose a pretraining method to use Self-Supervised Speech (SSS) model to creating more compact Speech-to-text Translation. In contrast to using the SSS model for initialization, our method is more suitable to memory constrained scenario such as on-device deployment. Our method is based on Discrete Speech Units (DSU) extracted from the SSS model. In the first step, our method pretrains two smaller encoder-decoder models on 1) Filterbank-to-DSU (Fbk-to-DSU) and 2) DSU-to-Translation (DSU-to-Trl) data respectively. The DSU thus become the distillation inputs of the smaller models. Subsequently, the encoder from the Fbk-to-DSU model and the decoder from the DSU-to-Trl model are taken to initialise the compact model. Finally, the compact model is finetuned on the paired Fbk-Trl data. In addition to being compact, our method requires no transcripts, making it applicable to low-resource settings. It also avoids speech discretization in inference and is more robust to the DSU tokenization. Evaluation on CoVoST-2 (X-En) shows that our method has consistent improvement over the baseline in three metrics while being compact i.e., only half the SSS model size."
}
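The abstract describes a two-stage pipeline: pretrain a Fbk-to-DSU model and a DSU-to-Trl model, then build the compact speech translation model from the former's encoder and the latter's decoder before finetuning on paired Fbk-Trl data. Below is a minimal PyTorch sketch of that composition step, not the authors' implementation; all module names, layer counts, and sizes are illustrative assumptions.

```python
# Sketch (hypothetical, based only on the abstract): compose a compact ST model
# from the Fbk->DSU encoder and the DSU->Trl decoder, then finetune on Fbk-Trl.
import torch.nn as nn

d_model = 256  # hypothetical hidden size of the compact model


class EncoderDecoder(nn.Module):
    """Generic encoder-decoder wrapper used for all three models here."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, tgt):
        memory = self.encoder(src)          # encode source features
        return self.decoder(tgt, memory)    # decode conditioned on encoder memory


def make_model(enc_layers: int, dec_layers: int) -> EncoderDecoder:
    # Placeholder Transformer stacks; real models would add input embeddings,
    # positional encodings, and output projections.
    enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
    return EncoderDecoder(
        encoder=nn.TransformerEncoder(enc_layer, num_layers=enc_layers),
        decoder=nn.TransformerDecoder(dec_layer, num_layers=dec_layers),
    )


# The two smaller pretrained models (pretraining loops omitted):
fbk_to_dsu = make_model(enc_layers=6, dec_layers=3)  # filterbank -> discrete speech units
dsu_to_trl = make_model(enc_layers=3, dec_layers=6)  # discrete speech units -> translation

# Initialise the compact ST model from the Fbk->DSU encoder and DSU->Trl decoder,
# then finetune it end-to-end on paired Fbk-Trl data. At inference it maps
# filterbank features directly to translations, with no speech discretization.
compact_st = EncoderDecoder(encoder=fbk_to_dsu.encoder, decoder=dsu_to_trl.decoder)
```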
Markdown (Informal)
[Compact Speech Translation Models via Discrete Speech Units Pretraining](https://preview.aclanthology.org/jlcl-multiple-ingestion/2024.iwslt-1.16/) (Lam et al., IWSLT 2024)