Abstract
Speech recognition and speech synthesis models are typically trained separately, each with its own set of learning objectives, training data, and model parameters, resulting in two distinct large networks. We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters. Our evaluation demonstrates that the performance of our multi-task model is comparable to that of individually trained models while significantly saving computational and memory costs (~50% reduction in the total number of parameters required for the two tasks combined). We experiment with English as a resource-rich language, and Arabic as a relatively low-resource language due to a shortage of TTS data. Our models are trained with publicly available data, and both the training code and model checkpoints are openly available for further research.
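The joint objective described above can be pictured as a weighted sum of the two task losses computed over a largely shared network. Below is a minimal, hypothetical PyTorch sketch of such a shared-parameter multi-task setup; the class, head sizes, and equal loss weighting are illustrative assumptions, not the paper's actual STTATTS implementation.

```python
import torch.nn as nn

class SharedASRTTSModel(nn.Module):
    """Hypothetical sketch: one shared encoder backbone serves both ASR and
    TTS, with small task-specific output heads (assumed, not from the paper)."""

    def __init__(self, d_model=512, vocab_size=1000, n_mels=80):
        super().__init__()
        # Shared parameters: the bulk of the model, reused by both tasks,
        # which is where a large parameter saving would come from.
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        # Lightweight task-specific heads.
        self.asr_head = nn.Linear(d_model, vocab_size)  # speech -> token logits
        self.tts_head = nn.Linear(d_model, n_mels)      # text -> mel frames

def multitask_loss(asr_loss, tts_loss, alpha=0.5):
    # Assumed equal weighting; the paper's actual objective may differ.
    return alpha * asr_loss + (1.0 - alpha) * tts_loss
```

In training, each batch would mix ASR and TTS examples so the shared encoder receives gradients from both losses.

- Anthology ID: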
- 2024.findings-emnlp.401
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2024
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 6853–6863
- URL:
- https://aclanthology.org/2024.findings-emnlp.401/
- DOI:
- 10.18653/v1/2024.findings-emnlp.401
- Cite (ACL):
- Hawau Olamide Toyin, Hao Li, and Hanan Aldarmaki. 2024. STTATTS: Unified Speech-To-Text And Text-To-Speech Model. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6853–6863, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- STTATTS: Unified Speech-To-Text And Text-To-Speech Model (Toyin et al., Findings 2024)
- PDF:
- https://aclanthology.org/2024.findings-emnlp.401.pdf