Abstract
Deep neural networks have demonstrated a strong capacity for extracting features from speech inputs. However, these features may include non-linguistic factors such as timbre and speaker identity, which are not directly relevant to translation. In this paper, we propose CCSRD, a content-centric speech representation disentanglement learning framework for speech translation that decomposes speech representations into content representations and non-linguistic representations via representation disentanglement learning. CCSRD consists of a content encoder that encodes linguistic content information from the speech input, a non-content encoder that models non-linguistic speech features, and a disentanglement module that learns disentangled representations with a cyclic reconstructor, a feature reconstructor, and a speaker classifier trained in a multi-task learning manner. Experiments on the MuST-C benchmark demonstrate that CCSRD achieves an average improvement of +0.9 BLEU over the baseline in two settings across five translation directions, outperforming state-of-the-art end-to-end speech translation models and cascaded models.
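To make the described training setup concrete, below is a minimal PyTorch sketch of the multi-task disentanglement idea: two encoders over the same speech features, a feature reconstructor driven by both, and a speaker classifier on the non-content path. The module choices (GRU encoders), dimensions, and loss weights are illustrative assumptions, and the cyclic reconstructor is omitted for brevity; this is not the paper's exact architecture.

```python
# Minimal, illustrative sketch of content/non-content disentanglement with
# multi-task losses. All hyperparameters and module choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCSRDSketch(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, num_speakers=100):
        super().__init__()
        # Content encoder: linguistic information intended for translation.
        self.content_enc = nn.GRU(feat_dim, hidden, batch_first=True)
        # Non-content encoder: speaker/timbre and other non-linguistic factors.
        self.noncontent_enc = nn.GRU(feat_dim, hidden, batch_first=True)
        # Feature reconstructor: rebuild the input from both representations.
        self.reconstructor = nn.Linear(2 * hidden, feat_dim)
        # Speaker classifier applied to the non-content representation.
        self.spk_clf = nn.Linear(hidden, num_speakers)

    def forward(self, x):
        # x: (batch, time, feat_dim) speech features, e.g. log-Mel filterbanks.
        c, _ = self.content_enc(x)        # content representation
        s, _ = self.noncontent_enc(x)     # non-linguistic representation
        recon = self.reconstructor(torch.cat([c, s], dim=-1))
        spk_logits = self.spk_clf(s.mean(dim=1))  # utterance-level pooling
        return c, s, recon, spk_logits

def multitask_loss(x, recon, spk_logits, spk_labels, w_rec=1.0, w_spk=0.1):
    # Reconstruction encourages the two encoders to jointly cover the input;
    # speaker classification pushes speaker identity into the non-content path.
    rec_loss = F.mse_loss(recon, x)
    spk_loss = F.cross_entropy(spk_logits, spk_labels)
    return w_rec * rec_loss + w_spk * spk_loss
```

In the full framework, the content representation would feed the translation decoder, so the translation loss pulls linguistic information into the content path while the auxiliary losses above pull non-linguistic factors into the other.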
- Anthology ID:
- 2023.findings-emnlp.394
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2023
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Houda Bouamor, Juan Pino, Kalika Bali
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5920–5932
- URL:
- https://aclanthology.org/2023.findings-emnlp.394
- DOI:
- 10.18653/v1/2023.findings-emnlp.394
- Cite (ACL):
- Xiaohu Zhao, Haoran Sun, Yikun Lei, Shaolin Zhu, and Deyi Xiong. 2023. CCSRD: Content-Centric Speech Representation Disentanglement Learning for End-to-End Speech Translation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5920–5932, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- CCSRD: Content-Centric Speech Representation Disentanglement Learning for End-to-End Speech Translation (Zhao et al., Findings 2023)
- PDF:
- https://preview.aclanthology.org/emnlp-22-attachments/2023.findings-emnlp.394.pdf