CRAF: Cross-Modal Representation Alignment and Fusion for Speech Translation

Zhenbei Guo, Wenzhou Wu, Hua Lai, Yan Xiang, Yuxin Huang, Zhengtao Yu


Abstract
The end-to-end speech translation task involves directly transforming speech into text in another language, bypassing the generation of an intermediate transcription. However, existing methods may lose key information during cross-modal length alignment and fail to effectively integrate different representations, resulting in low-quality fused representations. To address these issues, we propose CRAF, an efficient method for effective cross-modal alignment and fusion in speech translation, which reduces information loss and enhances the integration of cross-modal representations. First, CRAF minimizes information loss by improving cross-modal length alignment, ensuring the alignment process retains more critical information from the speech modality. Second, CRAF strengthens the integration of cross-modal representations by allowing the model to combine complementary features from diverse modalities, enhancing its capacity to concentrate on the most pertinent and critical information. Finally, we evaluate CRAF through extensive experiments on eight language pairs from the MuST-C dataset. Experiments show that CRAF achieves an average BLEU score of 29.0, outperforming the other comparison methods. Our code is available at https://github.com/wu-wen-zhou/first/tree/master.
Anthology ID:
2025.ccl-1.78
Volume:
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)
Month:
August
Year:
2025
Address:
Jinan, China
Editors:
Maosong Sun, Peiyong Duan, Zhiyuan Liu, Ruifeng Xu, Weiwei Sun
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
1031–1042
URL:
https://preview.aclanthology.org/ingest-ccl/2025.ccl-1.78/
Cite (ACL):
Zhenbei Guo, Wenzhou Wu, Hua Lai, Yan Xiang, Yuxin Huang, and Zhengtao Yu. 2025. CRAF: Cross-Modal Representation Alignment and Fusion for Speech Translation. In Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025), pages 1031–1042, Jinan, China. Chinese Information Processing Society of China.
Cite (Informal):
CRAF: Cross-Modal Representation Alignment and Fusion for Speech Translation (Guo et al., CCL 2025)
PDF:
https://preview.aclanthology.org/ingest-ccl/2025.ccl-1.78.pdf