Jichen Yang


2023

One of the main problems in speech translation is the mismatches between different modalities. The second problem, scarcity of parallel data covering multiple modalities, means that the end-to-end multi-modal models tend to perform worse than cascade models, although there are exceptions under favorable conditions. To address these problems, we propose an end-to-end zero-shot speech translation model, connecting two pre-trained uni-modality modules via word rotator’s distance. The model retains the ability of zero-shot, which is like cascade models, and also can be trained in an end-to-end style to avoid error propagation. Our comprehensive experiments on the MuST-C benchmarks show that our end-to-end zero-shot approach performs better than or as well as those of the CTC-based cascade models and that our end-to-end model with supervised training also matches the latest baselines.