Language Tokens: Simply Improving Zero-Shot Multi-Aligned Translation in Encoder-Decoder Models
Muhammad N ElNokrashy, Amr Hendy, Mohamed Maher, Mohamed Afify, Hany Hassan
Abstract
This paper proposes a simple and effective method to improve direct translation, both in the zero-shot case and when direct data is available. We modify the input tokens at both the encoder and decoder to include signals for the source and target languages. We show a performance gain when training from scratch or when finetuning a pretrained model with the proposed setup. In in-house experiments, our method shows nearly a 10.0 BLEU point difference depending on the stopping criteria. In a WMT-based setting, we see improvements of 1.3 and 0.4 BLEU points in the zero-shot setting and when training with direct data, respectively, while from-English performance improves by 4.17 and 0.85 BLEU points. In the low-resource setting, we see a 1.5–1.7 point improvement when finetuning on directly translated domain data.
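The mechanism the abstract describes (language tokens added on both the encoder and decoder sides) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the token format (`<src_xx>`, `<tgt_xx>`), the function name, and placing the target token at the start of the decoder input are all assumptions made for the example.

```python
# Minimal sketch (assumed API, not the paper's implementation): tagging a
# parallel example with source- and target-language tokens before feeding
# an encoder-decoder translation model.

def tag_example(src_tokens, tgt_tokens, src_lang, tgt_lang):
    """Prepend language tokens on both the encoder and decoder sides.

    The <src_xx>/<tgt_xx> token spelling is an assumption; the paper's
    point is that both sides receive explicit language signals.
    """
    enc_input = [f"<src_{src_lang}>", f"<tgt_{tgt_lang}>"] + src_tokens
    dec_input = [f"<tgt_{tgt_lang}>"] + tgt_tokens
    return enc_input, dec_input

enc, dec = tag_example(
    ["Guten", "Morgen"], ["Good", "morning"], src_lang="de", tgt_lang="en"
)
print(enc)  # ['<src_de>', '<tgt_en>', 'Guten', 'Morgen']
print(dec)  # ['<tgt_en>', 'Good', 'morning']
```

In practice the language tokens would be added to the model's vocabulary so they receive trainable embeddings, which is what lets the same tagging scheme work both when training from scratch and when finetuning a pretrained model.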
- Anthology ID:
- 2022.amta-research.6
- Volume:
- Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
- Month:
- September
- Year:
- 2022
- Address:
- Orlando, USA
- Editors:
- Kevin Duh, Francisco Guzmán
- Venue:
- AMTA
- Publisher:
- Association for Machine Translation in the Americas
- Pages:
- 70–82
- URL:
- https://aclanthology.org/2022.amta-research.6
- Cite (ACL):
- Muhammad N ElNokrashy, Amr Hendy, Mohamed Maher, Mohamed Afify, and Hany Hassan. 2022. Language Tokens: Simply Improving Zero-Shot Multi-Aligned Translation in Encoder-Decoder Models. In Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 70–82, Orlando, USA. Association for Machine Translation in the Americas.
- Cite (Informal):
- Language Tokens: Simply Improving Zero-Shot Multi-Aligned Translation in Encoder-Decoder Models (ElNokrashy et al., AMTA 2022)
- PDF:
- https://aclanthology.org/2022.amta-research.6.pdf
- Data:
- CCAligned, CCMatrix