Abstract
Progress in document-level Machine Translation is hindered by the lack of parallel training data that include context information. In this work, we evaluate the potential of data augmentation techniques to circumvent these limitations, showing that significant gains can be achieved via upsampling, similar context sampling and back-translation, targeted at context-relevant data. We apply these methods to standard document-level datasets in English-German and English-French and demonstrate their relevance for improving the translation of contextual phenomena. In particular, we show that relatively small volumes of targeted data augmentation lead to significant improvements over a strong context-concatenation baseline and standard back-translation of document-level data. We also compare the accuracy of the selected methods depending on data volumes or distance to relevant context information, and explore their use in combination.
- Anthology ID:
- 2023.mtsummit-research.25
- Volume:
- Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
- Month:
- September
- Year:
- 2023
- Address:
- Macau SAR, China
- Editors:
- Masao Utiyama, Rui Wang
- Venue:
- MTSummit
- Publisher:
- Asia-Pacific Association for Machine Translation
- Pages:
- 298–312
- URL:
- https://aclanthology.org/2023.mtsummit-research.25
- Cite (ACL):
- Harritxu Gete, Thierry Etchegoyhen, and Gorka Labaka. 2023. Targeted Data Augmentation Improves Context-aware Neural Machine Translation. In Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track, pages 298–312, Macau SAR, China. Asia-Pacific Association for Machine Translation.
- Cite (Informal):
- Targeted Data Augmentation Improves Context-aware Neural Machine Translation (Gete et al., MTSummit 2023)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-5/2023.mtsummit-research.25.pdf