Abstract
In this paper, we investigate the driving factors behind concatenation, a simple but effective data augmentation method for low-resource neural machine translation. Our experiments suggest that discourse context is unlikely to be the cause of concatenation improving BLEU by about +1 across four language pairs. Instead, we demonstrate that the improvement comes from three other factors unrelated to discourse: context diversity, length diversity, and (to a lesser extent) position shifting.
- Anthology ID:
- 2021.iwslt-1.33
- Volume:
- Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
- Month:
- August
- Year:
- 2021
- Address:
- Bangkok, Thailand (online)
- Editors:
- Marcello Federico, Alex Waibel, Marta R. Costa-jussà, Jan Niehues, Sebastian Stüker, Elizabeth Salesky
- Venue:
- IWSLT
- SIG:
- SIGSLT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 287–293
- URL:
- https://aclanthology.org/2021.iwslt-1.33
- DOI:
- 10.18653/v1/2021.iwslt-1.33
- Cite (ACL):
- Toan Q. Nguyen, Kenton Murray, and David Chiang. 2021. Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 287–293, Bangkok, Thailand (online). Association for Computational Linguistics.
- Cite (Informal):
- Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution (Nguyen et al., IWSLT 2021)
- PDF:
- https://preview.aclanthology.org/jeptaln-2024-ingestion/2021.iwslt-1.33.pdf