Abstract
This paper describes the ADAPT-DCU machine translation systems built for the WMT 2020 shared task on Similar Language Translation. We explored several NMT set-ups for the Croatian–Slovenian and Serbian–Slovenian language pairs in both translation directions. Our experiments focus on different amounts and types of training data: we first apply basic filtering to the OpenSubtitles training corpora, then perform additional cleaning of the remaining misaligned segments based on character n-gram matching. Finally, we make use of additional monolingual data by creating synthetic parallel data through back-translation. Automatic evaluation shows that multilingual systems trained on joint Serbian and Croatian data outperform bilingual ones, and that character-based cleaning improves scores while using less data. The results also confirm once more that adding back-translated data further improves performance, especially when the synthetic data is similar to the desired domain of the development and test sets. This, however, might come at the price of prolonged training time, especially for multitarget systems.

- Anthology ID: 2020.wmt-1.51
- Volume: Proceedings of the Fifth Conference on Machine Translation
- Month: November
- Year: 2020
- Address: Online
- Venue: WMT
- SIG: SIGMT
- Publisher: Association for Computational Linguistics
- Pages: 430–436
- URL: https://aclanthology.org/2020.wmt-1.51
- Cite (ACL): Maja Popović and Alberto Poncelas. 2020. Neural Machine Translation between similar South-Slavic languages. In Proceedings of the Fifth Conference on Machine Translation, pages 430–436, Online. Association for Computational Linguistics.
- Cite (Informal): Neural Machine Translation between similar South-Slavic languages (Popović & Poncelas, WMT 2020)
- PDF: https://preview.aclanthology.org/ingestion-script-update/2020.wmt-1.51.pdf