Neural Machine Translation between similar South-Slavic languages

Maja Popović, Alberto Poncelas


Abstract
This paper describes the ADAPT-DCU machine translation systems built for the WMT 2020 shared task on Similar Language Translation. We explored several NMT set-ups for the Croatian–Slovenian and Serbian–Slovenian language pairs in both translation directions. Our experiments focus on different amounts and types of training data: we first apply basic filtering to the OpenSubtitles training corpora, then we perform additional cleaning of the remaining misaligned segments based on character n-gram matching. Finally, we make use of additional monolingual data by creating synthetic parallel data through back-translation. Automatic evaluation shows that multilingual systems trained on joint Serbian and Croatian data outperform bilingual systems, and that character-based cleaning improves scores while using less data. The results also confirm once more that adding back-translated data further improves performance, especially when the synthetic data is similar to the desired domain of the development and test sets. This, however, might come at the price of longer training time, especially for multitarget systems.
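The character n-gram cleaning step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that, because the languages are closely related, correctly aligned segments share many character n-grams, so pairs with low overlap are likely misaligned. The function names, the trigram order, the F1-style score, and the 0.2 threshold are all hypothetical choices for illustration; the abstract does not state the exact metric or cut-off the authors used.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, whitespace-normalised string."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def ngram_f1(src, tgt, n=3):
    """F1 overlap of character n-grams between two segments (illustrative score)."""
    a, b = char_ngrams(src, n), char_ngrams(tgt, n)
    if not a or not b:
        return 0.0  # a segment shorter than n characters has no n-grams
    matched = sum((a & b).values())  # multiset intersection of n-gram counts
    if matched == 0:
        return 0.0
    precision = matched / sum(b.values())
    recall = matched / sum(a.values())
    return 2 * precision * recall / (precision + recall)


def filter_pairs(pairs, threshold=0.2, n=3):
    """Keep segment pairs whose overlap reaches the (assumed) threshold."""
    return [(s, t) for s, t in pairs if ngram_f1(s, t, n) >= threshold]


if __name__ == "__main__":
    pairs = [
        ("Dobar dan, kako si?", "Dober dan, kako si?"),  # plausibly aligned
        ("Dobar dan, kako si?", "Zunaj pada dež."),      # plausibly misaligned
    ]
    for src, tgt in filter_pairs(pairs):
        print(src, "|||", tgt)
```

In practice the threshold would need tuning on held-out data, since a cut-off that is too strict also discards valid but freely translated subtitle pairs.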
Anthology ID:
2020.wmt-1.51
Volume:
Proceedings of the Fifth Conference on Machine Translation
Month:
November
Year:
2020
Address:
Online
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
430–436
URL:
https://aclanthology.org/2020.wmt-1.51
Cite (ACL):
Maja Popović and Alberto Poncelas. 2020. Neural Machine Translation between similar South-Slavic languages. In Proceedings of the Fifth Conference on Machine Translation, pages 430–436, Online. Association for Computational Linguistics.
Cite (Informal):
Neural Machine Translation between similar South-Slavic languages (Popović & Poncelas, WMT 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.wmt-1.51.pdf