Marcin Szymański
2022
Samsung R&D Institute Poland Participation in WMT 2022
Adam Dobrowolski | Mateusz Klimaszewski | Adam Myśliwy | Marcin Szymański | Jakub Kowalski | Kornelia Szypuła | Paweł Przewłocki | Paweł Przybysz
Proceedings of the Seventh Conference on Machine Translation (WMT)
This paper presents the system description of Samsung R&D Institute Poland's participation in the WMT 2022 General MT task for medium- and low-resource languages: Russian and Croatian. Our approach combines iterative noised/tagged back-translation and iterative distillation. We investigated different monolingual resources and compared their influence on the final translations. We used available BERT-like models for text classification and for extracting the domains of texts. We then prepared an ensemble of NMT models adapted to multiple domains. Finally, we attempted to predict ensemble weight vectors from the BERT-based domain classifications of individual sentences. Our final trained models reached quality comparable to the best online translators while using only limited constrained resources during training.
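The per-sentence weighting step can be illustrated with a minimal sketch: a domain classifier yields a probability vector for each source sentence, which is mapped to a weight vector over the domain-adapted models before their token distributions are mixed at each decoding step. The mapping matrix, function names, and toy numbers below are hypothetical illustrations, not taken from the paper.

```python
import numpy as np

def predict_ensemble_weights(domain_probs: np.ndarray, mapping: np.ndarray) -> np.ndarray:
    """Map a per-sentence domain distribution to per-model ensemble weights.

    domain_probs: shape (n_domains,), output of a BERT-style domain classifier.
    mapping:      shape (n_domains, n_models); assumed tuned on a dev set so
                  that each domain prefers the models adapted to it.
    """
    weights = domain_probs @ mapping        # raw, possibly unnormalized scores
    weights = np.clip(weights, 1e-9, None)  # keep weights strictly positive
    return weights / weights.sum()          # normalize to a weight vector

def ensemble_step(token_log_probs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Combine one decoding step from several models.

    token_log_probs: shape (n_models, vocab_size), per-model log-probabilities.
    Returns the log of the weighted mixture of the model distributions.
    """
    probs = np.exp(token_log_probs)  # (n_models, vocab)
    mixed = weights @ probs          # weighted average over the models
    return np.log(mixed)

# Toy example: 3 domains, 2 domain-adapted models, vocabulary of 5 tokens.
rng = np.random.default_rng(0)
domain_probs = np.array([0.7, 0.2, 0.1])          # classifier output for one sentence
mapping = np.array([[0.9, 0.1],                   # domain 0 -> mostly model 0
                    [0.2, 0.8],                   # domain 1 -> mostly model 1
                    [0.5, 0.5]])                  # domain 2 -> no preference
w = predict_ensemble_weights(domain_probs, mapping)
step = np.log(rng.dirichlet(np.ones(5), size=2))  # fake per-model distributions
print(w, ensemble_step(step, w))
```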
2021
Samsung R&D Institute Poland submission to WAT 2021 Indic Language Multilingual Task
Adam Dobrowolski | Marcin Szymański | Marcin Chochowski | Paweł Przybysz
Proceedings of the 8th Workshop on Asian Translation (WAT2021)
This paper describes the submission to the WAT 2021 Indic Language Multilingual Task by Samsung R&D Institute Poland. The task covered translation between 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu) and English. We combined a variety of techniques: transliteration, filtering, backtranslation, domain adaptation, knowledge distillation and, finally, ensembling of NMT models. We applied an effective approach to low-resource training that consists of pretraining on backtranslations and tuning on parallel corpora. We experimented with two different domain-adaptation techniques which significantly improved translation quality when applied to monolingual corpora. We researched and applied a novel approach to finding the best hyperparameters for ensembling a number of translation models. All techniques combined gave a significant improvement of up to +8 BLEU over baseline results. The quality of the models was confirmed by human evaluation, in which SRPOL models scored best for all 5 manually evaluated languages.
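The paper's own hyperparameter-search method is not detailed here, so the sketch below shows only a generic random-search baseline for ensemble weights, scored by dev-set BLEU. `decode_fn` and `bleu_fn` are placeholder stubs standing in for real ensemble decoding and a real BLEU implementation (e.g. sacrebleu), not the authors' code.

```python
import numpy as np

def search_ensemble_weights(decode_fn, dev_src, dev_ref, bleu_fn,
                            n_models: int, n_trials: int = 100, seed: int = 0):
    """Random search over ensemble weight vectors, scored by dev-set BLEU.

    decode_fn(weights, sentences) is assumed to run weighted ensemble decoding
    and return translations; bleu_fn(hyps, refs) is assumed to return a
    corpus-level score. Both are placeholders in this sketch.
    """
    rng = np.random.default_rng(seed)
    best_w, best_score = None, -1.0
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(n_models))  # random point on the simplex
        score = bleu_fn(decode_fn(w, dev_src), dev_ref)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Toy usage with stub functions (replace with real decoding and scoring):
stub_decode = lambda w, src: [f"hyp-{int(np.argmax(w))}" for _ in src]
stub_bleu = lambda hyps, refs: sum(h == r for h, r in zip(hyps, refs)) / len(refs)
w, b = search_ensemble_weights(stub_decode, ["a", "b"], ["hyp-0", "hyp-0"],
                               stub_bleu, n_models=3)
print(w, b)
```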
2020
Samsung R&D Institute Poland submission to WMT20 News Translation Task
Mateusz Krubiński | Marcin Chochowski | Bartłomiej Boczek | Mikołaj Koszowski | Adam Dobrowolski | Marcin Szymański | Paweł Przybysz
Proceedings of the Fifth Conference on Machine Translation
This paper describes the submission to the WMT20 shared news translation task by Samsung R&D Institute Poland. We submitted systems for six language directions: English to Czech, Czech to English, English to Polish, Polish to English, English to Inuktitut and Inuktitut to English. For each direction, we trained a single-direction model. However, the directions involving English, Polish and Czech were derived from a common multilingual base, which was later fine-tuned on each particular direction. For all the translation directions, we used a similar training regime, with iterative improvement of the training corpora through back-translation, and with model ensembling. For the En → Cs direction, we additionally leveraged document-level information by re-ranking the beam output with a separate model.
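The document-level re-ranking step admits a minimal sketch: each hypothesis in the n-best list carries its sentence-level NMT score, which is interpolated with a score from a separate document-level model conditioned on the preceding context. The interpolation form, the `doc_score_fn` placeholder, and the toy data are assumptions for illustration; the paper does not specify the exact scoring model here.

```python
from typing import List, Tuple

def rerank_nbest(nbest: List[Tuple[str, float]],
                 doc_score_fn, doc_context: List[str],
                 alpha: float = 0.5) -> str:
    """Re-rank an n-best list with a document-level score.

    nbest:        (hypothesis, nmt_log_prob) pairs from beam search.
    doc_score_fn: placeholder returning a log-probability of the hypothesis
                  given the preceding document context.
    alpha:        interpolation weight between sentence and document scores.
    """
    def combined(item: Tuple[str, float]) -> float:
        hyp, nmt_score = item
        return (1 - alpha) * nmt_score + alpha * doc_score_fn(doc_context, hyp)
    return max(nbest, key=combined)[0]

# Toy usage with a stub document model that simply prefers shorter hypotheses:
stub_doc_score = lambda ctx, hyp: -float(len(hyp))
nbest = [("a long hypothesis", -1.2), ("short", -1.5)]
print(rerank_nbest(nbest, stub_doc_score, ["previous sentence"]))  # -> "short"
```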