2022
Samsung R&D Institute Poland Participation in WMT 2022
Adam Dobrowolski | Mateusz Klimaszewski | Adam Myśliwy | Marcin Szymański | Jakub Kowalski | Kornelia Szypuła | Paweł Przewłocki | Paweł Przybysz
Proceedings of the Seventh Conference on Machine Translation (WMT)
This paper presents the system description of Samsung R&D Institute Poland's participation in WMT 2022, a General MT solution for the medium- and low-resource languages Russian and Croatian. Our approach combines iterative noised/tagged back-translation and iterative distillation. We investigated different monolingual resources and compared their influence on the final translations. We used available BERT-like models for text classification and for extracting the domains of texts. We then prepared an ensemble of NMT models adapted to multiple domains. Finally, we attempted to predict ensemble weight vectors from the BERT-based domain classifications for individual sentences. Our final trained models reached quality comparable to the best online translators while using only limited constrained resources during training.
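As a rough illustration of the last step, the sketch below maps a BERT-style domain classifier's probabilities to a per-sentence ensemble weight vector. The checkpoint name, the three-domain setup and the `DOMAIN_WEIGHTS` table are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch: derive per-sentence ensemble weights from a BERT-based
# domain classifier. Checkpoint, domains and weights are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
# In practice the classifier head would be fine-tuned on domain labels.
classifier = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3)  # e.g. news / web / spoken

# Hypothetical weight vectors: one row per domain, one entry per
# domain-adapted NMT model in the ensemble (each row sums to 1).
DOMAIN_WEIGHTS = torch.tensor([
    [0.6, 0.2, 0.2],   # news-adapted model dominates
    [0.2, 0.6, 0.2],   # web-adapted model dominates
    [0.2, 0.2, 0.6],   # spoken-adapted model dominates
])

def ensemble_weights(sentence: str) -> torch.Tensor:
    """Soft-mix the per-domain weight rows by the classifier's
    domain probabilities for this sentence."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = classifier(**inputs).logits.softmax(dim=-1)  # (1, 3)
    return probs @ DOMAIN_WEIGHTS  # (1, n_models)
```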
2021
Samsung R&D Institute Poland submission to WAT 2021 Indic Language Multilingual Task
Adam Dobrowolski | Marcin Szymański | Marcin Chochowski | Paweł Przybysz
Proceedings of the 8th Workshop on Asian Translation (WAT2021)
This paper describes the submission to the WAT 2021 Indic Language Multilingual Task by Samsung R&D Institute Poland. The task covered translation between 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu) and English. We combined a variety of techniques: transliteration, filtering, back-translation, domain adaptation, knowledge distillation and, finally, ensembling of NMT models. We applied an effective approach to low-resource training that consists of pretraining on back-translations and tuning on parallel corpora. We experimented with two different domain-adaptation techniques which significantly improved translation quality when applied to monolingual corpora. We researched and applied a novel approach for finding the best hyperparameters for ensembling a number of translation models. All techniques combined gave a significant improvement of up to +8 BLEU over the baseline results. The quality of the models was confirmed by human evaluation, in which SRPOL models scored best for all 5 manually evaluated languages.
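The abstract does not spell out the ensembling hyperparameter search, but a minimal sketch of the general idea, assuming a shared n-best list per sentence and plain random search scored with sacreBLEU, could look like this:

```python
# Minimal sketch: random search over ensemble weights, scored with
# sacreBLEU on a dev set. This stands in for the paper's actual procedure.
import random
import sacrebleu

def rerank(nbest, model_scores, weights):
    """nbest[i][k]: k-th candidate translation for sentence i;
    model_scores[m][i][k]: log-prob of that candidate under model m."""
    picked = []
    for i, cands in enumerate(nbest):
        best_k = max(range(len(cands)),
                     key=lambda k: sum(w * model_scores[m][i][k]
                                       for m, w in enumerate(weights)))
        picked.append(cands[best_k])
    return picked

def search_weights(nbest, model_scores, refs, trials=500, seed=0):
    rng = random.Random(seed)
    best_w, best_bleu = None, -1.0
    for _ in range(trials):
        w = [rng.random() for _ in model_scores]
        total = sum(w)
        w = [x / total for x in w]                    # normalize weights
        bleu = sacrebleu.corpus_bleu(
            rerank(nbest, model_scores, w), [refs]).score
        if bleu > best_bleu:
            best_w, best_bleu = w, bleu
    return best_w, best_bleu
```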
2020
Samsung R&D Institute Poland submission to WMT20 News Translation Task
Mateusz Krubiński | Marcin Chochowski | Bartłomiej Boczek | Mikołaj Koszowski | Adam Dobrowolski | Marcin Szymański | Paweł Przybysz
Proceedings of the Fifth Conference on Machine Translation
This paper describes the submission to the WMT20 shared news translation task by Samsung R&D Institute Poland. We submitted systems for six language directions: English to Czech, Czech to English, English to Polish, Polish to English, English to Inuktitut and Inuktitut to English. For each, we trained a single-direction model. However, the directions involving English, Polish and Czech were derived from a common multilingual base, which was later fine-tuned on each particular direction. For all the translation directions, we used a similar training regime, with iterative improvement of the training corpora through back-translation, and with model ensembling. For the En → Cs direction, we additionally leveraged document-level information by re-ranking the beam output with a separate model.
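A minimal sketch of document-level re-ranking in this spirit, with a hypothetical `doc_score` function standing in for the paper's separate document model:

```python
# Minimal sketch of document-level reranking: beam candidates are rescored
# in the context of previously chosen sentences. `doc_score` and the
# interpolation weight `alpha` are illustrative assumptions.
def rerank_document(beams, doc_score, alpha=0.5):
    """beams: per-sentence lists of (candidate, mt_score) pairs;
    doc_score(context, candidate): document-level score of the candidate
    given the translations chosen so far."""
    context, output = [], []
    for cands in beams:
        best = max(cands, key=lambda c: (1 - alpha) * c[1]
                                        + alpha * doc_score(context, c[0]))
        output.append(best[0])
        context.append(best[0])  # chosen sentence extends the document context
    return output
```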
SRPOL’s System for the IWSLT 2020 End-to-End Speech Translation Task
Tomasz Potapczyk | Pawel Przybysz
Proceedings of the 17th International Conference on Spoken Language Translation
We took part in the offline End-to-End English to German TED lectures translation task. We based our solution on our last year's submission. We used a slightly altered Transformer architecture with a ResNet-like convolutional layer preparing the audio input for the Transformer encoder. To improve the model's translation quality, we introduced two regularization techniques and trained on a machine-translated LibriSpeech corpus in addition to the iwslt-corpus, TEDLIUM2 and MuST-C corpora. Our best model scored almost 3 BLEU higher than last year's model. To segment the 2020 test set we used exactly the same procedure as last year.
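A minimal PyTorch sketch of a ResNet-like convolutional front-end feeding a Transformer encoder; all layer sizes are illustrative assumptions, not the paper's architecture:

```python
# Minimal sketch: residual convolutional front-end that downsamples mel
# features before a Transformer encoder. Sizes are illustrative only.
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        # ResNet-style skip connection around two conv layers
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class AudioFrontend(nn.Module):
    """Mel spectrogram (B, 1, T, n_mels) -> sequence of d_model vectors."""
    def __init__(self, n_mels=80, d_model=512):
        super().__init__()
        self.stem = nn.Conv2d(1, 64, 3, stride=2, padding=1)  # downsample
        self.block = ResidualConvBlock(64)
        self.proj = nn.Linear(64 * ((n_mels + 1) // 2), d_model)

    def forward(self, x):                      # x: (B, 1, T, F)
        h = torch.relu(self.stem(x))           # (B, 64, T/2, F/2)
        h = self.block(h)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.proj(h)                    # (B, T/2, d_model)
```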
2019
Samsung and University of Edinburgh’s System for the IWSLT 2019
Joanna Wetesko | Marcin Chochowski | Pawel Przybysz | Philip Williams | Roman Grundkiewicz | Rico Sennrich | Barry Haddow | Antonio Valerio Miceli Barone | Alexandra Birch
Proceedings of the 16th International Conference on Spoken Language Translation
This paper describes the joint submission to the IWSLT 2019 English to Czech task by Samsung R&D Institute, Poland, and the University of Edinburgh. Our submission was ultimately produced by combining four Transformer systems through a mixture of ensembling and reranking.
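One common way to ensemble Transformer systems is to average the models' next-token log-probabilities at each decoding step; the sketch below shows that idea with greedy decoding, as an assumption rather than the paper's exact procedure:

```python
# Minimal sketch of decode-time ensembling: average next-token
# log-probabilities across models. Assumes each model maps
# (src, target_prefix) -> logits of shape (B, T, V) over a shared vocab;
# greedy search stands in for the beam search a real system would use.
import torch

@torch.no_grad()
def ensemble_greedy_decode(models, src, bos_id, eos_id, max_len=256):
    ys = torch.tensor([[bos_id]])
    for _ in range(max_len):
        # next-token distribution under each model, then averaged
        step = torch.stack([m(src, ys)[:, -1].log_softmax(-1)
                            for m in models])
        next_id = step.mean(dim=0).argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return ys.squeeze(0)
```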
Samsung’s System for the IWSLT 2019 End-to-End Speech Translation Task
Tomasz Potapczyk | Pawel Przybysz | Marcin Chochowski | Artur Szumaczuk
Proceedings of the 16th International Conference on Spoken Language Translation
This paper describes the submission to the IWSLT 2019 End-to-End speech translation task by Samsung R&D Institute, Poland. We decided to focus on end-to-end English to German TED lectures translation and did not provide any submission for the other speech tasks. We used a slightly altered Transformer architecture with a standard convolutional layer preparing the audio input for the Transformer encoder. Additionally, we propose an audio segmentation algorithm that maximizes the BLEU score on the tst2015 test set.
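The abstract gives no details of the segmentation algorithm; a minimal sketch of the underlying idea of tuning segmentation against BLEU, with hypothetical `segment` and `translate` functions, might look like this:

```python
# Minimal sketch of tuning a segmentation hyperparameter against BLEU.
# `segment`, `translate` and the threshold grid are assumptions standing
# in for the paper's actual algorithm, which the abstract does not give.
import sacrebleu

def tune_segmentation(audio, references, segment, translate,
                      thresholds=(0.2, 0.3, 0.4, 0.5)):
    """segment(audio, t): chunk audio using silence threshold t;
    translate(chunks): one hypothesis per chunk. Assumes hypotheses are
    re-aligned so their count matches the reference segmentation."""
    best_t, best_bleu = None, -1.0
    for t in thresholds:
        hyps = translate(segment(audio, t))
        bleu = sacrebleu.corpus_bleu(hyps, [references]).score
        if bleu > best_bleu:
            best_t, best_bleu = t, bleu
    return best_t, best_bleu
```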
2018
Samsung and University of Edinburgh’s System for the IWSLT 2018 Low Resource MT Task
Philip Williams | Marcin Chochowski | Pawel Przybysz | Rico Sennrich | Barry Haddow | Alexandra Birch
Proceedings of the 15th International Conference on Spoken Language Translation
This paper describes the joint submission to the IWSLT 2018 Low Resource MT task by Samsung R&D Institute, Poland, and the University of Edinburgh. We focused on supplementing the very limited in-domain Basque-English training data with out-of-domain data, with synthetic data, and with data for other language pairs. We also experimented with a variety of model architectures and features, which included the development of extensions to the Nematus toolkit. Our submission was ultimately produced by a system combination in which we reranked translations from our strongest individual system using multiple weaker systems.
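A minimal sketch of the synthetic-data step via back-translation; `reverse_translate` is a hypothetical target-to-source model, not the paper's system:

```python
# Minimal sketch of synthetic parallel data via back-translation: a
# reverse (target-to-source) model turns monolingual target-language text
# into synthetic source sides. `reverse_translate` is a hypothetical hook.
def back_translate(mono_target_sentences, reverse_translate):
    """Pair each monolingual target sentence with a synthetic source."""
    synthetic = []
    for tgt in mono_target_sentences:
        src = reverse_translate(tgt)        # synthetic source side
        synthetic.append((src, tgt))        # train on (synthetic, real)
    return synthetic
```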
2017
The Samsung and University of Edinburgh’s submission to IWSLT17
Pawel Przybysz | Marcin Chochowski | Rico Sennrich | Barry Haddow | Alexandra Birch
Proceedings of the 14th International Conference on Spoken Language Translation
This paper describes the joint submission of Samsung Research and Development, Warsaw, Poland, and the University of Edinburgh team to the IWSLT MT task for TED talks. We took part in two translation directions, en-de and de-en. We also participated in the en-de and de-en lectures SLT task. The models were trained as attentional encoder-decoder systems using the BiDeep model in Nematus. We filtered the training data to reduce the problem of noisy data, and we used back-translated monolingual data for domain adaptation. We demonstrate the effectiveness of the different techniques that we applied via ablation studies. Our submission system outperforms both our baseline and last year's University of Edinburgh submission to IWSLT by more than 5 BLEU.
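A minimal sketch of the kind of noise filtering described, using illustrative length-ratio thresholds rather than the paper's actual criteria:

```python
# Minimal sketch of noisy-pair filtering: drop sentence pairs with empty
# sides or implausible length ratios. The threshold is illustrative; the
# paper's actual filtering criteria are not given in the abstract.
def keep_pair(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False                         # reject empty sides
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_ratio                # reject skewed lengths

pairs = [("ein Beispielsatz", "an example sentence"),
         ("noise", "a very long and clearly unrelated target sentence")]
filtered = [p for p in pairs if keep_pair(*p)]  # keeps only the first pair
```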