Marcin Szymański
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
The SRPOL team submission to WMT2025 introduces an innovative approach that uses the A* (A-star) algorithm for decoding in EuroLLM, which yields a diverse set of translation hypotheses. Subsequent reranking with Comet-QE and NLLB selects the best of the diverse hypotheses, giving a significant improvement in translation quality. The A* algorithm can be applied to decoding in any LLM or other translation model. The experiments show that, using free, openly accessible MT models, one can achieve the translation quality of the best online translators and LLMs with just a PC under one's desk.
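Neither the abstract nor this preview includes implementation details, but the general idea of A*-based n-best decoding followed by quality-estimation reranking can be illustrated with a minimal, self-contained Python sketch. Everything below is assumed for illustration: the toy language model TOY_LM, next_token_log_probs, the zero heuristic, and qe_score (a stand-in for a COMET-QE-style reranker) are hypothetical and are not EuroLLM's or SRPOL's actual code.

import heapq
import math

# Toy next-token distribution standing in for an LLM such as EuroLLM;
# a real system would query the model, this stub only keeps the sketch runnable.
TOY_LM = {
    "<s>":   {"hello": 1.0},
    "hello": {"world": 0.6, "there": 0.3, "</s>": 0.1},
    "world": {"</s>": 1.0},
    "there": {"</s>": 1.0},
}

def next_token_log_probs(prefix):
    last = prefix[-1]
    return {tok: math.log(p) for tok, p in TOY_LM.get(last, {"</s>": 1.0}).items()}

def heuristic(prefix, max_len):
    # Optimistic (admissible) estimate of the remaining cost: assume future
    # tokens could have probability 1, i.e. zero additional negative log-prob.
    return 0.0

def a_star_decode(max_len=10, n_best=3):
    """A* search over partial translations; returns up to n_best finished
    hypotheses as (cost, tokens), lowest cost (= highest log-prob) first."""
    frontier = [(0.0, ["<s>"], 0.0)]          # (g + h, sequence, g)
    finished = []
    while frontier and len(finished) < n_best:
        _, seq, g = heapq.heappop(frontier)
        if seq[-1] == "</s>" or len(seq) >= max_len:
            finished.append((g, seq))
            continue
        for tok, lp in next_token_log_probs(seq).items():
            new_seq, new_g = seq + [tok], g - lp      # lp <= 0, so cost grows
            heapq.heappush(frontier, (new_g + heuristic(new_seq, max_len),
                                      new_seq, new_g))
    return sorted(finished)

def qe_score(tokens):
    # Placeholder for a quality-estimation reranker such as COMET-QE;
    # preferring longer outputs just keeps the example self-contained.
    return len(tokens)

if __name__ == "__main__":
    candidates = a_star_decode()
    best = max(candidates, key=lambda c: qe_score(c[1]))
    print("candidates:", candidates)
    print("best after reranking:", best)

With an admissible heuristic of zero, the search reduces to best-first expansion by accumulated log-probability; a tighter heuristic would prune the frontier more aggressively while still returning the same n-best list to the reranker.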
This paper presents the system description of Samsung R&D Institute Poland's participation in the WMT 2022 General MT task for medium- and low-resource languages: Russian and Croatian. Our approach combines iterative noised/tagged back-translation and iterative distillation. We investigated different monolingual resources and compared their influence on the final translations. We used available BERT-like models for text classification and for extracting the domains of texts. Then, we prepared an ensemble of NMT models adapted to multiple domains. Finally, we attempted to predict ensemble weight vectors from the BERT-based domain classifications for individual sentences. Our final trained models reached quality comparable to the best online translators while using only limited, constrained resources during training.
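As a rough illustration of the final step, predicting per-sentence ensemble weight vectors from BERT-based domain classifications, here is a NumPy sketch. The three-domain/three-model setup, the mapping matrix W, and both helper functions are assumptions made for the example, not the paper's actual predictor.

import numpy as np

# Hypothetical setup: 3 domains and 3 domain-adapted NMT models.
# `domain_probs` stands in for the output of a BERT-based domain classifier
# for a single sentence; the mapping matrix W is something one would fit on
# held-out data, not a published SRPOL artifact.
N_DOMAINS, N_MODELS = 3, 3

def predict_ensemble_weights(domain_probs, W):
    """Map a sentence's domain distribution to a normalized weight vector
    over the domain-adapted NMT models."""
    logits = W @ domain_probs
    exp = np.exp(logits - logits.max())          # softmax for a valid simplex
    return exp / exp.sum()

def ensemble_next_token_probs(model_probs, weights):
    """Weighted mixture of the per-model next-token distributions."""
    return np.average(model_probs, axis=0, weights=weights)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(N_MODELS, N_DOMAINS))   # illustrative, untrained
    domain_probs = np.array([0.7, 0.2, 0.1])     # e.g. a news-heavy sentence
    weights = predict_ensemble_weights(domain_probs, W)

    vocab_size = 5
    model_probs = rng.dirichlet(np.ones(vocab_size), size=N_MODELS)
    mixed = ensemble_next_token_probs(model_probs, weights)
    print("ensemble weights:", weights)
    print("mixed distribution:", mixed, "sum =", mixed.sum())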
This paper describes the submission to the WAT 2021 Indic Language Multilingual Task by Samsung R&D Institute Poland. The task covered translation between 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu) and English. We combined a variety of techniques: transliteration, filtering, backtranslation, domain adaptation, knowledge distillation and, finally, ensembling of NMT models. We applied an effective approach to low-resource training that consists of pretraining on backtranslations and tuning on parallel corpora. We experimented with two different domain-adaptation techniques, which significantly improved translation quality when applied to monolingual corpora. We researched and applied a novel approach to finding the best hyperparameters for ensembling a number of translation models. All techniques combined gave a significant improvement of up to +8 BLEU over the baseline results. The quality of the models was confirmed by human evaluation, in which SRPOL models scored best for all 5 manually evaluated languages.
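The abstract does not spell out the hyperparameter-search procedure, so the sketch below shows only a generic baseline: random search over ensemble weight vectors scored by dev-set BLEU with sacreBLEU. The function translate_with_weights is a placeholder stub, and the tiny dev set is invented purely so the script runs end to end; none of this reproduces the novel method described in the paper.

import numpy as np
import sacrebleu   # pip install sacrebleu

def translate_with_weights(src_sentences, weights):
    """Placeholder: decode the dev set with the ensemble combined under
    `weights`. A real system would run the NMT ensemble here."""
    return [s.upper() for s in src_sentences]   # purely illustrative output

def search_ensemble_weights(src, refs, n_models=3, n_trials=20, seed=0):
    """Random search over ensemble weight vectors, scored by dev-set BLEU.
    This is a generic baseline, not the specific method from the paper."""
    rng = np.random.default_rng(seed)
    best_score, best_w = -1.0, None
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(n_models))     # random point on the simplex
        hyps = translate_with_weights(src, w)
        score = sacrebleu.corpus_bleu(hyps, [refs]).score
        if score > best_score:
            best_score, best_w = score, w
    return best_w, best_score

if __name__ == "__main__":
    src = ["namaste duniya", "yah ek udaharan hai"]          # invented dev set
    refs = ["NAMASTE DUNIYA", "YAH EK UDAHARAN HAI"]
    w, bleu = search_ensemble_weights(src, refs)
    print(f"best weights {w} with dev BLEU {bleu:.2f}")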
This paper describes the submission to the WMT20 shared news translation task by Samsung R&D Institute Poland. We submitted systems for six language directions: English to Czech, Czech to English, English to Polish, Polish to English, English to Inuktitut and Inuktitut to English. For each direction, we trained a single-direction model. However, the directions involving English, Polish and Czech were derived from a common multilingual base, which was later fine-tuned on each particular direction. For all translation directions, we used a similar training regime, with iterative improvement of the training corpora through back-translation and with model ensembling. For the En → Cs direction, we additionally leveraged document-level information by re-ranking the beam output with a separate model.
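The paper's combination of beam scores with the document-level model is not detailed here; a common recipe, sketched below under that assumption, is a linear interpolation of the sentence-level log-probability with a document-level score. The function doc_level_score is a deliberately simple lexical-overlap placeholder for the separate document-level model, and lam is a tunable interpolation weight assumed for the example.

import math

def doc_level_score(candidate, doc_context):
    """Placeholder for the separate document-level model: here we just
    reward lexical overlap with the preceding document context."""
    context_tokens = set(" ".join(doc_context).lower().split())
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    overlap = sum(tok in context_tokens for tok in cand_tokens)
    return math.log1p(overlap) / len(cand_tokens)

def rerank_beam(beam, doc_context, lam=0.3):
    """beam: list of (hypothesis, sentence_level_log_prob) from the NMT model.
    Returns the beam re-sorted by an interpolation of sentence-level and
    document-level scores; lam is the interpolation weight."""
    def combined(item):
        hyp, sent_score = item
        return (1.0 - lam) * sent_score + lam * doc_level_score(hyp, doc_context)
    return sorted(beam, key=combined, reverse=True)

if __name__ == "__main__":
    context = ["The summit discussed climate policy.",
               "Delegates agreed on emission targets."]
    beam = [("The conference talked about weather.", -1.10),
            ("The summit addressed climate policy goals.", -1.25)]
    for hyp, sent_score in rerank_beam(beam, context):
        print(f"{sent_score:.2f}  {hyp}")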