2025
ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT
Mikołaj Pokrywka | Wojciech Kusa | Mieszko Rutkowski | Mikołaj Koszowski
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Neural Machine Translation (NMT) has improved translation quality with Transformer-based models, but it still struggles with word ambiguity and context. This problem is especially important in domain-specific applications, which often involve ambiguous sentences or poor data quality. Our research explores how adding contextual information to models can improve translation of e-commerce data. To this end, we create ConECT, a new Czech-to-Polish e-commerce product translation dataset coupled with images and product metadata and consisting of 11,400 sentence pairs. We then investigate and compare different methods applicable to context-aware translation. We test a vision-language model (VLM), finding that visual context aids translation quality. Additionally, we explore the incorporation of contextual information into text-to-text models, such as the product’s category path or image descriptions. The results of our study demonstrate that incorporating contextual information improves machine translation quality. We make the new dataset publicly available.
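To illustrate the text-to-text variant of context-aware translation described in the abstract, the minimal Python sketch below prepends a product's category path to the source sentence before passing it to a seq2seq model. The input format, the tag names, and the checkpoint path are illustrative assumptions, not the exact setup from the ConECT paper.

# Minimal sketch: injecting product metadata (category path) into the
# input of a text-to-text translation model.  The prompt format and the
# checkpoint path are hypothetical, for illustration only.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "path/to/cs-pl-context-model"  # hypothetical fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def translate_with_context(source: str, category_path: str) -> str:
    # Contextual metadata is provided as plain text in front of the
    # sentence to be translated.
    model_input = f"category: {category_path} | sentence: {source}"
    inputs = tokenizer(model_input, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(translate_with_context(
    "Bezdrátová myš s tichým klikáním",   # Czech product title
    "Elektronika > Periferie > Myši",     # category-path context
))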
2022
Evaluation of Transfer Learning for Polish with a Text-to-Text Model
Aleksandra Chrabrowa | Łukasz Dragan | Karol Grzegorczyk | Dariusz Kajtoch | Mikołaj Koszowski | Robert Mroczkowski | Piotr Rybak
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We introduce a new benchmark for assessing the quality of text-to-text models for Polish. The benchmark consists of diverse tasks and datasets: the KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering. In particular, since summarization and question answering lack benchmark datasets for the Polish language, we describe their construction in detail and make them publicly available. Additionally, we present plT5, a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective. Unsupervised denoising pre-training is performed efficiently by initializing the model weights with its multilingual T5 (mT5) counterpart. We evaluate the performance of plT5, mT5, Polish BART (plBART), and Polish GPT-2 (papuGaPT2). plT5 scores best on all of these tasks except summarization, where plBART is best. In general (except for summarization), the larger the model, the better the results. The encoder-decoder architectures prove to be better than their decoder-only equivalents.
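A minimal sketch of the single text-to-text training objective mentioned above: every task is cast as "input text → output text" and trained with the same cross-entropy loss. It assumes the released plT5 checkpoint is available on the Hugging Face Hub under the identifier allegro/plt5-base; the task prefix and the example pair are illustrative.

# Minimal sketch: one fine-tuning step of plT5 in the text-to-text setup.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-base")  # assumed identifier
model = AutoModelForSeq2SeqLM.from_pretrained("allegro/plt5-base")

# One training example, phrased as text-to-text (here: en-pl translation).
source = "translate English to Polish: The weather is nice today."
target = "Dzisiaj jest ładna pogoda."

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
loss.backward()                             # a single fine-tuning step
print(float(loss))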
2021
Allegro.eu Submission to WMT21 News Translation Task
Mikołaj Koszowski | Karol Grzegorczyk | Tsimur Hadeliya
Proceedings of the Sixth Conference on Machine Translation
We submitted two uni-directional models, one for the English→Icelandic direction and the other for the Icelandic→English direction. Our news translation system is based on the transformer-big architecture; it makes use of corpora filtering, back-translation, and forward translation applied to parallel and monolingual data alike.
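A minimal sketch of the back-translation step used to augment the English→Icelandic training data: monolingual Icelandic text is translated into English with a reverse (is→en) model, yielding synthetic English→Icelandic pairs. The model identifier is an assumption; any Icelandic→English model can stand in for the paper's own reverse system.

# Minimal sketch: generating synthetic parallel data via back-translation.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

reverse_ckpt = "Helsinki-NLP/opus-mt-is-en"  # assumed is->en checkpoint
tokenizer = AutoTokenizer.from_pretrained(reverse_ckpt)
reverse_model = AutoModelForSeq2SeqLM.from_pretrained(reverse_ckpt)

monolingual_is = [
    "Veðrið er gott í dag.",
    "Þetta er setning á íslensku.",
]

synthetic_pairs = []
for target_sentence in monolingual_is:
    inputs = tokenizer(target_sentence, return_tensors="pt")
    out = reverse_model.generate(**inputs, num_beams=4, max_new_tokens=64)
    synthetic_source = tokenizer.decode(out[0], skip_special_tokens=True)
    # (synthetic English source, authentic Icelandic target) training pair
    synthetic_pairs.append((synthetic_source, target_sentence))

print(synthetic_pairs)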
2020
Samsung R&D Institute Poland submission to WMT20 News Translation Task
Mateusz Krubiński | Marcin Chochowski | Bartłomiej Boczek | Mikołaj Koszowski | Adam Dobrowolski | Marcin Szymański | Paweł Przybysz
Proceedings of the Fifth Conference on Machine Translation
This paper describes the submission to the WMT20 shared news translation task by Samsung R&D Institute Poland. We submitted systems for six language directions: English to Czech, Czech to English, English to Polish, Polish to English, English to Inuktitut, and Inuktitut to English. For each, we trained a single-direction model. However, the directions involving English, Polish, and Czech were derived from a common multilingual base, which was later fine-tuned on each particular direction. For all the translation directions, we used a similar training regime, with iterative improvement of the training corpora through back-translation and model ensembling. For the En→Cs direction, we additionally leveraged document-level information by re-ranking the beam output with a separate model.
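A minimal sketch of the kind of n-best re-ranking mentioned for the En→Cs direction: candidates from beam search are re-scored with a separate model that sees document-level context, and the final order is a weighted combination of both scores. The scoring function, the interpolation weight, and the toy scorer are illustrative assumptions, not the paper's actual re-ranker.

# Minimal sketch: re-ranking beam-search candidates with a document-level score.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    mt_score: float  # log-probability from the sentence-level NMT model

def rerank(candidates: list[Candidate], doc_context: str,
           doc_score_fn, alpha: float = 0.5) -> list[Candidate]:
    # Sort by alpha * mt_score + (1 - alpha) * document-level score.
    def combined(c: Candidate) -> float:
        return alpha * c.mt_score + (1.0 - alpha) * doc_score_fn(doc_context, c.text)
    return sorted(candidates, key=combined, reverse=True)

# Toy usage with a dummy document-level scorer (prefers shorter hypotheses).
nbest = [Candidate("Krátká věta.", -1.2), Candidate("Delší kandidátní věta.", -1.5)]
best = rerank(nbest, doc_context="...", doc_score_fn=lambda ctx, hyp: -len(hyp) / 100)[0]
print(best.text)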