High Frequent In-domain Words Segmentation and Forward Translation for the WMT21 Biomedical Task

Bardia Rafieian; Marta R. Costa-jussà

Abstract

This paper reports on optimizing the use of out-of-domain data in the biomedical translation task. We first optimized our parallel training dataset using in-domain terminology from BabelNet. Then, to enlarge the training set, we studied the effect of out-of-domain data on biomedical translation, created a mixture of in-domain and out-of-domain training sets, and added more in-domain data through forward translation for the English-Spanish task. Finally, with a simple BPE optimization method, we increased the number of in-domain subwords in the mixed training set and trained a Transformer model on the resulting data. Results show improvements with the proposed method.

Anthology ID: 2021.wmt-1.87
Volume: Proceedings of the Sixth Conference on Machine Translation
Month: November
Year: 2021
Address: Online
Venue: WMT
SIG: SIGMT
Publisher: Association for Computational Linguistics
Pages: 863–867
URL: https://aclanthology.org/2021.wmt-1.87
Cite (ACL): Bardia Rafieian and Marta R. Costa-jussà. 2021. High Frequent In-domain Words Segmentation and Forward Translation for the WMT21 Biomedical Task. In Proceedings of the Sixth Conference on Machine Translation, pages 863–867, Online. Association for Computational Linguistics.
Cite (Informal): High Frequent In-domain Words Segmentation and Forward Translation for the WMT21 Biomedical Task (Rafieian & Costa-jussà, WMT 2021)
PDF: https://preview.aclanthology.org/nodalida-main-page/2021.wmt-1.87.pdf
