High Frequent In-domain Words Segmentation and Forward Translation for the WMT21 Biomedical Task

Bardia Rafieian, Marta R. Costa-jussa


Abstract
This paper reports on optimizing the use of out-of-domain data in the biomedical translation task. We first optimized our parallel training dataset using BabelNet in-domain terminology. Then, to enlarge the training set, we studied the effect of out-of-domain data on biomedical translation, created a mixture of in-domain and out-of-domain training sets, and added more in-domain data using forward translation in the English-Spanish task. Finally, with a simple BPE optimization method, we increased the number of in-domain sub-words in our mixed training set and trained a Transformer model on the generated data. Results show improvements with our proposed method.
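The data-augmentation step summarized in the abstract (forward translation of in-domain text, then mixing with out-of-domain data) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the `translate` callable standing in for a trained English-Spanish model, the ordering of the mixture, and the deduplication step are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def forward_translate(sources: Iterable[str],
                      translate: Callable[[str], str]) -> List[Pair]:
    """Pair each in-domain source sentence with its machine translation.

    `translate` is a hypothetical stand-in for an existing MT system.
    """
    pairs: List[Pair] = []
    for src in sources:
        src = src.strip()
        if not src:
            continue  # skip empty lines
        pairs.append((src, translate(src)))
    return pairs


def build_training_set(in_domain: List[Pair],
                       out_of_domain: List[Pair],
                       synthetic: List[Pair]) -> List[Pair]:
    """Concatenate the three sources and drop exact duplicate pairs.

    Deduplication is an assumption of this sketch, not a step taken
    from the paper. First occurrences are kept, in the order
    in-domain, synthetic, out-of-domain.
    """
    seen = set()
    mixed: List[Pair] = []
    for pair in [*in_domain, *synthetic, *out_of_domain]:
        if pair not in seen:
            seen.add(pair)
            mixed.append(pair)
    return mixed
```

In practice `translate` would wrap a baseline model trained on the available parallel data, so the quality of the synthetic pairs depends on that baseline.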
Anthology ID:
2021.wmt-1.87
Volume:
Proceedings of the Sixth Conference on Machine Translation
Month:
November
Year:
2021
Address:
Online
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
863–867
URL:
https://aclanthology.org/2021.wmt-1.87
Cite (ACL):
Bardia Rafieian and Marta R. Costa-jussa. 2021. High Frequent In-domain Words Segmentation and Forward Translation for the WMT21 Biomedical Task. In Proceedings of the Sixth Conference on Machine Translation, pages 863–867, Online. Association for Computational Linguistics.
Cite (Informal):
High Frequent In-domain Words Segmentation and Forward Translation for the WMT21 Biomedical Task (Rafieian & Costa-jussa, WMT 2021)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2021.wmt-1.87.pdf