Abstract
This paper reports on optimizing the use of out-of-domain data in the biomedical translation task. We first optimized our parallel training dataset using in-domain terminology from BabelNet. Then, to enlarge the training set, we studied the effect of out-of-domain data on biomedical translation: we created a mixture of in-domain and out-of-domain training sets and added more in-domain data through forward translation for the English-Spanish task. Finally, using a simple BPE optimization method, we increased the number of in-domain sub-words in the mixed training set and trained a Transformer model on the resulting data. Results show improvements using our proposed method.
- Anthology ID:
- 2021.wmt-1.87
- Volume:
- Proceedings of the Sixth Conference on Machine Translation
- Month:
- November
- Year:
- 2021
- Address:
- Online
- Venue:
- WMT
- SIG:
- SIGMT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 863–867
- URL:
- https://aclanthology.org/2021.wmt-1.87
- Cite (ACL):
- Bardia Rafieian and Marta R. Costa-jussa. 2021. High Frequent In-domain Words Segmentation and Forward Translation for the WMT21 Biomedical Task. In Proceedings of the Sixth Conference on Machine Translation, pages 863–867, Online. Association for Computational Linguistics.
- Cite (Informal):
- High Frequent In-domain Words Segmentation and Forward Translation for the WMT21 Biomedical Task (Rafieian & Costa-jussa, WMT 2021)
- PDF:
- https://preview.aclanthology.org/nodalida-main-page/2021.wmt-1.87.pdf