Rakuten’s Participation in WAT 2022: Parallel Dataset Filtering by Leveraging Vocabulary Heterogeneity
Alberto Poncelas, Johanes Effendi, Ohnmar Htun, Sunil Yadav, Dongzhe Wang, Saurabh Jain
Abstract
This paper describes the participation of our neural machine translation system in the WAT 2022 shared translation task (team ID: sakura). We participated in the Parallel Data Filtering Task. Our approach, based on Feature Decay Algorithms (FDA), achieved gains of +1.4 and +2.4 BLEU points for English-to-Japanese and Japanese-to-English, respectively, over a model trained on the full dataset, showing the effectiveness of FDA for in-domain data selection.
- Anthology ID:
- 2022.wat-1.7
- Volume:
- Proceedings of the 9th Workshop on Asian Translation
- Month:
- October
- Year:
- 2022
- Address:
- Gyeongju, Republic of Korea
- Venue:
- WAT
- Publisher:
- International Conference on Computational Linguistics
- Pages:
- 68–72
- URL:
- https://aclanthology.org/2022.wat-1.7
- Cite (ACL):
- Alberto Poncelas, Johanes Effendi, Ohnmar Htun, Sunil Yadav, Dongzhe Wang, and Saurabh Jain. 2022. Rakuten’s Participation in WAT 2022: Parallel Dataset Filtering by Leveraging Vocabulary Heterogeneity. In Proceedings of the 9th Workshop on Asian Translation, pages 68–72, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
- Cite (Informal):
- Rakuten’s Participation in WAT 2022: Parallel Dataset Filtering by Leveraging Vocabulary Heterogeneity (Poncelas et al., WAT 2022)
- PDF:
- https://preview.aclanthology.org/naacl-24-ws-corrections/2022.wat-1.7.pdf
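The abstract's data-filtering method is based on Feature Decay Algorithms (FDA), a greedy selection scheme that scores candidate sentences by their n-gram overlap with an in-domain seed text and decays the weight of each n-gram every time it appears in an already-selected sentence. The sketch below is a minimal, generic illustration of that idea, not the paper's actual implementation; the function names, the bigram feature set, the length normalization, and the decay factor of 0.5 are all assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n_max=2):
    """All n-grams of the token list up to length n_max (assumed feature set)."""
    feats = []
    for n in range(1, n_max + 1):
        feats += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats


def fda_select(candidates, seed_text, k, decay=0.5, n_max=2):
    """Greedily pick k sentences from `candidates` whose n-grams overlap
    `seed_text`, halving a feature's contribution each time it has already
    been covered by a selected sentence (the 'feature decay')."""
    seed_feats = set(ngrams(seed_text.split(), n_max))
    counts = Counter()          # how often each seed feature was selected
    selected, pool = [], list(candidates)
    for _ in range(min(k, len(pool))):
        best, best_score = None, float("-inf")
        for sent in pool:
            toks = sent.split()
            feats = [f for f in ngrams(toks, n_max) if f in seed_feats]
            # Decayed overlap score, normalized by sentence length.
            score = sum(decay ** counts[f] for f in feats) / (len(toks) or 1)
            if score > best_score:
                best, best_score = sent, score
        pool.remove(best)
        selected.append(best)
        for f in ngrams(best.split(), n_max):
            if f in seed_feats:
                counts[f] += 1
    return selected
```

Because covered n-grams lose weight after each selection, the method favors sentences that add *new* in-domain vocabulary rather than repeating what is already covered, which is the vocabulary-heterogeneity property the paper's title refers to.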