Rakuten’s Participation in WAT 2022: Parallel Dataset Filtering by Leveraging Vocabulary Heterogeneity
Alberto Poncelas, Johanes Effendi, Ohnmar Htun, Sunil Yadav, Dongzhe Wang, Saurabh Jain
Abstract
This paper describes the participation of our neural machine translation system in the WAT 2022 shared translation task (team ID: sakura). We participated in the Parallel Data Filtering Task. Our approach, based on Feature Decay Algorithms (FDA), achieved gains of +1.4 and +2.4 BLEU points for English-to-Japanese and Japanese-to-English, respectively, over a model trained on the full dataset, showing the effectiveness of FDA for in-domain data selection.
- Anthology ID:
- 2022.wat-1.7
- Volume:
- Proceedings of the 9th Workshop on Asian Translation
- Month:
- October
- Year:
- 2022
- Address:
- Gyeongju, Republic of Korea
- Venue:
- WAT
- Publisher:
- International Conference on Computational Linguistics
- Pages:
- 68–72
- URL:
- https://aclanthology.org/2022.wat-1.7
- Cite (ACL):
- Alberto Poncelas, Johanes Effendi, Ohnmar Htun, Sunil Yadav, Dongzhe Wang, and Saurabh Jain. 2022. Rakuten’s Participation in WAT 2022: Parallel Dataset Filtering by Leveraging Vocabulary Heterogeneity. In Proceedings of the 9th Workshop on Asian Translation, pages 68–72, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
- Cite (Informal):
- Rakuten’s Participation in WAT 2022: Parallel Dataset Filtering by Leveraging Vocabulary Heterogeneity (Poncelas et al., WAT 2022)
- PDF:
- https://preview.aclanthology.org/naacl-24-ws-corrections/2022.wat-1.7.pdf
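The abstract's data-filtering method is based on Feature Decay Algorithms (FDA), a greedy selection scheme that scores candidate sentences by their n-gram overlap with an in-domain seed text and decays the weight of each n-gram every time it appears in an already-selected sentence. The sketch below is a minimal, generic illustration of that idea, not the paper's actual implementation; the function names, the bigram feature set, the length normalization, and the decay factor of 0.5 are all assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n_max=2):
    """All n-grams of the token list up to length n_max (assumed feature set)."""
    feats = []
    for n in range(1, n_max + 1):
        feats += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats


def fda_select(candidates, seed_text, k, decay=0.5, n_max=2):
    """Greedily pick k sentences from `candidates` whose n-grams overlap
    `seed_text`, halving a feature's contribution each time it has already
    been covered by a selected sentence (the 'feature decay')."""
    seed_feats = set(ngrams(seed_text.split(), n_max))
    counts = Counter()          # how often each seed feature was selected
    selected, pool = [], list(candidates)
    for _ in range(min(k, len(pool))):
        best, best_score = None, float("-inf")
        for sent in pool:
            toks = sent.split()
            feats = [f for f in ngrams(toks, n_max) if f in seed_feats]
            # Decayed overlap score, normalized by sentence length.
            score = sum(decay ** counts[f] for f in feats) / (len(toks) or 1)
            if score > best_score:
                best, best_score = sent, score
        pool.remove(best)
        selected.append(best)
        for f in ngrams(best.split(), n_max):
            if f in seed_feats:
                counts[f] += 1
    return selected
```

Because covered n-grams lose weight after each selection, the method favors sentences that add *new* in-domain vocabulary rather than repeating what is already covered, which is the vocabulary-heterogeneity property the paper's title refers to.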