Abstract
In this paper, we describe our submissions to the Similar Language Translation Shared Task 2020. We built 12 systems in each direction for the Hindi⇔Marathi language pair. We first outline baseline experiments that train statistical models with various tokenization schemes. Using the optimal tokenization scheme among these, we created synthetic source-side text with back-translation and pruned it using language model scores. This synthetic data was then used alongside the training data in various settings to build translation models. We also report the configurations of the submitted systems and the results they produced.

- Anthology ID:
- 2020.wmt-1.55
- Volume:
- Proceedings of the Fifth Conference on Machine Translation
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Venue:
- WMT
- SIG:
- SIGMT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 451–455
- URL:
- https://aclanthology.org/2020.wmt-1.55
- Cite (ACL):
- Saumitra Yadav and Manish Shrivastava. 2020. A3-108 Machine Translation System for Similar Language Translation Shared Task 2020. In Proceedings of the Fifth Conference on Machine Translation, pages 451–455, Online. Association for Computational Linguistics.
- Cite (Informal):
- A3-108 Machine Translation System for Similar Language Translation Shared Task 2020 (Yadav & Shrivastava, WMT 2020)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2020.wmt-1.55.pdf