Abstract
We introduce a purely monolingual approach to filtering for parallel data from a noisy corpus in a low-resource scenario. Our work is inspired by Junczys-Dowmunt (2018), but we relax the requirements to allow for cases where no parallel data is available. Our primary contribution is a dual monolingual cross-entropy delta criterion, modified from Cynical data selection (Axelrod, 2017). It is competitive (within 1.8 BLEU) with the best bilingual filtering method when used to train SMT systems. Our approach is featherweight and runs end-to-end on a standard laptop in three hours.
- Anthology ID:
- W19-5433
- Volume:
- Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
- Month:
- August
- Year:
- 2019
- Address:
- Florence, Italy
- Venue:
- WMT
- SIG:
- SIGMT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 245–251
- URL:
- https://aclanthology.org/W19-5433
- DOI:
- 10.18653/v1/W19-5433
- Cite (ACL):
- Amittai Axelrod, Anish Kumar, and Steve Sloto. 2019. Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 245–251, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal):
- Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data (Axelrod et al., WMT 2019)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/W19-5433.pdf