Abstract
We introduce a purely monolingual approach to filtering for parallel data from a noisy corpus in a low-resource scenario. Our work is inspired by Junczys-Dowmunt (2018), but we relax the requirements to allow for cases where no parallel data is available. Our primary contribution is a dual monolingual cross-entropy delta criterion, modified from Cynical data selection (Axelrod, 2017). When used to train SMT systems, it is competitive (within 1.8 BLEU) with the best bilingual filtering method. Our approach is featherweight and runs end-to-end on a standard laptop in three hours.
- Anthology ID:
- W19-5433
- Volume:
- Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
- Month:
- August
- Year:
- 2019
- Address:
- Florence, Italy
- Editors:
- Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Marco Turchi, Karin Verspoor
- Venue:
- WMT
- SIG:
- SIGMT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 245–251
- URL:
- https://aclanthology.org/W19-5433
- DOI:
- 10.18653/v1/W19-5433
- Cite (ACL):
- Amittai Axelrod, Anish Kumar, and Steve Sloto. 2019. Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 245–251, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal):
- Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data (Axelrod et al., WMT 2019)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-3/W19-5433.pdf