Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data

Amittai Axelrod, Anish Kumar, Steve Sloto


Abstract
We introduce a purely monolingual approach to filtering for parallel data from a noisy corpus in a low-resource scenario. Our work is inspired by Junczys-Dowmunt (2018), but we relax the requirements to allow for cases where no parallel data is available. Our primary contribution is a dual monolingual cross-entropy delta criterion modified from Cynical data selection (Axelrod, 2017). When used to train SMT systems, it is competitive (within 1.8 BLEU) with the best bilingual filtering method. Our approach is featherweight and runs end-to-end on a standard laptop in three hours.
Anthology ID:
W19-5433
Volume:
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
Month:
August
Year:
2019
Address:
Florence, Italy
Editors:
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Marco Turchi, Karin Verspoor
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
245–251
URL:
https://aclanthology.org/W19-5433
DOI:
10.18653/v1/W19-5433
Cite (ACL):
Amittai Axelrod, Anish Kumar, and Steve Sloto. 2019. Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 245–251, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data (Axelrod et al., WMT 2019)
PDF:
https://preview.aclanthology.org/ml4al-ingestion/W19-5433.pdf
Poster:
W19-5433.Poster.pdf