Parallel Corpus Filtering Based on Fuzzy String Matching

Sukanta Sen, Asif Ekbal, Pushpak Bhattacharyya


Abstract
In this paper, we describe IIT Patna’s submission to the WMT 2019 shared task on parallel corpus filtering. The shared task asks participants to develop methods for scoring each sentence pair in a given noisy parallel corpus. The quality of a scoring method is judged by the quality of SMT and NMT systems trained on a smaller set of high-quality parallel sentences sub-sampled from the original noisy corpus. The task has two language pairs, Nepali-English and Sinhala-English, and we submit for both. We define a fuzzy string matching score, based on Levenshtein distance, between the English sentence and the source sentence translated into English. Based on these scores, we sub-sample two sets of parallel sentences (containing 1 million and 5 million English tokens) from each parallel corpus and train SMT systems for development purposes only. The organizers publish the official evaluation using both SMT and NMT on the final official test set. A total of 10 teams participated in the shared task, and according to the official evaluation, our scoring method obtains 2nd position in the team ranking for the 1-million Nepali-English NMT and 5-million Sinhala-English NMT categories.
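The scoring and sub-sampling steps described in the abstract can be sketched roughly as follows. The abstract does not give the exact formula, so the normalization used here (one minus the edit distance divided by the longer string's length), the lowercasing, and the names `levenshtein`, `fuzzy_score`, and `subsample` are all assumptions made for illustration, not the authors' implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute all cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_score(english: str, translated_source: str) -> float:
    """Similarity in [0, 1]; the normalization is an assumption, not from the paper."""
    a = english.lower().strip()
    b = translated_source.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: an empty pair carries no useful signal
    return 1.0 - levenshtein(a, b) / longest


def subsample(pairs, token_budget):
    """Keep the highest-scoring (score, source, english) pairs within an English token budget."""
    kept, total = [], 0
    for score, source, english in sorted(pairs, key=lambda p: p[0], reverse=True):
        n_tokens = len(english.split())  # whitespace tokenization, assumed
        if total + n_tokens > token_budget:
            break
        kept.append((score, source, english))
        total += n_tokens
    return kept
```

In this sketch, each pair would be scored with `fuzzy_score(english, translated_source)` and then passed to `subsample` with a budget of 1 million or 5 million English tokens.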
Anthology ID:
W19-5440
Volume:
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
Month:
August
Year:
2019
Address:
Florence, Italy
Editors:
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Marco Turchi, Karin Verspoor
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
289–293
URL:
https://aclanthology.org/W19-5440
DOI:
10.18653/v1/W19-5440
Cite (ACL):
Sukanta Sen, Asif Ekbal, and Pushpak Bhattacharyya. 2019. Parallel Corpus Filtering Based on Fuzzy String Matching. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 289–293, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Parallel Corpus Filtering Based on Fuzzy String Matching (Sen et al., WMT 2019)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/W19-5440.pdf