Abstract
A hybrid pipeline comprising rules and machine learning is used to filter a noisy web English-German parallel corpus for the Parallel Corpus Filtering task. The core of the pipeline is a module based on the logistic regression algorithm that returns the probability that a translation unit is accepted. The training set for the logistic regression is created by automatic annotation. The quality of the automatic annotation is estimated by manually labeling the training set.
- Anthology ID:
- W18-6474
- Volume:
- Proceedings of the Third Conference on Machine Translation: Shared Task Papers
- Month:
- October
- Year:
- 2018
- Address:
- Belgium, Brussels
- Editors:
- Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, Karin Verspoor
- Venue:
- WMT
- SIG:
- SIGMT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 867–871
- URL:
- https://preview.aclanthology.org/remove-affiliations/W18-6474/
- DOI:
- 10.18653/v1/W18-6474
- Cite (ACL):
- Eduard Barbu and Verginica Barbu Mititelu. 2018. A hybrid pipeline of rules and machine learning to filter web-crawled parallel corpora. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 867–871, Belgium, Brussels. Association for Computational Linguistics.
- Cite (Informal):
- A hybrid pipeline of rules and machine learning to filter web-crawled parallel corpora (Barbu & Barbu Mititelu, WMT 2018)
- PDF:
- https://preview.aclanthology.org/remove-affiliations/W18-6474.pdf
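
The sketch below illustrates the kind of logistic-regression filter the abstract describes: a classifier trained on automatically annotated translation units that returns an acceptance probability for each source/target pair. It is a minimal, hypothetical example assuming scikit-learn, toy features (length ratio and token overlap), and an illustrative 0.5 threshold; it is not the authors' implementation.

```python
# Minimal sketch (not the authors' code): score translation units with a
# logistic regression classifier trained on automatically annotated examples.
# Feature choices and the 0.5 acceptance threshold are illustrative assumptions.
from sklearn.linear_model import LogisticRegression
import numpy as np

def features(src, tgt):
    """Toy features for a source/target pair: length ratio and token overlap."""
    src_toks, tgt_toks = src.split(), tgt.split()
    len_ratio = len(src_toks) / max(len(tgt_toks), 1)
    overlap = len(set(src_toks) & set(tgt_toks)) / max(len(src_toks), 1)
    return [len_ratio, overlap]

# Automatically annotated training data: (source, target, label) triples,
# where label 1 marks an acceptable translation unit.
train = [
    ("the cat sleeps", "die Katze schläft", 1),
    ("click here to win", "Impressum Kontakt Datenschutz", 0),
]
X = np.array([features(s, t) for s, t, _ in train])
y = np.array([label for _, _, label in train])

clf = LogisticRegression().fit(X, y)

def accept_probability(src, tgt):
    """Probability that the translation unit is accepted (class 1)."""
    return clf.predict_proba(np.array([features(src, tgt)]))[0, 1]

# Keep only pairs whose acceptance probability exceeds the threshold.
kept = [(s, t) for s, t, _ in train if accept_probability(s, t) > 0.5]
```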