Abstract
A hybrid pipeline comprising rules and machine learning is used to filter a noisy web English-German parallel corpus for the Parallel Corpus Filtering task. The core of the pipeline is a module based on the logistic regression algorithm that returns the probability that a translation unit is accepted. The training set for the logistic regression is created by automatic annotation. The quality of the automatic annotation is estimated by manually labeling the training set.

- Anthology ID:
- W18-6474
- Volume:
- Proceedings of the Third Conference on Machine Translation: Shared Task Papers
- Month:
- October
- Year:
- 2018
- Address:
- Belgium, Brussels
- Venue:
- WMT
- SIG:
- SIGMT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 867–871
- URL:
- https://aclanthology.org/W18-6474
- DOI:
- 10.18653/v1/W18-6474
- Cite (ACL):
- Eduard Barbu and Verginica Barbu Mititelu. 2018. A hybrid pipeline of rules and machine learning to filter web-crawled parallel corpora. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 867–871, Belgium, Brussels. Association for Computational Linguistics.
- Cite (Informal):
- A hybrid pipeline of rules and machine learning to filter web-crawled parallel corpora (Barbu & Barbu Mititelu, WMT 2018)
- PDF:
- https://preview.aclanthology.org/starsem-semeval-split/W18-6474.pdf
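
The abstract describes a rule stage followed by a logistic regression module that returns the probability that a translation unit is accepted. The sketch below illustrates that general idea only; it is not the authors' implementation. The rules in `rule_filter`, the two features (length ratio and number overlap), the toy training pairs with their labels, and the hyperparameters are all hypothetical and chosen for illustration.

```python
import math

def rule_filter(source, target):
    """Hypothetical hard rules: drop empty or untranslated (identical) units."""
    s, t = source.strip(), target.strip()
    return bool(s) and bool(t) and s.lower() != t.lower()

def features(source, target):
    """Hypothetical features: bias term, token length ratio, number overlap."""
    s_tok, t_tok = source.split(), target.split()
    ratio = min(len(s_tok), len(t_tok)) / max(len(s_tok), len(t_tok), 1)
    s_nums = {tok for tok in s_tok if tok.isdigit()}
    t_nums = {tok for tok in t_tok if tok.isdigit()}
    union = s_nums | t_nums
    overlap = len(s_nums & t_nums) / len(union) if union else 1.0
    return [1.0, ratio, overlap]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, labels, epochs=2000, lr=0.5):
    """Fit logistic regression weights by batch gradient descent on the log-loss."""
    w = [0.0] * len(rows[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

def accept_probability(w, source, target):
    """Return 0.0 if a rule rejects the unit, else the model's acceptance probability."""
    if not rule_filter(source, target):
        return 0.0
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(source, target))))

# Toy, hand-made stand-in for an automatically annotated training set (1 = accept).
TRAIN = [
    ("The house is small .", "Das Haus ist klein .", 1),
    ("I have 2 cats .", "Ich habe 2 Katzen .", 1),
    ("Open daily from 9 to 17 .", "Täglich von 9 bis 17 geöffnet .", 1),
    ("Click here", "Klicken Sie hier, um alle unsere Angebote und Preise für 2018 zu sehen", 0),
    ("Price : 30 euros for the whole family per night", "Preis", 0),
    ("Room 12", "Zimmer 45 mit Blick auf den See und die Berge", 0),
]

weights = train([features(s, t) for s, t, _ in TRAIN], [y for _, _, y in TRAIN])

if __name__ == "__main__":
    print(accept_probability(weights, "The car is red .", "Das Auto ist rot ."))
    print(accept_probability(weights, "Hello", "Hello"))
```

Running the rules first keeps obviously unusable units away from the classifier, so the model only has to score the cases where a graded probability is informative. A threshold on that probability (for example 0.5, also an assumption here) would then decide which units to keep.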