Abstract
This paper describes the University of Helsinki Language Technology group’s participation in the WMT 2019 parallel corpus filtering task. Our scores were produced using a two-step strategy. First, we individually applied a series of filters to remove the ‘bad’ quality sentences. Then, we produced scores for each sentence by weighting these features with a classification model. This methodology allowed us to build a simple and reliable system that is easily adaptable to other language pairs.
- Anthology ID: W19-5441
- Volume: Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
- Month: August
- Year: 2019
- Address: Florence, Italy
- Venue: WMT
- SIG: SIGMT
- Publisher: Association for Computational Linguistics
- Pages: 294–300
- URL: https://aclanthology.org/W19-5441
- DOI: 10.18653/v1/W19-5441
- Cite (ACL): Raúl Vázquez, Umut Sulubacak, and Jörg Tiedemann. 2019. The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 294–300, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal): The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task (Vázquez et al., WMT 2019)
- PDF: https://preview.aclanthology.org/ingestion-script-update/W19-5441.pdf