@inproceedings{ash-etal-2018-speechmatics,
    title = "The Speechmatics Parallel Corpus Filtering System for {WMT}18",
    author = "Ash, Tom  and
      Francis, Remi  and
      Williams, Will",
    editor = "Bojar, Ond{\v{r}}ej  and
      Chatterjee, Rajen  and
      Federmann, Christian  and
      Fishel, Mark  and
      Graham, Yvette  and
      Haddow, Barry  and
      Huck, Matthias  and
      Yepes, Antonio Jimeno  and
      Koehn, Philipp  and
      Monz, Christof  and
      Negri, Matteo  and
      N{\'e}v{\'e}ol, Aur{\'e}lie  and
      Neves, Mariana  and
      Post, Matt  and
      Specia, Lucia  and
      Turchi, Marco  and
      Verspoor, Karin",
    booktitle = "Proceedings of the Third Conference on Machine Translation: Shared Task Papers",
    month = oct,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/iwcs-25-ingestion/W18-6472/",
    doi = "10.18653/v1/W18-6472",
    pages = "853--859",
    abstract = "Our entry to the parallel corpus filtering task uses a two-step strategy. The first step uses a series of pragmatic hard `rules' to remove the worst example sentences. This first step reduces the effective corpus size down from the initial 1 billion to 160 million tokens. The second step uses four different heuristics weighted to produce a score that is then used for further filtering down to 100 or 10 million tokens. Our final system produces competitive results without requiring excessive fine tuning to the exact task or language pair. The first step in isolation provides a very fast filter that gives most of the gains of the final system."
}