@inproceedings{wang-etal-2018-nicts,
    title = "{NICT}{'}s Corpus Filtering Systems for the {WMT}18 Parallel Corpus Filtering Task",
    author = "Wang, Rui  and
      Marie, Benjamin  and
      Utiyama, Masao  and
      Sumita, Eiichiro",
    editor = "Bojar, Ond{\v{r}}ej  and
      Chatterjee, Rajen  and
      Federmann, Christian  and
      Fishel, Mark  and
      Graham, Yvette  and
      Haddow, Barry  and
      Huck, Matthias  and
      Yepes, Antonio Jimeno  and
      Koehn, Philipp  and
      Monz, Christof  and
      Negri, Matteo  and
      N{\'e}v{\'e}ol, Aur{\'e}lie  and
      Neves, Mariana  and
      Post, Matt  and
      Specia, Lucia  and
      Turchi, Marco  and
      Verspoor, Karin",
    booktitle = "Proceedings of the Third Conference on Machine Translation: Shared Task Papers",
    month = oct,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/iwcs-25-ingestion/W18-6489/",
    doi = "10.18653/v1/W18-6489",
    pages = "963--967",
    abstract = "This paper presents NICT{'}s participation in the WMT18 shared parallel corpus filtering task. The organizers provided a 1-billion-word German-English corpus crawled from the web as part of the Paracrawl project. This corpus is too noisy to build an acceptable neural machine translation (NMT) system. Using the clean data of the WMT18 shared news translation task, we designed several features and trained a classifier to score each sentence pair in the noisy data. Finally, we sampled 100 million and 10 million words and built corresponding NMT systems. Empirical results show that our NMT systems trained on sampled data achieve promising performance."
}