Abstract
We introduce neural methods and a toxicity filtering step to the hierarchical web mining approach of Paracrawl (Bañón et al., 2020), showing large improvements. We apply these methods to web-scale parallel corpus mining for 9 South and East Asian national languages, creating training resources for machine translation that yield better translation quality for most of these languages than existing publicly available datasets in OPUS. Our methods also generally lead to better results than the global mining approach of Schwenk et al. (2021).

- Anthology ID: 2024.wmt-1.132
- Volume: Proceedings of the Ninth Conference on Machine Translation
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
- Venue: WMT
- Publisher: Association for Computational Linguistics
- Pages: 1454–1466
- URL: https://aclanthology.org/2024.wmt-1.132
- DOI: 10.18653/v1/2024.wmt-1.132
- Cite (ACL): Philipp Koehn. 2024. Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for South and East Asian Languages. In Proceedings of the Ninth Conference on Machine Translation, pages 1454–1466, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for South and East Asian Languages (Koehn, WMT 2024)
- PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.wmt-1.132.pdf