Abstract
Although a machine translation model trained with a large in-domain parallel corpus achieves remarkable results, it still works poorly when no in-domain data are available. This situation restricts the applicability of machine translation when the target domain’s data are limited. However, there is great demand for high-quality domain-specific machine translation models for many domains. We propose a framework that efficiently and effectively collects parallel sentences in a target domain from the web with the help of crowdworkers. With the collected parallel data, we can quickly adapt a machine translation model to the target domain. Our experiments show that the proposed method can collect target-domain parallel data over a few days at a reasonable cost. We tested it with five domains, and the domain-adapted model improved BLEU scores by up to +19.7 points, and by +7.8 points on average, compared to a general-purpose translation model.
- Anthology ID:
- 2022.emnlp-industry.62
- Volume:
- Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, UAE
- Editors:
- Yunyao Li, Angeliki Lazaridou
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 606–618
- URL:
- https://aclanthology.org/2022.emnlp-industry.62
- DOI:
- 10.18653/v1/2022.emnlp-industry.62
- Cite (ACL):
- Makoto Morishita, Jun Suzuki, and Masaaki Nagata. 2022. Domain Adaptation of Machine Translation with Crowdworkers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 606–618, Abu Dhabi, UAE. Association for Computational Linguistics.
- Cite (Informal):
- Domain Adaptation of Machine Translation with Crowdworkers (Morishita et al., EMNLP 2022)
- PDF:
- https://preview.aclanthology.org/proper-vol2-ingestion/2022.emnlp-industry.62.pdf