Abstract
Unseen words, also called out-of-vocabulary words (OOVs), are difficult for machine translation. In neural machine translation, byte-pair encoding can be used to represent OOVs, but they are still often incorrectly translated. We improve the translation of OOVs in NMT using easy-to-obtain monolingual data. We look for OOVs in the text to be translated and translate them using simple-to-construct bilingual word embeddings (BWEs). In our MT experiments we take the 5-best candidates, which is motivated by intrinsic mining experiments. Using all five of the proposed target language words as queries we mine target-language sentences. We then back-translate, forcing the back-translation of each of the five proposed target-language OOV-translation-candidates to be the original source-language OOV. We show that by using this synthetic data to fine-tune our system the translation of OOVs can be dramatically improved. In our experiments we use a system trained on Europarl and mine sentences containing medical terms from monolingual data.

- Anthology ID:
- P19-1581
- Volume:
- Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
- Month:
- July
- Year:
- 2019
- Address:
- Florence, Italy
- Editors:
- Anna Korhonen, David Traum, Lluís Màrquez
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5809–5815
- URL:
- https://aclanthology.org/P19-1581
- DOI:
- 10.18653/v1/P19-1581
- Cite (ACL):
- Matthias Huck, Viktor Hangya, and Alexander Fraser. 2019. Better OOV Translation with Bilingual Terminology Mining. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5809–5815, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal):
- Better OOV Translation with Bilingual Terminology Mining (Huck et al., ACL 2019)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/P19-1581.pdf
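
The pipeline summarized in the abstract (pick the k-best target candidates for an OOV from bilingual word embeddings, mine target-language sentences containing them, then back-translate with the OOV forced in) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy two-dimensional vectors, the example words, and especially `toy_back_translate`, which only substitutes the forced word and stands in for a real constrained back-translation system. It is not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_candidates(oov_vec, target_embeddings, k=5):
    """Rank target words by similarity to the OOV in the shared space."""
    ranked = sorted(
        target_embeddings,
        key=lambda w: cosine(oov_vec, target_embeddings[w]),
        reverse=True,
    )
    return ranked[:k]

def mine_sentences(candidates, monolingual_corpus):
    """Return (candidate, sentence) pairs for sentences containing a candidate."""
    hits = []
    for sentence in monolingual_corpus:
        tokens = sentence.split()
        for cand in candidates:
            if cand in tokens:
                hits.append((cand, sentence))
    return hits

def toy_back_translate(sentence, forced):
    """Hypothetical stand-in for a back-translation system.

    A real system would translate the whole sentence into the source
    language; this toy version only applies the forced substitutions.
    """
    return " ".join(forced.get(tok, tok) for tok in sentence.split())

def build_synthetic_pairs(oov, hits, back_translate):
    """Create (source, target) pairs with the candidate forced back to the OOV."""
    pairs = []
    for cand, target_sentence in hits:
        source_sentence = back_translate(target_sentence, forced={cand: oov})
        pairs.append((source_sentence, target_sentence))
    return pairs

# Toy data (invented for this sketch).
oov = "nephritis"
oov_vec = [1.0, 0.0]
target_embeddings = {
    "Nephritis": [0.9, 0.1],
    "Niere": [0.7, 0.3],
    "Haus": [0.0, 1.0],
}
corpus = ["Die Nephritis ist selten", "Das Haus ist alt"]

candidates = top_k_candidates(oov_vec, target_embeddings, k=2)
hits = mine_sentences(candidates, corpus)
pairs = build_synthetic_pairs(oov, hits, toy_back_translate)
```

The resulting `pairs` play the role of the synthetic parallel data that the abstract describes using for fine-tuning; with a real back-translation model the source side would be a full source-language sentence rather than a word substitution.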