Domain-Specific Text Generation for Machine Translation

Yasmin Moslem, Rejwanul Haque, John Kelleher, Andy Way


Abstract
Preservation of domain knowledge from the source to target is crucial in any translation workflow. It is common in the translation industry to receive highly-specialized projects, where there is hardly any parallel in-domain data. In such scenarios where there is insufficient in-domain data to fine-tune Machine Translation (MT) models, producing translations that are consistent with the relevant context is challenging. In this work, we propose leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either (a) a small bilingual dataset, or (b) the monolingual source text to be translated. Combining this idea with back-translation, we can generate huge amounts of synthetic bilingual in-domain data for both use cases. For our investigation, we used the state-of-the-art MT architecture, Transformer. We employed mixed fine-tuning to train models that significantly improve translation of in-domain texts. More specifically, our proposed methods achieved improvements of approximately 5-6 BLEU and 2-3 BLEU, respectively, on Arabic-to-English and English-to-Arabic language pairs. Furthermore, the outcome of human evaluation corroborates the automatic evaluation results.
Anthology ID:
2022.amta-research.2
Volume:
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Month:
September
Year:
2022
Address:
Orlando, USA
Editors:
Kevin Duh, Francisco Guzmán
Venue:
AMTA
SIG:
Publisher:
Association for Machine Translation in the Americas
Note:
Pages:
14–30
Language:
URL:
https://aclanthology.org/2022.amta-research.2
DOI:
Bibkey:
Cite (ACL):
Yasmin Moslem, Rejwanul Haque, John Kelleher, and Andy Way. 2022. Domain-Specific Text Generation for Machine Translation. In Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 14–30, Orlando, USA. Association for Machine Translation in the Americas.
Cite (Informal):
Domain-Specific Text Generation for Machine Translation (Moslem et al., AMTA 2022)
Copy Citation:
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2022.amta-research.2.pdf
Code
 ymoslem/mt-preparation +  additional community code