English-Russian Data Augmentation for Neural Machine Translation

Nikita Teslenko Grygoryev, Mercedes Garcia Martinez, Francisco Casacuberta Nolla, Amando Estela Pastor, Manuel Herranz


Abstract
Data Augmentation (DA) refers to strategies for increasing the diversity of training examples without manually collecting new data. We use neural networks and linguistic resources to automatically generate text in Russian. The system generates new texts using information from embeddings trained on a large amount of data in neural language models. Data from the public domain were used for the experiments. The generated texts enlarge the corpus used to train models for NLP tasks such as machine translation. We analyze the quality of the generated texts and add them to the training process of Neural Machine Translation (NMT) models. To evaluate the NMT models, we first compare them quantitatively using several standard automatic machine translation metrics, and we measure the time spent and the amount of text generated with a view to practical use in the language industry. Second, we compare the NMT models qualitatively by presenting example translations and comparing them with each other. By fine-tuning NMT systems on the newly generated datasets, our DA method achieves better results than a baseline model.
Anthology ID:
2022.amta-coco4mt.1
Volume:
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Workshop 2: Corpus Generation and Corpus Augmentation for Machine Translation)
Month:
September
Year:
2022
Editors:
John E. Ortega, Marine Carpuat, William Chen, Katharina Kann, Constantine Lignos, Maja Popovic, Shabnam Tafreshi
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
1–10
URL:
https://aclanthology.org/2022.amta-coco4mt.1
Cite (ACL):
Nikita Teslenko Grygoryev, Mercedes Garcia Martinez, Francisco Casacuberta Nolla, Amando Estela Pastor, and Manuel Herranz. 2022. English-Russian Data Augmentation for Neural Machine Translation. In Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Workshop 2: Corpus Generation and Corpus Augmentation for Machine Translation), pages 1–10. Association for Machine Translation in the Americas.
Cite (Informal):
English-Russian Data Augmentation for Neural Machine Translation (Teslenko Grygoryev et al., AMTA 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2022.amta-coco4mt.1.pdf