Machine Translation in the Covid domain: an English-Irish case study for LoResMT 2021

Seamus Lankford, Haithem Afli, Andy Way


Abstract
Translation models for the specific domain of translating Covid data from English to Irish were developed for the LoResMT 2021 shared task. Domain adaptation techniques, using a Covid-adapted generic 55k corpus from the Directorate General of Translation, were applied. Fine-tuning, mixed fine-tuning and combined dataset approaches were compared with models trained on an extended in-domain dataset. As part of this study, an English-Irish dataset of Covid-related data, from the Health and Education domains, was developed. The highest-performing model used a Transformer architecture trained with an extended in-domain Covid dataset. In the context of this study, we have demonstrated that extending an 8k in-domain baseline dataset by just 5k lines improved the BLEU score by 27 points.
Anthology ID:
2021.mtsummit-loresmt.15
Volume:
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)
Month:
August
Year:
2021
Address:
Virtual
Venue:
LoResMT
Publisher:
Association for Machine Translation in the Americas
Pages:
144–150
URL:
https://aclanthology.org/2021.mtsummit-loresmt.15
Cite (ACL):
Seamus Lankford, Haithem Afli, and Andy Way. 2021. Machine Translation in the Covid domain: an English-Irish case study for LoResMT 2021. In Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021), pages 144–150, Virtual. Association for Machine Translation in the Americas.
Cite (Informal):
Machine Translation in the Covid domain: an English-Irish case study for LoResMT 2021 (Lankford et al., LoResMT 2021)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2021.mtsummit-loresmt.15.pdf