Building Machine Translation System for Software Product Descriptions Using Domain-specific Sub-corpora Extraction

Pintu Lohar, Sinead Madden, Edmond O’Connor, Maja Popovic, Tanya Habruseva


Abstract
Building a Machine Translation system for a specific domain requires a sufficiently large, good-quality parallel corpus in that domain. However, this is challenging because parallel data is scarce in many domains, such as economics, science and technology, and sports. In this work, we build English-to-French translation systems for software product descriptions scraped from the LinkedIn website. We also develop a first-ever parallel test data set of product descriptions. We conduct experiments by building a baseline translation system trained on general-domain data and then domain-adapted systems using sentence-embedding-based corpus filtering and domain-specific sub-corpora extraction. All systems are tested on our newly developed data set. Our experimental evaluation reveals that the domain-adapted model based on our proposed approaches outperforms the baseline.
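The abstract does not spell out how the sub-corpus is extracted, so the following is only a minimal sketch of the general idea of similarity-based sub-corpus extraction, not the authors' implementation. It assumes a small set of in-domain seed sentences, uses a toy bag-of-words vector as a stand-in for a real sentence encoder, and keeps a sentence pair when its source side is close enough to any seed. The names `embed`, `cosine`, `extract_subcorpus`, and the `threshold` value are all hypothetical.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector.

    Assumption: a real system would use a trained sentence-embedding model.
    """
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def extract_subcorpus(pairs, seeds, threshold=0.3):
    """Keep (source, target) pairs whose source resembles any seed sentence.

    The threshold is a hypothetical value chosen for illustration only.
    """
    seed_vectors = [embed(seed) for seed in seeds]
    kept = []
    for source, target in pairs:
        vector = embed(source)
        # Score each pair by its best match against the in-domain seeds.
        score = max((cosine(vector, s) for s in seed_vectors), default=0.0)
        if score >= threshold:
            kept.append((source, target))
    return kept


# Hypothetical general-domain corpus and in-domain seed sentence.
general_corpus = [
    ("The software helps teams manage projects",
     "Le logiciel aide les equipes a gerer des projets"),
    ("The match ended in a draw",
     "Le match s'est termine par un match nul"),
]
seeds = ["project management software for teams"]

in_domain = extract_subcorpus(general_corpus, seeds)
print(len(in_domain), "pair(s) kept")
```

In this toy run only the software-related pair passes the threshold. Swapping `embed` for a multilingual sentence encoder would also allow scoring source and target sides against each other, which is one common way to filter noisy parallel data.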
Anthology ID:
2022.amta-research.1
Volume:
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Month:
September
Year:
2022
Address:
Orlando, USA
Editors:
Kevin Duh, Francisco Guzmán
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
1–13
URL:
https://aclanthology.org/2022.amta-research.1
Cite (ACL):
Pintu Lohar, Sinead Madden, Edmond O’Connor, Maja Popovic, and Tanya Habruseva. 2022. Building Machine Translation System for Software Product Descriptions Using Domain-specific Sub-corpora Extraction. In Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 1–13, Orlando, USA. Association for Machine Translation in the Americas.
Cite (Informal):
Building Machine Translation System for Software Product Descriptions Using Domain-specific Sub-corpora Extraction (Lohar et al., AMTA 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-1/2022.amta-research.1.pdf