Automatic Construction of Morphologically Motivated Translation Models for Highly Inflected, Low-Resource Languages

John Hewitt, Matt Post, David Yarowsky


Abstract
Statistical Machine Translation (SMT) of highly inflected, low-resource languages suffers from the problem of low bitext availability, which is exacerbated by large inflectional paradigms. When translating into English, rich source inflections have a high chance of being poorly estimated or out-of-vocabulary (OOV). We present a source-language-agnostic system for automatically constructing phrase pairs from foreign-language inflections and their morphological analyses using manually constructed datasets, including Wiktionary. We then demonstrate the utility of these phrase tables in improving translation into English from Finnish, Czech, and Turkish in simulated low-resource settings, finding substantial gains in translation quality. We report up to +2.58 BLEU in a simulated low-resource setting and +1.65 BLEU in a moderate-resource setting. We release our morphologically motivated translation models, with tens of thousands of inflections in each of 8 languages.
Anthology ID:
2016.amta-researchers.14
Volume:
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track
Month:
October 28 - November 1
Year:
2016
Address:
Austin, TX, USA
Editors:
Spence Green, Lane Schwartz
Venue:
AMTA
Publisher:
The Association for Machine Translation in the Americas
Pages:
177–190
URL:
https://preview.aclanthology.org/icon-24-ingestion/2016.amta-researchers.14/
Cite (ACL):
John Hewitt, Matt Post, and David Yarowsky. 2016. Automatic Construction of Morphologically Motivated Translation Models for Highly Inflected, Low-Resource Languages. In Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track, pages 177–190, Austin, TX, USA. The Association for Machine Translation in the Americas.
Cite (Informal):
Automatic Construction of Morphologically Motivated Translation Models for Highly Inflected, Low-Resource Languages (Hewitt et al., AMTA 2016)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2016.amta-researchers.14.pdf
Code:
john-hewitt/morph16