Abstract
Source languages with complex word-formation rules present a challenge for statistical machine translation (SMT). In this paper, we take on three facets of this challenge: (1) common stems are fragmented into many different forms in training data, (2) rare and unknown words are frequent in test data, and (3) spelling variation creates additional sparseness problems. We present a novel, lightweight technique for dealing with this fragmentation, based on bilingual data, and we also present a combination of linguistic and statistical techniques for dealing with rare and unknown words. Taking these techniques together, we demonstrate +1.3 and +1.6 BLEU increases on top of strong baselines for Arabic-English machine translation.

- Anthology ID: 2008.amta-papers.7
- Volume: Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers
- Month: October 21-25
- Year: 2008
- Address: Waikiki, USA
- Venue: AMTA
- Publisher: Association for Machine Translation in the Americas
- Pages: 89–96
- URL: https://aclanthology.org/2008.amta-papers.7
- Cite (ACL): Steve DeNeefe, Ulf Hermjakob, and Kevin Knight. 2008. Overcoming Vocabulary Sparsity in MT Using Lattices. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers, pages 89–96, Waikiki, USA. Association for Machine Translation in the Americas.
- Cite (Informal): Overcoming Vocabulary Sparsity in MT Using Lattices (DeNeefe et al., AMTA 2008)
- PDF: https://preview.aclanthology.org/paclic-22-ingestion/2008.amta-papers.7.pdf