Morphological pre-processing for Turkish to English statistical machine translation

Arianna Bisazza, Marcello Federico


Abstract
We address the complex morphology of Turkish by applying different morphological word segmentation schemes to the training and test data of a phrase-based statistical machine translation system. These techniques considerably reduce the size of the training dictionary and lower the out-of-vocabulary rate of the test set. By minimizing the differences between the lexical granularities of Turkish and English, we can produce more refined alignments and a better model of the translation task. Morphological segmentation is highly language dependent and requires a fair amount of linguistic knowledge in its development phase. Yet it is fast and lightweight (it does not involve syntax) and appears to benefit our IWSLT09 system: our best segmentation scheme, combined with a simple lexical approximation technique, achieved a 50% reduction in the out-of-vocabulary rate and an improvement of over 5 BLEU points over the baseline.
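To make the abstract's two measures concrete, the following minimal sketch shows how splitting suffixes off words can shrink the training dictionary and lower the out-of-vocabulary rate. It is only an illustration under stated assumptions: the suffix list, the `segment` function, and the toy word lists are hypothetical and far simpler than the segmentation schemes evaluated in the paper, which rely on real morphological analysis.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose type never occurs in the training data."""
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Hypothetical toy suffix list (plural -ler, locative -de); not the paper's scheme.
SUFFIXES = ("ler", "de")


def segment(word):
    """Split off one known suffix, marking it with '+'; otherwise keep the word whole."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return [word[: -len(suffix)], "+" + suffix]
    return [word]


def segment_all(words):
    """Apply the toy segmenter to every word and flatten the result."""
    return [piece for word in words for piece in segment(word)]


# Toy data (assumed for illustration): ev = house, kedi = cat.
train = ["ev", "evler", "evde", "kedi", "kedide"]
test = ["kediler", "evde"]

print("dictionary size, words:   ", len(set(train)))
print("dictionary size, segments:", len(set(segment_all(train))))
print("OOV rate, words:   ", oov_rate(train, test))
print("OOV rate, segments:", oov_rate(segment_all(train), segment_all(test)))
```

In this toy setup the unseen word form "kediler" becomes the two known units "kedi" and "+ler" after segmentation, which is the kind of effect the abstract describes. The numbers it prints say nothing about the results reported in the paper.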
Anthology ID:
2009.iwslt-papers.1
Volume:
Proceedings of the 6th International Workshop on Spoken Language Translation: Papers
Month:
December 1-2
Year:
2009
Address:
Tokyo, Japan
Venue:
IWSLT
SIG:
SIGSLT
Pages:
129–135
URL:
https://aclanthology.org/2009.iwslt-papers.1
Cite (ACL):
Arianna Bisazza and Marcello Federico. 2009. Morphological pre-processing for Turkish to English statistical machine translation. In Proceedings of the 6th International Workshop on Spoken Language Translation: Papers, pages 129–135, Tokyo, Japan.
Cite (Informal):
Morphological pre-processing for Turkish to English statistical machine translation (Bisazza & Federico, IWSLT 2009)
PDF:
https://preview.aclanthology.org/naacl24-info/2009.iwslt-papers.1.pdf
Presentation:
 2009.iwslt-papers.1.Presentation.pdf