Abstract
We tried to cope with the complex morphology of Turkish by applying different morphological word segmentation schemes to the training and test data of a phrase-based statistical machine translation system. These techniques considerably reduce the size of the training dictionary and lower the out-of-vocabulary rate of the test set. By minimizing the differences between the lexical granularities of Turkish and English, we can produce more refined alignments and model the translation task better. Morphological segmentation is highly language dependent and requires a fair amount of linguistic knowledge in its development phase. Yet it is fast and lightweight, since it does not involve syntax, and it appears to benefit our IWSLT09 system: our best segmentation scheme, combined with a simple lexical approximation technique, achieved a 50% reduction in the out-of-vocabulary rate and an improvement of over 5 BLEU points over the baseline.
- Anthology ID:
- 2009.iwslt-papers.1
- Volume:
- Proceedings of the 6th International Workshop on Spoken Language Translation: Papers
- Month:
- December 1-2
- Year:
- 2009
- Address:
- Tokyo, Japan
- Venue:
- IWSLT
- SIG:
- SIGSLT
- Pages:
- 129–135
- URL:
- https://aclanthology.org/2009.iwslt-papers.1
- Cite (ACL):
- Arianna Bisazza and Marcello Federico. 2009. Morphological pre-processing for Turkish to English statistical machine translation. In Proceedings of the 6th International Workshop on Spoken Language Translation: Papers, pages 129–135, Tokyo, Japan.
- Cite (Informal):
- Morphological pre-processing for Turkish to English statistical machine translation (Bisazza & Federico, IWSLT 2009)
- PDF:
- https://preview.aclanthology.org/auto-file-uploads/2009.iwslt-papers.1.pdf