The pay-offs of preprocessing for German-English statistical machine translation

Ilknur Durgar El-Kahlout, Francois Yvon


Abstract
In this paper, we present the results of our work on improving preprocessing for German-English statistical machine translation. We implemented and tested various improvements aimed at i) converting German texts to the new orthographic conventions; ii) performing a new tokenization for German; iii) normalizing lexical redundancy with the help of POS tagging and morphological analysis; iv) splitting German compound words with a frequency-based algorithm; and v) reducing singletons and out-of-vocabulary (OOV) words. All these steps are performed during preprocessing on the German side. Combining all these processes, we reduced singletons by 10% and OOV words by 2%, and obtained a 1.5 absolute (7% relative) BLEU improvement on the WMT 2010 German-to-English News translation task.
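To illustrate step iv, the following is a minimal sketch of one common form of frequency-based compound splitting. The abstract does not specify the scoring used, so the geometric-mean score, the linking elements in `FILLERS`, the minimum part length `MIN_PART`, and the toy counts in the usage note are all assumptions made for illustration, not the authors' implementation.

```python
from math import prod

# Assumed linking elements that may sit between compound parts.
FILLERS = ("", "s", "es")
# Assumed minimum length of a compound part, to avoid spurious short splits.
MIN_PART = 3


def candidate_splits(word, freq):
    """Yield every way to cover `word` with corpus-attested parts."""
    if word in freq:
        yield [word]
    for i in range(MIN_PART, len(word) - MIN_PART + 1):
        head = word[:i]
        if head not in freq:
            continue
        for filler in FILLERS:
            rest = word[i:]
            if not rest.startswith(filler):
                continue
            rest = rest[len(filler):]
            if len(rest) < MIN_PART:
                continue
            for tail in candidate_splits(rest, freq):
                yield [head] + tail


def split_compound(word, freq):
    """Return the split whose parts have the highest geometric-mean frequency.

    The unsplit word is kept when no split scores strictly higher than the
    word's own corpus frequency.
    """
    word = word.lower()
    best, best_score = [word], float(freq.get(word, 0))
    for parts in candidate_splits(word, freq):
        score = prod(freq[p] for p in parts) ** (1.0 / len(parts))
        if score > best_score:
            best, best_score = parts, score
    return best
```

With hypothetical counts such as `{"aktion": 900, "plan": 700, "aktionsplan": 2}`, `split_compound("Aktionsplan", freq)` prefers the two frequent parts over the rare full form, while a word that is itself frequent is left unsplit. Replacing rare compounds with frequent parts is one way such a step can lower singleton and OOV counts.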
Anthology ID:
2010.iwslt-papers.6
Volume:
Proceedings of the 7th International Workshop on Spoken Language Translation: Papers
Month:
December 2-3
Year:
2010
Address:
Paris, France
Venue:
IWSLT
SIG:
SIGSLT
Pages:
251–258
URL:
https://aclanthology.org/2010.iwslt-papers.6
Cite (ACL):
Ilknur Durgar El-Kahlout and Francois Yvon. 2010. The pay-offs of preprocessing for German-English statistical machine translation. In Proceedings of the 7th International Workshop on Spoken Language Translation: Papers, pages 251–258, Paris, France.
Cite (Informal):
The pay-offs of preprocessing for German-English statistical machine translation (El-Kahlout & Yvon, IWSLT 2010)
PDF:
https://preview.aclanthology.org/remove-xml-comments/2010.iwslt-papers.6.pdf