Europarl: A Parallel Corpus for Statistical Machine Translation

Philipp Koehn


Abstract
We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translation (SMT). We trained SMT systems for 110 language pairs, which reveal interesting clues into the challenges ahead.
Anthology ID:
2005.mtsummit-papers.11
Volume:
Proceedings of Machine Translation Summit X: Papers
Month:
September 13-15
Year:
2005
Address:
Phuket, Thailand
Venue:
MTSummit
Pages:
79–86
URL:
https://aclanthology.org/2005.mtsummit-papers.11
Cite (ACL):
Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand.
Cite (Informal):
Europarl: A Parallel Corpus for Statistical Machine Translation (Koehn, MTSummit 2005)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2005.mtsummit-papers.11.pdf