A Wikipedia-based Corpus for Contextualized Machine Translation
Jennifer Drexler, Pushpendre Rastogi, Jacqueline Aguilar, Benjamin Van Durme, Matt Post
Abstract
We describe a corpus for target-contextualized machine translation (MT), where the task is to improve the translation of source documents using language models built over presumably related documents in the target language. The idea presumes a situation where most of the information about a topic is in a foreign language, yet some related target-language information is known to exist. Our corpus comprises a set of curated English Wikipedia articles describing news events, along with (i) their Spanish counterparts and (ii) some of the Spanish source articles cited within them. In experiments, we translate these Spanish documents, treating the English articles as target-side context, and evaluate the effect on translation quality of including target-side language models built over this English context and interpolated with other, separately derived language model data. We find that even this simplistic baseline approach achieves significant improvements as measured by BLEU score.
- Anthology ID:
- L14-1150
- Volume:
- Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
- Month:
- May
- Year:
- 2014
- Address:
- Reykjavik, Iceland
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 3593–3596
- URL:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/1217_Paper.pdf
- Cite (ACL):
- Jennifer Drexler, Pushpendre Rastogi, Jacqueline Aguilar, Benjamin Van Durme, and Matt Post. 2014. A Wikipedia-based Corpus for Contextualized Machine Translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3593–3596, Reykjavik, Iceland. European Language Resources Association (ELRA).
- Cite (Informal):
- A Wikipedia-based Corpus for Contextualized Machine Translation (Drexler et al., LREC 2014)