Domain Adaptation in MT Using Titles in Wikipedia as a Parallel Corpus: Resources and Evaluation

Gorka Labaka; Iñaki Alegría; Kepa Sarasola

Abstract

This paper presents how a state-of-the-art SMT system is enriched by using extra in-domain parallel corpora extracted from Wikipedia. We collect corpora from parallel titles and from parallel fragments in comparable articles from Wikipedia. We carried out an evaluation with a double objective: evaluating the quality of the extracted data and evaluating the improvement due to the domain adaptation. We think this can be very useful for languages with a limited amount of parallel corpora, where in-domain data is crucial to improve the performance of MT systems. The experiments on the Spanish-English language pair improve a baseline trained with the Europarl corpus by more than 2 BLEU points when translating in the Computer Science domain.

Anthology ID: L16-1351
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 2209–2213
URL: https://aclanthology.org/L16-1351
Cite (ACL): Gorka Labaka, Iñaki Alegria, and Kepa Sarasola. 2016. Domain Adaptation in MT Using Titles in Wikipedia as a Parallel Corpus: Resources and Evaluation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2209–2213, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): Domain Adaptation in MT Using Titles in Wikipedia as a Parallel Corpus: Resources and Evaluation (Labaka et al., LREC 2016)
PDF: https://preview.aclanthology.org/remove-xml-comments/L16-1351.pdf
