Using Comparable Corpora to Adapt a Translation Model to Domains

Hiroyuki Kaji; Takashi Tsunakawa; Daisuke Okada

Using Comparable Corpora to Adapt a Translation Model to Domains

Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada

Abstract

Statistical machine translation (SMT) requires a large parallel corpus, which is available only for restricted language pairs and domains. To expand the language pairs and domains to which SMT is applicable, we created a method for estimating translation pseudo-probabilities from bilingual comparable corpora. The essence of our method is to calculate pairwise correlations between the words associated with a source-language word, presently restricted to a noun, and its translations; word translation pseudo-probabilities are calculated based on the assumption that the more associated words a translation is correlated with, the higher its translation probability. We also describe a method we created for calculating noun-sequence translation pseudo-probabilities based on occurrence frequencies of noun sequences and constituent-word translation pseudo-probabilities. Then, we present a framework for merging the translation pseudo-probabilities estimated from in-domain comparable corpora with a translation model learned from an out-of-domain parallel corpus. Experiments using Japanese and English comparable corpora of scientific paper abstracts and a Japanese-English parallel corpus of patent abstracts showed promising results; the BLEU score was improved to some degree by incorporating the pseudo-probabilities estimated from the in-domain comparable corpora. Future work includes an optimization of the parameters and an extension to estimate translation pseudo-probabilities for verbs.

Anthology ID:: L10-1306
Volume:: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:: May
Year:: 2010
Address:: Valletta, Malta
Editors:: Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:: LREC
SIG:
Publisher:: European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:: http://www.lrec-conf.org/proceedings/lrec2010/pdf/443_Paper.pdf
DOI:
Bibkey:
Cite (ACL):: Hiroyuki Kaji, Takashi Tsunakawa, and Daisuke Okada. 2010. Using Comparable Corpora to Adapt a Translation Model to Domains. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):: Using Comparable Corpora to Adapt a Translation Model to Domains (Kaji et al., LREC 2010)
Copy Citation:
PDF:: http://www.lrec-conf.org/proceedings/lrec2010/pdf/443_Paper.pdf

PDF Search