Takashi Tsunakawa

Statistical machine translation (SMT) requires a large parallel corpus, which is available only for restricted language pairs and domains. To expand the language pairs and domains to which SMT is applicable, we created a method for estimating translation pseudo-probabilities from bilingual comparable corpora. The essence of our method is to calculate pairwise correlations between the words associated with a source-language word, presently restricted to a noun, and its translations; word translation pseudo-probabilities are calculated based on the assumption that the more associated words a translation is correlated with, the higher its translation probability. We also describe a method we created for calculating noun-sequence translation pseudo-probabilities based on occurrence frequencies of noun sequences and constituent-word translation pseudo-probabilities. Then, we present a framework for merging the translation pseudo-probabilities estimated from in-domain comparable corpora with a translation model learned from an out-of-domain parallel corpus. Experiments using Japanese and English comparable corpora of scientific paper abstracts and a Japanese-English parallel corpus of patent abstracts showed promising results; the BLEU score was improved to some degree by incorporating the pseudo-probabilities estimated from the in-domain comparable corpora. Future work includes an optimization of the parameters and an extension to estimate translation pseudo-probabilities for verbs.

Augmenting a Bilingual Lexicon with Information for Word Translation Disambiguation
Takashi Tsunakawa | Hiroyuki Kaji
Proceedings of the Eighth Workshop on Asian Language Resources

2008

Improving English-to-Chinese Translation for Technical Terms using Morphological Information
Xianchao Wu | Naoaki Okazaki | Takashi Tsunakawa | Jun’ichi Tsujii
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

The continuous emergence of new technical terms and the difficulty of keeping up with neologisms in parallel corpora degrade the performance of statistical machine translation (SMT) systems. This paper explores the use of morphological information to improve English-to-Chinese translation of technical terms. To reduce morpheme-level translation ambiguity, we group morphemes into morpheme phrases and propose the use of domain information for translation candidate selection. To find correspondences between morpheme phrases in the source and target languages, we propose an algorithm that mines morpheme phrase translation pairs from a bilingual lexicon. We also build a cascaded translation model that dynamically shifts translation units from the phrase level to the word and morpheme phrase levels. The experimental results show significant improvements over current phrase-based SMT systems.

Building a Bilingual Lexicon Using Phrase-based Statistical Machine Translation via a Pivot Language
Takashi Tsunakawa | Naoaki Okazaki | Jun’ichi Tsujii
Coling 2008: Companion volume: Posters

Bilingual Synonym Identification with Spelling Variations
Takashi Tsunakawa | Jun’ichi Tsujii
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

Building Bilingual Lexicons using Lexical Translation Probabilities via Pivot Languages
Takashi Tsunakawa | Naoaki Okazaki | Jun’ichi Tsujii
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper proposes a method for increasing the size of a bilingual lexicon obtained from two other bilingual lexicons via a pivot language. This approach faces two main challenges: ambiguity and mismatch of terms. We target the latter problem by improving the utilization ratio of the bilingual lexicons. Given two bilingual lexicons for the language pairs Lf-Lp and Lp-Le, we compute lexical translation probabilities of word pairs by using a statistical word-alignment model and term decomposition/composition techniques. We compare three approaches to generating the bilingual lexicon: exact merging, word-based merging, and our proposed alignment-based merging. In our method, we combine lexical translation probabilities and a simple language model to estimate the probabilities of translation pairs. The experimental results show that our method substantially increases the number of translation terms compared with exact merging and word-based merging. Additionally, we evaluate and discuss the quality of the translation outputs.
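As a rough illustration of the baseline the abstract calls exact merging, the sketch below joins an Lf-Lp lexicon and an Lp-Le lexicon on identical pivot terms. The function name `exact_merge`, the pair-list representation, and the toy entries are assumptions made for illustration only; they are not taken from the paper, and the word-based and alignment-based methods are not shown.

```python
from collections import defaultdict


def exact_merge(lf_lp, lp_le):
    """Join two lexicons on identical pivot terms (hypothetical helper).

    lf_lp: iterable of (source_term, pivot_term) pairs
    lp_le: iterable of (pivot_term, target_term) pairs
    Returns the set of (source_term, target_term) pairs that share a pivot.
    """
    # Index the second lexicon by pivot term for constant-time lookup.
    pivot_to_targets = defaultdict(set)
    for pivot, target in lp_le:
        pivot_to_targets[pivot].add(target)

    merged = set()
    for source, pivot in lf_lp:
        # A source term contributes only if its pivot term appears verbatim
        # in the second lexicon; otherwise the entry goes unused.
        for target in pivot_to_targets.get(pivot, ()):
            merged.add((source, target))
    return merged


# Toy data, invented for illustration.
lf_lp = [("f_bank", "bank"), ("f_river_bank", "river bank")]
lp_le = [("bank", "e_bank"), ("bank", "e_shore")]

print(sorted(exact_merge(lf_lp, lp_le)))
```

In this toy case the entry with pivot term "river bank" is dropped because no identical pivot term exists in the second lexicon, which is one way the mismatch problem mentioned in the abstract can arise. The pivot term "bank" also maps to two targets, which hints at the ambiguity problem.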
