This paper proposes a method that uses an existing Rule-Based Machine Translation (RBMT) system as a black box to produce a synthetic bilingual corpus, which is then used as training data for a Statistical Machine Translation (SMT) system.
With the synthetic bilingual corpus, we can build an SMT system even if there is no real bilingual corpus.
In our experiments using BLEU as the metric, such a system achieves a relative improvement of 11.7% over the best RBMT system used to produce the synthetic bilingual corpora.
We also interpolate the model trained on a real bilingual corpus and the models trained on the synthetic bilingual corpora.
The interpolated model achieves an absolute improvement of 0.0245 BLEU score (13.1% relative) as compared with the individual model trained on the real bilingual corpus.
1 Introduction
Within the Machine Translation (MT) field, by far the most dominant paradigm is SMT, but many existing commercial systems are rule-based.
In this research, we are interested in answering the question of whether the existing RBMT systems could be helpful to the development of an SMT system.
To find the answer, let us first consider the following facts:
• Existing RBMT systems are usually provided as a black box.
To make use of such systems, the most convenient way is to work on their translation results directly.
• SMT methods rely on bilingual corpora.
As a data-driven method, SMT usually needs a large bilingual corpus as training data.
Based on the above facts, in this paper we propose a method using an existing RBMT system as a black box to produce a synthetic bilingual corpus1, which is used as the training data for the SMT system.
For a given language pair, the monolingual corpus is usually much larger than the real bilingual corpus.
We use the existing RBMT system to translate the monolingual corpus into synthetic bilingual corpus.
Then, even if there is no real bilingual corpus, we can train an SMT system with the monolingual corpus and the synthetic bilingual corpus.
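The pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's actual tooling: `rbmt_translate` is a hypothetical stand-in for the black-box RBMT system, which in practice is an external commercial program.

```python
def build_synthetic_corpus(monolingual_sentences, rbmt_translate):
    """Pair each source sentence with its RBMT translation.

    `rbmt_translate` stands in for the black-box RBMT system: it maps a
    source-language sentence to a target-language string (or None on failure).
    """
    corpus = []
    for src in monolingual_sentences:
        tgt = rbmt_translate(src)
        if tgt:  # skip sentences the system fails to translate
            corpus.append((src, tgt))
    return corpus

# Toy stand-in for an RBMT black box (real systems are external programs).
toy_rbmt = {"good morning": "zao shang hao"}.get
pairs = build_synthetic_corpus(["good morning"], toy_rbmt)
```

The resulting `(source, translation)` pairs play exactly the role of a real sentence-aligned bilingual corpus in the downstream SMT training.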
If there exist n available RBMT systems for the desired language pair, we use the n systems to produce n synthetic bilingual corpora, and n translation models are trained with the n corpora respectively.
We name such a model the synthetic model.
An interpolated translation model is built by linearly interpolating the n synthetic models.
In our experiments using BLEU (Papineni et al., 2002) as the metric, the interpolated synthetic model achieves a relative improvement of 11.7% over the best RBMT system that is used to produce the synthetic bilingual corpora.
1 In this paper, to distinguish it from a real bilingual corpus, a bilingual corpus generated by an RBMT system is called a synthetic bilingual corpus.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 287-295, Prague, June 2007.
©2007 Association for Computational Linguistics
Moreover, if a real bilingual corpus is available for the desired language pair, we build another translation model, which is named the standard model.
Then we can build an interpolated model by interpolating the standard model and the synthetic models.
Experimental results show that the interpolated model achieves an absolute improvement of 0.0245 BLEU score (13.1% relative) as compared with the standard model.
The remainder of this paper is organized as follows.
In section 2 we summarize the related work.
We then describe our method of using RBMT systems to produce a bilingual corpus for SMT in section 3.
Section 4 describes the resources used in the experiments.
Section 5 presents the experimental results, followed by the discussion in section 6.
Finally, we conclude and present the future work in section 7.
2 Related Work
In the MT field, by far the most dominant paradigm is SMT.
SMT has evolved from the original word-based approach (Brown et al., 1993) into phrase-based approaches (Koehn et al., 2003; Och and Ney, 2004) and syntax-based approaches (Wu, 1997; Alshawi et al., 2000; Yamada and Knight, 2001; Chiang, 2005).
On the other hand, much important work continues to be carried out in Example-Based Machine Translation (EBMT) (Carl et al., 2005; Way and Gough, 2005), and many existing commercial systems are rule-based.
Although we are not aware of any previous attempt to use an existing RBMT system as a black box to produce synthetic bilingual training corpus for general purpose SMT systems, there exists a great deal of work on MT hybrids and Multi-Engine Machine Translation (MEMT).
One earlier study combined an interlingua-based framework with phrase-based SMT for spoken language translation in a limited domain.
They automatically generated a corpus of English-Chinese pairs from the same interlingual representation by parsing the English corpus and then paraphrasing each utterance into both English and Chinese.
Frederking and Nirenburg (1994) produced the first MEMT system by combining outputs from three different MT engines based on their knowledge of the inner workings of the engines.
Nomoto (2004) used voted language models to select the best output string at sentence level.
Some recent approaches to MEMT used word alignment techniques for comparison between the MT systems (Jayaraman and Lavie, 2005; Zaanen and Somers, 2005).
These systems operate on MT outputs for complete input sentences.
Mellebeek et al. (2006) presented a different approach, using a recursive decomposition algorithm that produces simple chunks as input to the MT engines.
A consensus translation is produced by combining the best chunk translations.
This paper uses RBMT outputs to improve the performance of SMT systems.
Instead of RBMT outputs, other researchers have used SMT outputs to boost translation quality.
Callison-Burch and Osborne (2003) used co-training to extend existing parallel corpora, wherein machine translations are selectively added to training corpora with multiple source texts.
They also created training data for a language pair without a parallel corpus by using multiple source texts.
Ueffing (2006) explored monolingual source-language data to improve an existing machine translation system via self-training.
The source data is translated by an SMT system, and reliable translations are automatically identified.
Both of the methods improved translation quality.
3 Using RBMT Systems to Produce Bilingual Corpus for SMT
3.1 Translation Model
In this paper, we use the synthetic and real bilingual corpora to train phrase-based translation models.
According to the translation model presented in (Koehn et al., 2003), given a source sentence f, the best target translation e_best can be obtained using the following model:

e_{best} = \arg\max_{e} p(e|f) = \arg\max_{e} p(f|e)\, p_{LM}(e)\, \omega^{length(e)}    (1)

where the translation model p(f|e) can be decomposed into

p(\bar{f}_1^I | \bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{f}_i | \bar{e}_i)\, d(a_i - b_{i-1})\, p_w(\bar{f}_i | \bar{e}_i, a)^{\lambda}    (2)

where \phi(\bar{f}_i | \bar{e}_i) is the phrase translation probability, a_i denotes the start position of the source phrase translated into the i-th target phrase, b_{i-1} denotes the end position of the source phrase translated into the (i-1)-th target phrase, and d(a_i - b_{i-1}) is the distortion probability. p_w(\bar{f}_i | \bar{e}_i, a) is the lexical weight, and \lambda is the strength of the lexical weight.
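As an illustration, the per-phrase factors in this decomposition can be accumulated in log space for one given segmentation. The phrase table, lexical weights, and distortion model below are toy stand-ins, not values from the paper:

```python
import math

def translation_model_score(phrase_pairs, phi, p_w, distortion, lam=0.5):
    """Log-score one segmentation under the phrase-based decomposition.

    phrase_pairs: list of (f_phrase, e_phrase, start, end), where start/end
    are the source-side positions a_i and b_i of the i-th phrase.
    phi, p_w: dicts mapping (f_phrase, e_phrase) to probabilities.
    distortion: function computing d(a_i - b_{i-1}).
    lam: strength of the lexical weight (lambda).
    """
    score, prev_end = 0.0, 0
    for f, e, start, end in phrase_pairs:
        score += math.log(phi[(f, e)])            # phrase translation prob
        score += math.log(distortion(start - prev_end))  # distortion
        score += lam * math.log(p_w[(f, e)])      # lexical weight
        prev_end = end
    return score

# Toy usage: one phrase pair covering source positions 1..2, with a
# hypothetical exponential distortion model d(x) = 0.5**|x - 1|.
score = translation_model_score(
    [("f1", "e1", 1, 2)],
    phi={("f1", "e1"): 0.5},
    p_w={("f1", "e1"): 0.25},
    distortion=lambda x: 0.5 ** abs(x - 1),
    lam=0.5,
)
```

Working in log space avoids underflow when many phrase probabilities are multiplied.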
3.2 Interpolated Models
We train synthetic models with the synthetic bilingual corpus produced by the RBMT systems.
We can also train a translation model, namely the standard model, if a real bilingual corpus is available.
In order to make full use of these two kinds of corpora, we conduct linear interpolation between the corresponding models.
In this paper, the distortion probability in equation (2) is estimated during decoding, using the same method as described in Pharaoh (Koehn, 2004).
For the phrase translation probability and the lexical weight, we interpolate them as shown in (3) and (4):

\phi(\bar{f} | \bar{e}) = \sum_{i=0}^{n} \alpha_i\, \phi_i(\bar{f} | \bar{e})    (3)

p_w(\bar{f} | \bar{e}, a) = \sum_{i=0}^{n} \beta_i\, p_{w,i}(\bar{f} | \bar{e}, a)    (4)

where \phi_0(\bar{f} | \bar{e}) and p_{w,0}(\bar{f} | \bar{e}, a) denote the phrase translation probability and lexical weight trained with the real bilingual corpus, respectively, and \phi_i(\bar{f} | \bar{e}) and p_{w,i}(\bar{f} | \bar{e}, a) (i = 1, ..., n) denote the phrase translation probabilities and lexical weights estimated from the n synthetic corpora produced by the RBMT systems. \alpha_i and \beta_i are interpolation coefficients, ensuring \sum_{i=0}^{n} \alpha_i = 1 and \sum_{i=0}^{n} \beta_i = 1.
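A minimal sketch of this linear interpolation, assuming each model's phrase table is represented as a dict from phrase pairs to probabilities (a pair absent from a model contributes probability 0; the toy entries are hypothetical):

```python
def interpolate_tables(tables, weights):
    """Linearly interpolate phrase tables: phi = sum_i alpha_i * phi_i.

    tables: list of dicts mapping (src_phrase, tgt_phrase) -> probability.
    weights: interpolation coefficients, which must sum to 1.
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    merged = {}
    for table, w in zip(tables, weights):
        for pair, prob in table.items():
            merged[pair] = merged.get(pair, 0.0) + w * prob
    return merged

# Hypothetical entries: the standard model and one synthetic model.
standard = {("house", "fangzi"): 0.8}
synthetic = {("house", "fangzi"): 0.4, ("after-sale", "shouhou"): 0.6}
phi = interpolate_tables([standard, synthetic], [0.7, 0.3])
```

Note how interpolation both reweights pairs shared by the models and carries over pairs known only to the synthetic model, which is exactly the effect analyzed in sections 5 and 6.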
4 Resources Used in Experiments
In the experiments, we take English-Chinese translation as a case study.
The real bilingual corpus includes 494,149 English-Chinese bilingual sentence pairs.
The monolingual English corpus is selected from the English Gigaword Second Edition, which is provided by Linguistic Data Consortium (LDC) (catalog number LDC2005T12).
The selected monolingual corpus includes 1,087,651 sentences.
For language model training, we use part of the Chinese Gigaword Second Edition provided by LDC (catalog number LDC2005T14).
We use 41,418 documents selected from the ZaoBao Newspaper and 992,261 documents from the XinHua News Agency to train the Chinese language model, amounting to 5,398,616 sentences.
The test set and the development set are from the HTRDP 2 evaluation of machine translation.
They can be obtained from the Chinese Linguistic Data Consortium (catalog number 2005-863-001).
We use the 494 sentences in the test set and the 278 sentences in the development set.
Each source sentence in the test set and the development set has 4 different references.
In this paper, we use two off-the-shelf commercial English to Chinese RBMT systems to produce the synthetic bilingual corpus.
We also need a trainer and a decoder to perform phrase-based SMT.
We use Koehn's training scripts 3 to train the translation model, and the SRILM toolkit (Stolcke, 2002) to train the language model.
For the decoder, we use Pharaoh (Koehn, 2004).
We run the decoder with its default settings (maximum phrase length 7) and then use Koehn's implementation of minimum error rate training (Och, 2003) to tune the feature weights on the development set.
2 The full name of HTRDP is the National High Technology Research and Development Program of China, also named the 863 Program.
3 It is located at http://www.statmt.org/wmt06/shared-task/baseline.html.
The translation quality is evaluated using a well-established automatic measure: BLEU score (Papineni et al., 2002).
We use the same method described in (Koehn and Monz, 2006) to perform the significance test.
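A common way to test significance between two MT systems is paired bootstrap resampling; the sketch below illustrates the idea under the assumption of a corpus-level `bleu` function (here a toy exact-match stand-in, not a real BLEU implementation):

```python
import random

def paired_bootstrap(hyps_a, hyps_b, refs, bleu, samples=1000, seed=0):
    """Estimate how often system A beats system B on resampled test sets.

    bleu: a corpus-level metric taking (hypotheses, references); a
    stand-in here for a real BLEU implementation.
    """
    rng = random.Random(seed)
    n, wins_a = len(refs), 0
    for _ in range(samples):
        # Resample the test set with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        sa = bleu([hyps_a[i] for i in idx], [refs[i] for i in idx])
        sb = bleu([hyps_b[i] for i in idx], [refs[i] for i in idx])
        if sa > sb:
            wins_a += 1
    return wins_a / samples

# Toy metric: fraction of exact matches (stands in for corpus BLEU).
accuracy = lambda hyps, refs: sum(h == r for h, r in zip(hyps, refs)) / len(refs)
frac = paired_bootstrap(["a", "b"], ["x", "y"], ["a", "b"], accuracy, samples=50)
```

If system A wins on, say, 95% or more of the resampled test sets, the difference is conventionally reported as significant at that level.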
5 Experimental Results
5.1 Results on Synthetic Corpus Only
We translate the monolingual English corpus and the English side of the real bilingual corpus into Chinese using the two commercial RBMT systems, producing two synthetic bilingual corpora.
With the corpora, we train two synthetic models as described in section 3.1.
Based on the synthetic models, we also perform linear interpolation as shown in section 3.2, without the standard models.
We tune the interpolation weights using the development set, and achieve the best performance when α1 = 0.58, α2 = 0.42, β1 = 0.58, and β2 = 0.42.
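Tuning such weights on the development set can be done with a simple grid search; a sketch for the two-model case, where `dev_score` is a hypothetical stand-in for decoding the development set with given weights and measuring BLEU:

```python
def tune_weight(dev_score, step=0.02):
    """Grid-search alpha_1 in [0, 1], with alpha_2 = 1 - alpha_1.

    dev_score: maps (alpha1, alpha2) to a development-set score; a
    stand-in for a decode-and-evaluate cycle.
    """
    best_a, best_s = 0.0, float("-inf")
    steps = int(round(1.0 / step))
    for k in range(steps + 1):
        a1 = k * step
        s = dev_score(a1, 1.0 - a1)
        if s > best_s:
            best_a, best_s = a1, s
    return best_a, best_s

# Toy objective peaking at alpha_1 = 0.58 (purely illustrative).
best_a, best_s = tune_weight(lambda a1, a2: -(a1 - 0.58) ** 2)
```

Each probe of `dev_score` is expensive (a full decode of the development set), which is why a coarse grid over a single free parameter is practical here.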
The translation results on the test set are shown in Table 1.
Synthetic model 1 and 2 are trained using the synthetic bilingual corpora produced by RBMT system 1 and RBMT system 2, respectively.
[Table 1. Translation Results Using Synthetic Bilingual Corpus: BLEU scores of RBMT system 1, RBMT system 2, the two synthetic models, and the interpolated synthetic model]
From the results, it can be seen that the interpolated synthetic model obtains the best result, with an absolute improvement of 0.0197 BLEU (11.7% relative) as compared with RBMT system 1, and 0.0425 BLEU (29.2% relative) as compared with RBMT system 2.
It is very promising that our method can build an SMT system that significantly outperforms both of the two RBMT systems, using only the synthetic bilingual corpora produced by those systems.
5.2 Results on Real and Synthetic Corpus
With the real bilingual corpus, we build a standard model.
We interpolate the standard model with the two synthetic models built in section 5.1 to obtain
interpolated models.
The translation results are shown in Table 2.
The same interpolation coefficients are used for both the phrase translation probabilities and the lexical weights.
They are also tuned using the development set.
From the results, it can be seen that all the three interpolated models perform not only better than the RBMT systems but also better than the SMT system trained on the real bilingual corpus.
The interpolated model combining the standard model and the two synthetic models performs the best, achieving a statistically significant improvement of about 0.0245 BLEU (13.1% relative) as compared with the standard model with no synthetic corpus.
It also achieves 26.1% and 45.8% relative improvement as compared with the two RBMT systems respectively.
The results indicate that using the corpus produced by RBMT systems, the performance of the SMT system can be greatly improved.
The results also indicate that the more the RBMT systems are used, the better the translation quality is.
[Table 2. Translation Results Using Standard and Synthetic Bilingual Corpus: BLEU scores of the standard model and the three interpolated models]
5.3 Effect of Synthetic Corpus Size
To explore the relationship between the translation quality and the scale of the synthetic bilingual corpus, we interpolate the standard model with the synthetic models trained with synthetic bilingual corpus of different sizes.
In order to simplify the procedure, we only use RBMT system 1 to translate the 1,087,651 monolingual English sentences to produce the synthetic bilingual corpus.
We use 20%, 40%, 60%, 80%, and 100% of the synthetic bilingual corpus to train different synthetic models.
The translation results of the interpolated models are shown in Figure 1.
The results indicate that the larger the synthetic bilingual corpus is, the better the translation performance is.
[Figure 1. Comparison of Translation Results Using Synthetic Bilingual Corpus of Different Sizes]
5.4 Effect of Real Bilingual Corpus Size
Another issue is the relationship between the SMT performance and the size of the real bilingual corpus.
To train different standard models, we randomly build five corpora of different sizes, which contain 20%, 40%, 60%, 80%, and 100% sentence pairs of the real bilingual corpus, respectively.
As to the synthetic model, we use the same synthetic model 1 that is described in section 5.1.
Then we build five interpolated models by performing linear interpolation between the synthetic model and the five standard models respectively.
The translation results are shown in Figure 2.
From the results, we can see that the larger the real bilingual corpus is, the better the performance of both standard models and interpolated models would be.
The relative improvement in BLEU score of the interpolated models is up to 27.5% as compared with the corresponding standard models.
5.5 Results without Additional Monolingual Corpus
In all the above experiments, we use an additional English monolingual corpus to get more synthetic bilingual corpus.
We are also interested in the results without the additional monolingual corpus.
In this case, the only English monolingual corpus is the English side of the real bilingual corpus.
We use this smaller monolingual corpus and the real bilingual corpus to conduct experiments similar to those in section 5.2.
The translation results are shown in Table 3.
From the results, it can be seen that our method works well even if no additional monolingual corpus is available.
We achieve a statistically significant improvement of about 0.01 BLEU (5.2% relative) as compared with the standard model without using the synthetic corpus.
[Figure 2. Comparison of Translation Results Using Real Bilingual Corpus of Different Sizes]
[Table 3. Translation Results without Additional Monolingual Corpus: BLEU scores and interpolation coefficients of the standard and interpolated models]
[Table 4. Numbers of Phrase Pairs in the standard model and synthetic models 1 and 2]
In order to further analyze the translation results, we examine the overlap and the difference among the phrase tables.
The analytic results are shown in Table 4.
More phrase pairs are extracted by the synthetic models than by the standard model; synthetic model 1 in particular extracts about twice as many.
The overlap between the models is very low.
For example, only about 6% of the phrase pairs extracted by the standard model also appear in synthetic model 1.
This also explains why the interpolated model outperforms the standard model in Table 3.
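The overlap statistics above can be computed directly from the phrase tables; a sketch, representing each table as a set of phrase pairs (the toy pairs are hypothetical):

```python
def overlap_stats(table_a, table_b):
    """Fraction of each table's phrase pairs that also occur in the other.

    table_a, table_b: sets of (src_phrase, tgt_phrase) pairs.
    Returns (share of table_a that is shared, share of table_b that is shared).
    """
    shared = table_a & table_b
    return len(shared) / len(table_a), len(shared) / len(table_b)

# Hypothetical toy tables.
standard = {("a", "x"), ("b", "y"), ("c", "z")}
synthetic = {("a", "x"), ("d", "w")}
frac_std, frac_syn = overlap_stats(standard, synthetic)
```

A low shared fraction on the standard-model side, as reported above, means most synthetic-model pairs are genuinely new material for the interpolated model.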
[Table 5. Translation Example: an English sentence and its Chinese translations, with BLEU scores, produced by the standard model, RBMT system 1, and RBMT system 2]
Example sentence: "This move helps spur the enterprise to strengthen technical innovation, management innovation and the creation of a brand name and to strengthen marketing, after-sale service, thereby fundamentally enhance the enterprise's competitiveness;"
[Figure 3. Phrase Pairs Used for Translation: (a) results produced by the standard model; (b) results produced by the interpolated model]
6 Discussion
6.1 Model Interpolation vs. Corpus Merge
In section 5, we make use of the real bilingual corpus and the synthetic bilingual corpora by performing model interpolation.
Another possible way is to directly combine these two kinds of corpora to train a translation model; we call this corpus merge.
In order to compare these two methods, we use RBMT system 1 to translate the 1,087,651 monolingual English sentences to produce synthetic bilingual corpus.
Then we train an SMT system with the combination of this synthetic bilingual corpus and the real bilingual corpus.
The BLEU score of such a system is 0.1887, while that of the model interpolation system is 0.2020.
This indicates that the model interpolation method is significantly better than the corpus merge method.
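The contrast between the two methods can be made concrete: corpus merge pools counts, so the larger corpus implicitly dominates the probability estimates, whereas model interpolation fixes each model's contribution explicitly. A toy sketch, assuming simple relative-frequency estimation (the phrase pairs and counts are hypothetical):

```python
from collections import Counter

def merge_estimate(counts_list):
    """Corpus merge: pool phrase-pair counts from all corpora, then take
    relative frequencies per source phrase. The larger corpus implicitly
    dominates the estimate."""
    pooled = Counter()
    for counts in counts_list:
        pooled.update(counts)
    src_totals = Counter()
    for (src, _tgt), n in pooled.items():
        src_totals[src] += n
    return {pair: n / src_totals[pair[0]] for pair, n in pooled.items()}

# A real corpus with 1 occurrence of a pair vs. a synthetic corpus with 9
# occurrences of a competing pair for the same source phrase.
real = Counter({("house", "fangzi"): 1})
synthetic = Counter({("house", "wuzi"): 9})
merged = merge_estimate([real, synthetic])
# The real corpus's pair is pushed down to 0.1; under model interpolation
# with weights (0.5, 0.5) it would keep probability 0.5.
```

This illustrates one plausible reason interpolation outperforms merging here: it shields the estimates from the much larger, noisier synthetic corpus.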
As discussed in Section 5.5, the overlap of phrase pairs between the standard model and the synthetic models is very small.
The phrase pairs newly added from the synthetic models help improve the translation results of the interpolated model.
In this section, we will use an example to further discuss the reason behind the improvement of the SMT system by using synthetic bilingual corpus.
Table 5 shows an English sentence and its Chinese translations produced by different methods.
And Figure 3 shows the phrase pairs used for translation.
The results show that imperfect translations of RBMT systems can be also used to boost the performance of an SMT system.
[Table 6. Statistics of Phrase Pairs: numbers of phrase pairs and new pairs used by the standard model and the interpolated model]
Further analysis is shown in Table 6.
After adding the synthetic corpus produced by the RBMT systems, the interpolated model outperforms the standard models mainly for the following two reasons: (1) some new phrase pairs are added into the interpolated model.
37.6% of the phrase pairs (1,993 out of 5,306) are newly learned and used for translation.
For example, the phrase pair "after-sale service <-> 售后服务 (shouhoufuwu)" is added; (2) the probability distribution of the phrase pairs is changed.
For example, the probabilities of the two pairs "a brand name <-> 品牌 (pinpai)" and "and the creation of <-> 和创造 (he chuangzao)" increase.
The probabilities of the other two pairs "brand name <-> 品牌 (pinpai)" and "and the creation of a <-> 和建立 (he jianli)" decrease.
We found that 930 phrase pairs, which are also in the phrase table of the standard model, are used by the interpolated model for translation but not used by the standard model.
According to (Koehn and Monz, 2006; Callison-Burch et al., 2006), RBMT systems are usually underrated by BLEU.
We also manually evaluated the RBMT systems and SMT systems in terms of both adequacy and fluency as defined in (Koehn and Monz, 2006).
The evaluation results show that the SMT system with the interpolated model, which achieves the highest BLEU scores in Table 2, achieves slightly better adequacy and fluency scores than the two RBMT systems.
7 Conclusion and Future Work
We presented a method using an existing RBMT system as a black box to produce a synthetic bilingual corpus, which was used as training data for an SMT system.
We used the existing RBMT system to translate the monolingual corpus into a synthetic bilingual corpus.
With the synthetic bilingual corpus, we could build an SMT system even if there is no real bilingual corpus.
In our experiments using BLEU as the metric, such a system achieves a relative improvement of 11.7% over the best RBMT system that is used to produce the synthetic bilingual corpora.
This indicates that, by using existing RBMT systems to produce a synthetic bilingual corpus, we can build an SMT system that outperforms those RBMT systems.
We also interpolated the model trained on a real bilingual corpus with the models trained on the synthetic bilingual corpora; the interpolated model achieves an absolute improvement of 0.0245 BLEU (13.1% relative) as compared with the individual model trained on the real bilingual corpus.
This indicates that we can build a better SMT system by leveraging both the real and the synthetic bilingual corpora.
Further result analysis shows that after adding the synthetic corpus produced by the RBMT systems, the interpolated model outperforms the standard models mainly because of two reasons: (1) some new phrase pairs are added to the interpolated model; (2) the probability distribution of the phrase pairs is changed.
In future work, we will investigate the possibility of training a reverse SMT system with the RBMT systems.
For example, we will investigate training a Chinese-to-English SMT system based on natural English and RBMT-generated synthetic Chinese.
