1 Introduction
We propose a domain specific model for statistical machine translation.
It is well known that domain specific language models perform well in automatic speech recognition.
We show that domain specific language and translation models also benefit statistical machine translation.
However, there are two problems with using domain specific models.
The first is the data sparseness problem.
We employ an adaptation technique to overcome this problem.
The second issue is domain prediction.
In order to perform adaptation, the domain must be provided; however, in many cases, the domain is not known or changes dynamically.
For these cases, not only the translation target sentence but also the domain must be predicted.
This paper focuses on the domain prediction problem for statistical machine translation.
In the proposed method, a bilingual training corpus is automatically clustered into sub-corpora.
Each sub-corpus is deemed to be a domain.
The domain of a source sentence is predicted by using its similarity to the sub-corpora.
The language and translation models specific to the predicted domain (sub-corpus) are then used for the translation decoding.
This approach gave an improvement of 2.7 in BLEU (Papineni et al., 2002) score on the IWSLT05 Japanese to English evaluation corpus (improving the score from 52.4 to 55.1).
This is a substantial gain and indicates the validity of the proposed bilingual cluster based models.
Statistical models, such as n-gram models, are widely used in natural language processing, for example in speech recognition and statistical machine translation (SMT).
The performance of a statistical model has been shown to improve when domain specific models are used, since the statistical characteristics of the model then match those of the target more closely.
To use domain specific models, the problems of training data sparseness and target domain estimation must be resolved.
In this paper, we estimate the target domain sentence by sentence, considering cases where the domain changes dynamically.
After sentence by sentence domain estimation, domain specific models are used for translation using the adaptation technique (Seymore et al., 1997).
In order to train a classifier to predict the domain, we used an unsupervised clustering technique on an unlabelled bilingual training corpus.
We regarded each cluster (sub-corpus) as a domain.
Prior to translation, the domain of the source sentence is first predicted and this prediction is then used for model selection.
The most similar sub-corpus to the translation source sentence is used to represent its domain.
After the prediction is made, domain specific language and translation models are used for the translation.
In Section 2 we present the formal basis for our domain specific translation method.
In Section 3 we provide a general overview of the two sub-tasks of domain specific translation: domain prediction and domain specific decoding.
Section 4 presents the domain prediction task in depth.
Section 5 describes domain specific decoding in more detail.
Section 6 gives details of the experiments and presents the results.
Finally, Section 7 offers a summary and some concluding remarks.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 514-523, Prague, June 2007.
©2007 Association for Computational Linguistics
2 Domain Specific Models in SMT
The purpose of statistical machine translation is to find the most probable translation in the target language e of a given source language sentence f. This search process can be expressed formally by:
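Reconstructed from the definitions in this section, formula (1) reads as follows (the hat notation for the selected translation is an assumption):

```latex
\begin{equation}
\hat{e} = \operatorname*{argmax}_{e} P(e \mid f) \tag{1}
\end{equation}
```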
In this formula, the target word sequence (sentence) e is determined only by the source language word sequence f. However, e depends heavily not only on f but also on the domain D. When the domain D is given, formula (1) can be rewritten as the following formula with the introduction of a new probabilistic variable D.
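Conditioning formula (1) on a known domain D gives the following reconstruction of formula (2) (notation as assumed for formula (1)):

```latex
\begin{equation}
\hat{e} = \operatorname*{argmax}_{e} P(e \mid f, D) \tag{2}
\end{equation}
```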
This formula can be re-expressed using Bayes' Law.
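Applying Bayes' law to formula (2) and dropping the denominator P(f|D), which does not depend on e, gives the following reconstruction of formula (3). It agrees with the two model terms named in the next sentence:

```latex
\begin{equation}
\hat{e} = \operatorname*{argmax}_{e} P(f \mid e, D)\, P(e \mid D) \tag{3}
\end{equation}
```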
Here, P(f|e, D) represents the domain D specific translation model and P(e|D) represents the domain D specific language model.
When the domain D is known, domain specific models can be created and used in the translation decoding process.
However, in many cases, domain D is unknown or changes dynamically.
In these cases, both the translation target language sentence e and the domain D must be dynamically predicted at the same time.
The following equation represents the process of domain specific translation when the domain D is being dynamically predicted.
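A reconstruction of equation (4), consistent with the two factors described just below it, treats both e and D as search variables (the hat notation is an assumption):

```latex
\begin{equation}
(\hat{e}, \hat{D}) = \operatorname*{argmax}_{e, D} P(e, D \mid f)
                   = \operatorname*{argmax}_{e, D} P(D \mid f)\, P(e \mid f, D) \tag{4}
\end{equation}
```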
The major difference between this equation and formula (3) is that the probabilistic variable D is the prediction target in equation (4).
In this equation, P(D|f) represents the domain prediction and P(e|f, D) represents the domain specific translation.
3 Outline of the Proposed Method
Our method can be divided into two processes: an off-line process and an on-line process.
The processes are depicted in Figure 1.
In the off-line process, bilingual sub-corpora are created by clustering and these clusters represent domains.
Domain specific models are then created from the data contained in the sub-corpora in a batch process.
In the on-line process, the domain of the source sentence is first predicted, and the sentence is then translated using models built on data from the appropriate domain.
In the off-line process, the training corpus is clustered into sub-corpora, which are regarded as domains.
In SMT, a bilingual corpus is used to create the translation model, and typically, bilingual data together with additional monolingual corpora are used to create the language model.
In our method, both the bilingual and monolingual corpora are clustered.
After clustering, cluster dependent (domain specific) language and translation models are created from the data in the clusters.
The bilingual corpus that comprises the training data for the translation model (equivalently, the bilingual part of the training data for the language model) is clustered (see Section 4.2).
Each sentence of the additional monolingual corpora (if any) is assigned to a bilingual cluster (see Section 4.3).
For each cluster, the domain specific (cluster dependent) language models are created.
The domain specific translation model is created using only the clusters formed from clustering bilingual data.
The on-line process comprises the domain prediction and domain specific translation components.
The following steps are taken for each source sentence.
1. Select the cluster to which the source sentence belongs.
2. Translate the source sentence using the appropriate domain specific language and translation models.
4 Domain Prediction
This section details the domain prediction process.
To satisfy equation (4), the domain D and the translation target word sequence e that jointly maximize P(D|f) and P(e|f, D) must be calculated at the same time.
However, it is difficult to make the calculations without an approximation.
Therefore, in the first step, we find the best candidates for D given the input sentence f. In the next step, P(e|f, D) is maximized over the candidates for D using the following formula.
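One reading of equation (5) that is consistent with this two-step description is given below. It is a reconstruction, and it assumes that a single best domain is kept as the candidate:

```latex
\begin{equation}
\hat{e} \approx \operatorname*{argmax}_{e} P(e \mid f, \hat{D}),
\qquad \hat{D} = \operatorname*{argmax}_{D} P(D \mid f) \tag{5}
\end{equation}
```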
Equation (5) is an approximation of the following equation, in which D is regarded as a hidden variable.
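With D treated as a hidden variable, the domain is summed out rather than selected. A reconstruction of equation (6) under that reading is:

```latex
\begin{equation}
\hat{e} = \operatorname*{argmax}_{e} \sum_{D} P(D \mid f)\, P(e \mid f, D) \tag{6}
\end{equation}
```

If one term of the sum has P(D|f) close to one and all others are close to zero, the sum collapses to a single P(e|f, D) term, which is the form of equation (5).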
When the following assumptions are introduced into equation (6), equation (5) is obtained as an approximation.
For only one domain D_i, P(D_i|f) is nearly equal to one.
For the other domains, P(D|f) is almost zero.
P(D|f) can be rewritten as the following equation.
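By Bayes' law (this form is also stated in words in Section 4.4), the equation reads as follows; the number (7) is inferred from the order of the equations:

```latex
\begin{equation}
P(D \mid f) = \frac{P(f \mid D)\, P(D)}{P(f)} \tag{7}
\end{equation}
```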
Therefore, we can confirm the reasonableness of this assumption by calculating P(f|D)P(D) for all domains (P(f) is constant).
4.1 Domain Definition
When the domain is known in advance, it is usually expressible, for example it could be a topic that matches a human-defined category like "sport".
On the other hand, when the domain is delimited in an unsupervised manner, it is used only as a probabilistic variable and does not need to be expressed.
Equation (4) illustrates that a good model will assign high probabilities to P(D|f)P(e|f, D) for bilingual sentence pairs (f, e).
For the same reason, a good domain definition will lead to a higher probability for the term P(D|f)P(e|f, D).
Therefore, we define the domain D as that which maximizes P(D|f)P(e|D) (an approximation of P(D|f)P(e|f, D)).
This approximation means that the domain definition is optimal only for the language model rather than for both the language and translation models.
P(D| f)P(e| D) can be rewritten as the following equation using Bayes' Law.
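Substituting the Bayes expansion of P(D|f) into the product gives the following reconstruction; the number (8) is inferred from the order of the equations:

```latex
\begin{equation}
P(D \mid f)\, P(e \mid D) = \frac{P(e \mid D)\, P(f \mid D)\, P(D)}{P(f)} \tag{8}
\end{equation}
```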
Here, P(f) is independent of domain D. Furthermore, we assume P(D) to be constant.
The following formula embodies the search for the optimal domain.
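Dropping P(f), which is independent of D, and P(D), which is assumed constant, leaves the following reconstruction of the search (the number (9) and the hat notation are assumptions):

```latex
\begin{equation}
\hat{D} = \operatorname*{argmax}_{D} P(e \mid D)\, P(f \mid D) \tag{9}
\end{equation}
```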
This formula ensures that the search for the domain maximizes the domain specific probabilities of both e and f simultaneously.
4.2 Clustering of the bilingual corpus
As mentioned above, we maximize the domain specific probabilities of e and f to ascertain the domain.
We define our domains as sub-corpora of the bilingual corpus, and these sub-corpora are formed by bilingual clustering based on entropy reduction.
For this clustering, the following extension of monolingual corpus clustering (Carter 1994) is employed.
1. The total number of clusters (domains) is given by the user.
2. Each bilingual sentence pair is randomly assigned to a cluster.
3. For each cluster, language models for e and f are created using the bilingual sentence pairs that belong to the cluster.
4. For each cluster, the entropy for e and f is calculated by applying the language models from the previous step to the sentences in the cluster. The total entropy is defined as the total sum of entropy (for both source and target) for each cluster.
5. Each bilingual sentence pair is re-assigned to a cluster such that the assignment minimizes the total entropy.
6. The process is repeated from step (3) until the entropy reduction is smaller than a given threshold.
Figure 1: Outline of the Proposed Method. [Figure: off-line process (bilingual corpus, monolingual corpus, bilingual clusters, source and target language models) and on-line process (decoding, translation result).]
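The clustering steps listed above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes add-one smoothed unigram models (Section 6.1.2 reports unigram models for the clustering, but the smoothing is an assumption), it re-assigns all sentence pairs in one batch per iteration instead of searching for the exact minimum of the total entropy, and the names `cluster_bilingual`, `unigram_logprob`, `cost` and the `init` argument are hypothetical.

```python
import math
import random
from collections import Counter


def unigram_logprob(sentences, vocab_size):
    """Train an add-one smoothed unigram model; return a word -> log P(word) function."""
    counts = Counter(w for s in sentences for w in s)
    denom = sum(counts.values()) + vocab_size
    return lambda w: math.log((counts[w] + 1) / denom)


def cost(sentence, logprob):
    """Negative log-likelihood of a sentence, used as its entropy contribution."""
    return -sum(logprob(w) for w in sentence)


def cluster_bilingual(pairs, k, threshold=1e-3, max_iter=50, seed=0, init=None):
    """pairs: list of (source_tokens, target_tokens); k: number of clusters (step 1)."""
    src_vocab = len({w for f, _ in pairs for w in f})
    tgt_vocab = len({w for _, e in pairs for w in e})
    rng = random.Random(seed)
    # Step 2: random initial assignment (or a caller-supplied one, e.g. for testing).
    assign = list(init) if init is not None else [rng.randrange(k) for _ in pairs]
    prev_total = float("inf")
    for _ in range(max_iter):
        # Step 3: source and target language models for each cluster.
        models = []
        for c in range(k):
            members = [p for p, a in zip(pairs, assign) if a == c]
            models.append((unigram_logprob([f for f, _ in members], src_vocab),
                           unigram_logprob([e for _, e in members], tgt_vocab)))
        # Steps 4-5: score each pair under every cluster, move it to the cheapest one.
        total = 0.0
        for i, (f, e) in enumerate(pairs):
            costs = [cost(f, mf) + cost(e, me) for mf, me in models]
            assign[i] = min(range(k), key=costs.__getitem__)
            total += costs[assign[i]]
        # Step 6: stop once the entropy reduction falls below the threshold.
        if prev_total - total < threshold:
            break
        prev_total = total
    return assign
```

Because both the source and the target side contribute to the cost of a pair, a pair moves only when the combined source and target entropy is lower in another cluster, which is what makes the resulting clusters bilingual.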
4.3 Clustering the monolingual corpus
Any additional monolingual corpora used to train the language model are also clustered.
For this clustering, the following process is used.
First, bilingual clusters are created using the above process.
For each monolingual sentence, its entropy is calculated using all the bilingual cluster dependent language models and also the general language model (see Figure 1 for a description of the general language model).
If the entropy of the general language model is the lowest, this sentence is not used in the cluster dependent language models.
Otherwise, the monolingual sentence is added to the bilingual cluster that results in the lowest entropy.
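The assignment rule for monolingual sentences can be sketched as below. The sketch assumes that each language model is available as a function returning the entropy of a sentence; the function name `assign_monolingual` and its arguments are hypothetical and are not taken from the paper.

```python
def assign_monolingual(sentences, cluster_entropy_fns, general_entropy_fn):
    """Distribute monolingual sentences over existing bilingual clusters.

    cluster_entropy_fns: one callable per cluster, sentence -> entropy under
    that cluster's target language model (assumed interface).
    general_entropy_fn: sentence -> entropy under the general language model.
    """
    clusters = [[] for _ in cluster_entropy_fns]
    for sentence in sentences:
        entropies = [h(sentence) for h in cluster_entropy_fns]
        best = min(range(len(entropies)), key=entropies.__getitem__)
        # If the general model has the lowest entropy, the sentence is left
        # out of every cluster dependent language model.
        if general_entropy_fn(sentence) < entropies[best]:
            continue
        clusters[best].append(sentence)
    return clusters
```

Sentences that the general model explains best therefore contribute only to the general language model, which keeps out-of-domain text from diluting the cluster models.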
4.4 Domain prediction
The previous sections described how clusters are created, and we define our domains in terms of these clusters.
In this step, domain D is predicted using the given source sentence f.
This prediction is equivalent to finding the D that maximizes P(D|f).
P(D|f) can be rewritten as P(f|D)P(D)/P(f) using Bayes' law.
Here, P(f) is a constant, and if P(D) is assumed to be constant (this approximation is also used in the clustering of the bilingual corpus), maximizing the target reduces to maximizing P(f|D).
To maximize P(f|D), we simply select the cluster D that gives the highest likelihood of the given source sentence f.
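A minimal sketch of this selection rule follows. It assumes add-one smoothed unigram models as a stand-in for the trigram models used in the experiments, and it includes the fallback to the general model described in Section 6.1.2; the class `UnigramLM` and the function `predict_domain` are hypothetical names.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram model (a stand-in for the paper's trigram models)."""

    def __init__(self, sentences, vocab):
        self.counts = Counter(w for s in sentences for w in s)
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # +1 reserves mass for unseen words

    def logprob(self, sentence):
        """log P(sentence) under the model, i.e. log P(f|D) for a cluster model."""
        return sum(math.log((self.counts[w] + 1) / (self.total + self.vocab_size))
                   for w in sentence)


def predict_domain(sentence, cluster_lms, general_lm=None):
    """Return the index of the cluster with the highest log P(f|D).

    Returns None when a general model is given and scores the sentence higher
    than every cluster model, in which case no domain specific model is used.
    """
    scores = [lm.logprob(sentence) for lm in cluster_lms]
    best = max(range(len(scores)), key=scores.__getitem__)
    if general_lm is not None and general_lm.logprob(sentence) > scores[best]:
        return None
    return best
```

Comparing log-likelihoods of the same sentence across models is equivalent to comparing per-word perplexities, since the sentence length is the same for every model.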
5 Domain specific decoding
After domain prediction, domain specific decoding is conducted to maximize P(e|f, D).
P(e|f, D) can be rewritten as the following equation using Bayes' law.
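Applying Bayes' law with D held fixed gives the following reconstruction of equation (10). Its denominator does not depend on e, which is why only the numerator matters in decoding:

```latex
\begin{equation}
P(e \mid f, D) = \frac{P(f \mid e, D)\, P(e \mid D)}{P(f \mid D)} \tag{10}
\end{equation}
```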
Here, f is a given constant and D has already been selected by the domain prediction process.
Therefore, maximizing P(f| e, D)P(e| D) is equivalent to maximizing the above equation.
In P(f|e, D)P(e|D), P(f|e, D) is the domain specific translation model and P(e|D) is the domain specific language model.
Equation (10) represents the whole process of translating f into e using the domain D specific models P(f|e, D) and P(e|D).
5.1 Differences from previous methods
5.1.1 Cluster language model
Hasan et al. (2005) proposed a cluster language model for finding the domain D. This method has three steps.
In the first step, the translation target language corpus is clustered using human-defined regular expressions.
In the second step, a regular expression is created from the source sentence f. In the last step, the cluster that corresponds to the extracted regular expression is selected, and the cluster specific language model built from the data in this cluster is used for the translation.
The points of difference are:
• In the cluster language model, clusters are defined by human-defined regular expressions.
On the other hand, with the proposed method, clusters are defined and created automatically (without human knowledge) by the entropy reduction based method.
• In the cluster language model, only the translation target language corpus is clustered.
In the proposed method, both the translation source and target language corpora are clustered (bilingual clusters).
• In the cluster language model, only a domain (cluster) specific language model is used.
In the proposed method, both a domain specific language model and a domain specific translation model are used.
5.1.2 Sentence mixture language model
When P(f|e) is used instead of the domain specific translation model P(f|e, D), this equation represents the process of translation using sentence mixture language models (Iyer et al., 1993) as follows:
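A reconstruction of the sentence mixture form is given below. It assumes that the mixture weights are written as λ_D and that the general translation model is P(f|e); both are notational assumptions:

```latex
\begin{equation}
\hat{e} = \operatorname*{argmax}_{e} P(f \mid e) \sum_{D} \lambda_{D}\, P(e \mid D)
\end{equation}
```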
The points that differ from the proposed method are as follows:
• In the sentence mixture model, the mixture weight parameters λ_D are constant.
On the other hand, in the proposed method, the weight parameters P(D|f) are estimated separately for each sentence.
• In the sentence mixture model, the probabilities of all cluster dependent language models are summed.
In the proposed model, only the cluster that gives the highest probability is considered, as an approximation.
• In the proposed method, a domain specific translation model is also used.
6 Experiments
6.1 Japanese to English translation
To evaluate the proposed model, we conducted experiments based on a travel conversation task corpus.
The experimental corpus was the travel arrangements task of the BTEC corpus (Takezawa et al., 2002; Kikui et al., 2003), and the language pair was Japanese and English.
The training, development, and evaluation corpora are shown in Table 1.
The development and evaluation corpora each had sixteen reference translations for each sentence.
This training corpus was also used for the IWSLT06 Evaluation Campaign on Spoken Language Translation (Paul 2006) J-E open track, and the evaluation corpus was used as the IWSLT05 evaluation set.
6.1.2 Experimental conditions
For bilingual corpus clustering, the sentence entropy must be calculated.
Unigram language models were used for this calculation.
The translation models were phrase-based (Zens et al., 2002) and were created using the GIZA++ toolkit (Och et al., 2003).
The language models for the domain prediction and translation decoding were word trigram models with Good-Turing backoff (Katz 1987).
Table 1: Japanese to English experimental corpus (rows: Japanese Training, Japanese Development, English Development, Japanese Evaluation)
Ten cluster specific source language models and a general language model were used for the domain prediction.
If the general language model provided the lowest perplexity for an input sentence, the domain specific models were not used for this sentence.
The SRI language modeling toolkit (Stolcke) was used for the creation of all language models.
The PHARAOH phrase-based decoder (Koehn 2004) was used for the translation decoding.
To tune the decoder's parameters, including the language model weight, minimum error rate training (Och 2003) with respect to the BLEU score was conducted using the development corpus.
These parameters were used for the baseline conditions.
During translation decoding, the domain specific language model was used as an additional feature in the log-linear combination according to the PHARAOH decoder's option.
That is, the general and domain specific language models are combined by log-linear rather than linear interpolation.
The weight parameters for the general and domain specific language models were manually tuned using the development corpus.
The sum of these language model weights was equal to the language model weight in the baseline.
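The difference between the two ways of combining the language models can be illustrated with a short sketch. It is not the PHARAOH implementation: the function names are hypothetical, and the default share of 0.6 for the general model simply mirrors the 6:4 ratio reported later in this section.

```python
import math


def loglinear_lm_score(logp_general, logp_domain, baseline_weight, general_share=0.6):
    """Log-linear combination: each model is a separate weighted log-probability
    feature, and the two weights sum to the baseline language model weight."""
    w_general = baseline_weight * general_share
    w_domain = baseline_weight * (1.0 - general_share)
    return w_general * logp_general + w_domain * logp_domain


def linear_lm_score(logp_general, logp_domain, baseline_weight, general_share=0.6):
    """For contrast: linear interpolation mixes the probabilities first and then
    contributes a single weighted log-probability feature."""
    p = (general_share * math.exp(logp_general)
         + (1.0 - general_share) * math.exp(logp_domain))
    return baseline_weight * math.log(p)
```

In the log-linear form a low probability under either model pulls the score down, whereas in the linear form a high probability under one model can compensate for a low probability under the other.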
For the translation model, the general translation model (phrase table) and the domain specific translation model were linearly combined.
The interpolation parameter was again manually tuned using the development corpus.
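A linear combination of two phrase tables can be sketched as follows. The sketch assumes that each table maps a (source phrase, target phrase) pair to one probability and that a pair missing from a table has probability zero there; the paper does not specify how missing entries were handled. The function name is hypothetical, and the default weight of 0.3 mirrors the 3:7 ratio reported later in this section.

```python
def interpolate_phrase_tables(general, domain, general_weight=0.3):
    """Linearly interpolate two phrase tables.

    general, domain: dicts mapping (source_phrase, target_phrase) -> probability.
    A pair absent from one table is treated as having probability 0.0 in that
    table (an assumption made for this sketch).
    """
    merged = {}
    for pair in general.keys() | domain.keys():
        merged[pair] = (general_weight * general.get(pair, 0.0)
                        + (1.0 - general_weight) * domain.get(pair, 0.0))
    return merged
```

Keeping the general table in the mixture means that phrase pairs unseen in a small cluster still receive a non-zero probability, which addresses the data sparseness of the cluster specific tables.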
In our bilingual clustering, the number of clusters must be fixed in advance.
Based on the results of preliminary experiments to estimate model order, ten clusters were used.
If fewer than ten clusters were used, domain specific characteristics could not be represented.
If more than ten clusters were used, data sparseness problems became severe, especially in the translation models.
The number of sentences in each cluster does not differ greatly; therefore, the approximation that P(D) is constant is reasonable.
Two samples of bilingual clusters are recorded in the appendix "Sample of Cluster".
The cluster A.1 includes many interrogative sentences.
The reason is that special words are used at the end of the Japanese sentence, with no corresponding word used in English.
The cluster A.2 includes numeric expressions in both English and Japanese.
Next, we confirm the reasonableness of the assumption used in equation (5).
For this confirmation, we calculate P(D|f) for all D for each f (P(D) is approximated as constant).
For almost all f, only one domain D_i has a very large value compared with the other domains.
Therefore, this approximation is confirmed to be reasonable.
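The check described above amounts to normalising the per-domain likelihoods of a sentence. A minimal sketch, assuming the log-likelihoods log P(f|D) are already available and that P(D) is constant (the function name is hypothetical):

```python
import math


def domain_posteriors(log_likelihoods):
    """Turn log P(f|D) for every domain into P(D|f), assuming a constant P(D).

    The maximum is subtracted before exponentiating to avoid numerical underflow.
    """
    m = max(log_likelihoods)
    exps = [math.exp(x - m) for x in log_likelihoods]
    z = sum(exps)
    return [x / z for x in exps]
```

Even modest gaps in log-likelihood produce a sharply peaked posterior: for log-likelihoods of -10, -20 and -30, the first domain receives more than 99.9% of the probability mass.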
In these experiments, we compare three ways of deploying our domain specific models against a baseline.
In the first method, only the domain specific language model was used.
The ratio of the weight parameter for the general model to that for the domain specific model was 6:4 for all the domain specific language models.
In the second method, only the domain specific translation model was used.
The ratio of the interpolation parameter of the general model to that of the domain specific model was 3:7 for all the domain specific models.
In the last method, both the domain specific language and translation models (LM+TM) were used.
The weights and interpolation parameters were the same as in the first and second methods.
The experimental results are shown in Table 2.
Under all of the conditions and for all of the evaluation measures, the proposed domain specific models gave better performance than the baseline.
The highest performance came from the system that used both the domain specific language and translation models, resulting in a 2.7 point BLEU score gain over the baseline.
This is a substantial improvement.
Appendix "Sample of Different Translation Results" records samples of different translation results with and without the domain specific language and translation models.
In many cases, better word order is obtained with the domain specific models.
6.2 Translation of ASR output
In this experiment, the source sentence used as input to the machine translation system was the direct textual output from an automatic speech recognition (ASR) decoder that was a component of a speech-to-speech translation system.
The input to our system therefore contained the kinds of recognition errors and disfluencies typically found in ASR output.
This experiment serves to determine the robustness of the domain prediction to real-world speech input.
The speech recognition process in this experiment had a word accuracy of 88.4% and a sentence accuracy of 67.2%.
The results shown in Table 3 clearly demonstrate that the proposed method is able to improve the translation performance, even when speech recognition errors are present in the input sentence.
6.3 Comparison with previous methods
In this section we compare the proposed method to other contemporary methods: the cluster language model (CLM) and the sentence mixture model (SMix).
The experimental results for these methods were reported by RWTH Aachen University in IWSLT06 (Mauser et al., 2006).
We evaluated our method using the same training and evaluation corpora.
These corpora were used as the training and development corpora in the IWSLT06 Chinese to English open track; the details are given in Table 4.
The English side of the training corpus was the same as that used in the earlier Japanese to English experiments reported in this paper.
Each sentence in the evaluation corpus had seven reference translations.
Our baseline performance was slightly different from that reported in the RWTH experiments (21.9 BLEU score for RWTH's system and 21.7 for our system).
Therefore, their improved baseline is shown for comparison.
The results are shown in Table 5.
The improvements of our method over the baseline in both BLEU and NIST (Doddington 2002) score were greater than those for both CLM and SMix.
In particular, our method showed improvement in both the BLEU and NIST scores, in contrast to the CLM and SMix methods, which both degraded the translation performance in terms of the NIST score.
Table 5: Comparison results with previous methods
6.4 Clustering of the monolingual corpus
Finally, we evaluated the proposed method when an additional monolingual corpus was incorporated.
For this experiment, we used the Chinese and English bilingual corpora that were used in the NIST MT06 evaluation (NIST 2006).
The size of the bilingual training corpus was 2.9M sentence pairs.
For the language model training, an additional monolingual corpus of 1.5M English sentences was used.
The NIST 2006 development set (the evaluation set for NIST 2005) was used for evaluation.
In this experiment, the test set language model perplexity of a model built on only the monolingual corpus was considerably lower than that of a model built from only the target language sentences from the bilingual corpus.
Therefore, we would expect the use of this monolingual corpus to be an important factor affecting the quality of the translation system.
These perplexities were 299.9 for the model built on only the bilingual corpus, 200.1 for the model built on only the monolingual corpus, and 192.5 for the model built on a combination of the bilingual and monolingual corpora.
For the domain specific models, 50 clusters were created from the bilingual and monolingual corpora.
In this experiment, only the domain specific language model was used.
The experimental results are shown in Table 6.
The results in the table show that the incorporation of the additional monolingual data has a pronounced beneficial effect on performance: the performance improved according to all of the evaluation measures.
Table 2: Japanese to English translation evaluation scores
Table 3: Evaluation using ASR output
(Conditions listed in the tables: Domain Specific LM, Domain Specific TM, Domain Specific LM+TM)
7 Conclusion
We have proposed a technique that utilizes domain specific models based on bilingual clustering for statistical machine translation.
It is well known that domain specific modeling can result in better performance.
However, in many cases, the target domain is not known or can change dynamically.
In such cases, domain determination and domain specific translation must be performed simultaneously during the translation process.
In the proposed method, a bilingual corpus was clustered using an entropy reduction based method.
The resulting bilingual clusters are regarded as domains.
Domain specific language and translation models are created from the data within each bilingual cluster.
When a source sentence is to be translated, its domain is first predicted.
The domain prediction method selects the cluster that assigns the lowest language model perplexity to the given source sentence.
Translation then proceeds using a language model and translation model that are specific to the domain predicted for the source sentence.
In our experiments we used a corpus from the travel domain (the subset of the BTEC corpus that was used in IWSLT06).
Our experimental results clearly demonstrate the effectiveness of our method.
In the Japanese to English translation experiments, the use of our proposed method improved the BLEU score by 2.7 points (from 52.4 to 55.1).
We compared our approach to two previous methods, the cluster language model and the sentence mixture model.
In our experiments the proposed method yielded higher scores than either of the competing methods in terms of both BLEU and NIST.
Moreover, our method can also be extended when an additional monolingual corpus is available for building the language model.
Using this approach we were able to further improve translation performance on the data from the NIST MT06 evaluation task.
shinshoku wa dore desu ka)
• E: are there any baseball games today
yakyu no shiai wa ari masu ka)
• E: where's the nearest perfumery
no kousui ten wa doko desu ka)
J: (choshoku wa ikura
Table 4: Training and evaluation corpora used for comparison with previous methods (columns: # of sentences, Total words, Vocabulary size; rows: English Training, Chinese Training, Chinese Evaluation)
Table 6: Experimental results with monolingual corpus (rows: Baseline, Proposed)
• E: i'd like extension twenty four please
(furaitonanba
• E: delta airlines flight one one two boarding is delayed
B Sample of Different Translation Results
Ref: where is a police station where japanese is understood
Base: japanese where's the police station
LM: japanese where's the police station
TM: where's the police station where someone understands japanese
LM+TM: where's the police station where someone understands japanese
