The reordering model is important for statistical machine translation (SMT).
Current phrase-based SMT technologies are good at capturing local reordering but not global reordering.
This paper introduces syntactic knowledge to improve the global reordering capability of an SMT system.
Syntactic knowledge such as boundary words, POS information and dependencies is used to guide phrase reordering.
Constraints derived from the syntax tree are proposed to avoid reordering errors, and the syntax tree is further modified to strengthen the capability of capturing phrase reordering.
Furthermore, the combination of parse trees can compensate for the reordering errors caused by a single parse tree.
Finally, experimental results show that the performance of our system is superior to that of a state-of-the-art phrase-based SMT system.
1 Introduction
In the last decade, statistical machine translation (SMT) has been widely studied and achieved good translation results.
Two kinds of SMT systems have been developed: phrase-based SMT and syntax-based SMT.
In phrase-based SMT systems (Koehn et al., 2003; Koehn, 2004), foreign sentences are first segmented into phrases consisting of adjacent words.
Then source phrases are translated into target phrases according to knowledge usually learned from a bilingual parallel corpus.
Finally, the most likely target sentence under a given statistical model is inferred by combining and reordering the target phrases with the aid of a search algorithm.
On the other hand, syntax-based SMT systems (Liu et al., 2006; Yamada et al., 2001) mainly depend on parse trees to complete the translation of source sentence.
Figure 1: A reordering example
As studied in previous SMT projects, language model, translation model and reordering model are the three major components in current SMT systems.
Due to the difference between the source and target languages, the order of target phrases in the target sentence may differ from the order of source phrases in the source sentence.
To bring the translation results closer to the target-language order, a statistical model is constructed to reorder the target phrases.
This statistical model is called the reordering model.
As shown in Figure 1, the order of the translations of "欧元" and "的" is changed.
The order of the
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 533-540, Prague, June 2007.
©2007 Association for Computational Linguistics
translations of "欧元/的" and "显著/升值" is altered as well.
The former case, with the smaller reordering distance, is usually referred to as local reordering, and the latter, with the longer distance, as global reordering.
Phrase-based SMT systems can effectively capture local word reordering that is common enough to be observed in training data, but they have difficulty modeling global phrase reordering.
Although syntactic knowledge used in syntax-based SMT systems can help reorder phrases, the resulting model is usually much more complicated than a phrase-based system.
There has been a considerable amount of effort to improve the reordering model in SMT systems, ranging from the fundamental distance-based distortion model (Och and Ney, 2004; Koehn et al., 2003) and the flat reordering model (Wu, 1996; Zens et al., 2004; Kumar et al., 2005) to the lexicalized reordering model (Tillmann, 2004; Kumar et al., 2005; Koehn et al., 2005), the hierarchical phrase-based model (Chiang, 2005), and the maximum entropy-based phrase reordering model (Xiong et al., 2006).
Due to the absence of syntactic knowledge, however, the ability of these systems to capture global reordering is limited.
Although syntax-based SMT systems (Yamada et al., 2001; Quirk et al., 2005; Liu et al., 2006) are good at modeling global reordering, their performance is subject to parsing errors to a large extent.
In this paper, we propose a new method to improve reordering model by introducing syntactic information.
Syntactic knowledge such as boundary of sub-trees, part-of-speech (POS) and dependency relation is incorporated into the SMT system to strengthen the ability to handle global phrase reordering.
Our method is different from previous syntax-based SMT systems in which the translation process was modeled based on specific syntactic structures, either phrase structures or dependency relations.
In our system, syntactic knowledge is used just to decide where we should combine adjacent phrases and what their reordering probability is.
For example, according to the syntactic information in Figure 1, the phrase translation combination should take place between "显著" and "升值" rather than between "的" and "显著".
Moreover, the non-monotone phrase reordering should occur between "欧元/的" and "显著/升值" rather than between "欧元/的" and "显著".
We train a maximum entropy model, which is able to integrate rich syntactic knowledge, to estimate phrase reordering probabilities.
To enhance the performance of the phrase reordering model, some modifications to the syntax trees are also made to relax the phrase reordering constraints.
Additionally, the combination of other kinds of syntax trees is introduced to overcome the deficiency of single parse tree.
The experimental results show that the performance of our system is superior to that of a state-of-the-art phrase-based SMT system.
The roadmap of this paper is as follows: Section 2 reviews related work.
Section 3 introduces our model.
Section 4 explains the generalization of reordering knowledge.
The procedures of training and decoding are described in Section 5 and Section 6 respectively.
The experimental results are shown in Section 7.
Section 8 concludes the paper.
2 Related Work
The Pharaoh system (Koehn et al., 2004) is well known as the typical phrase-based SMT system.
Its reordering model is designed to penalize translations according to jump distance, regardless of linguistic knowledge.
This method works well only for language pairs that tend to have similar word orders, and it does nothing for global reordering.
A straightforward reordering model used in (Wu, 1996; Zens et al., 2004; Kumar et al., 2005) is to assign constant probabilities to monotone reordering and non-monotone reordering, which can be flexible depending on the different language pairs.
This method is also adopted in our system for non-peer phrase reordering.
A lexicalized reordering model was proposed in (Tillmann, 2004; Kumar et al., 2005; Koehn et al., 2005).
Their work made a step forward in integrating linguistic knowledge to capture reordering, but their methods suffer from a serious data sparseness problem.
Beyond the standard phrase-based SMT system, a CKY-style decoder was developed in (Xiong et al., 2006).
Their method investigated the reordering of any two adjacent phrases.
The limited linguistic knowledge on the boundary words of phrases is used to construct the phrase reordering model.
The basic difference from our method is that no syntactic knowledge is introduced to guide global phrase reordering in their system.
Besides boundary
words, our phrase reordering model also integrates more significant syntactic knowledge such as POS information and dependencies from the syntax tree, which can avoid some intractable phrase reordering errors.
A hierarchical phrase-based model was proposed by (Chiang, 2005).
In his method, a synchronous CFG is used to reorganize the phrases into hierarchical ones and grammar rules are automatically learned from corpus.
Different from his work, foreign syntactic knowledge is introduced into the synchronous grammar rules in our method to restrict the arbitrary phrase reordering.
Syntax-based SMT systems (Yamada et al., 2001; Quirk et al., 2005; Liu et al., 2006) totally depend on syntax structures to complete phrase translation.
They can capture global reordering by simply swapping the child nodes of a parse tree.
However, there are also reordering cases which do not agree with syntactic structures.
Furthermore, their model is usually much more complex than a phrase-based system.
Our method attempts to integrate the advantages of phrase-based and syntax-based SMT systems to improve the phrase reordering model.
Phrase translation in our system is independent of syntactic structures.
3 The Model
In our work, we focus on building a better reordering model with the help of source parsing information.
Although we borrow some fundamental elements from a phrase-based SMT system such as the use of bilingual phrases as basic translation unit, we are more interested in introducing syntactic knowledge to strengthen the ability to handle global reordering phenomena in translation.
Given a foreign sentence f and its syntactic parse tree T, each leaf in T corresponds to a single word in f, and each sub-tree of T exactly covers a phrase fi in f, which is called a linguistic phrase.
Any other phrase is regarded as a non-linguistic phrase.
The height of a phrase fi is defined as the distance between the root node of T and the root node of the maximum sub-tree that exactly covers fi.
For example, in Figure 1 one phrase has its maximum sub-tree rooted at ADJP, so its height is 3; the height of another phrase is 4, since its maximum sub-tree is rooted at ADVP instead of AD.
If two adjacent phrases have the same height, we regard them as peer phrases.
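The height and peer-phrase definitions above can be sketched in code. This is a minimal illustration assuming a parse tree stored as nested tuples `(label, child, ...)` with words as string leaves; all function names and the toy tree (English glosses stand in for the Chinese words of Figure 1) are our own, not the paper's.

```python
# Height of a phrase = depth of the shallowest (maximum) sub-tree that
# exactly covers it; adjacent phrases with equal height are "peers".

def spans(tree):
    """Return (start, end, depth) for every sub-tree; end is exclusive."""
    out = []
    def walk(node, pos, depth):
        if isinstance(node, str):          # a leaf is a single word
            return pos + 1
        start = pos
        for child in node[1:]:
            pos = walk(child, pos, depth + 1)
        out.append((start, pos, depth))
        return pos
    walk(tree, 0, 0)
    return out

def height(tree, i, j):
    """Height of phrase [i, j), or None if it is a non-linguistic phrase."""
    depths = [d for (s, e, d) in spans(tree) if (s, e) == (i, j)]
    return min(depths) if depths else None

def peer(tree, i, k, j):
    """Adjacent phrases [i, k) and [k, j) are peers iff they share a height."""
    hi, hj = height(tree, i, k), height(tree, k, j)
    return hi is not None and hi == hj

# Toy tree loosely mirroring Figure 1 (labels/words illustrative only):
tree = ("NP",
        ("DNP", ("NP", "Euro"), ("DEG", "de")),
        ("ADJP", ("JJ", "significant")),
        ("NP", ("NN", "appreciation")))
```

Here `height(tree, 0, 3)` is `None`, marking "Euro de significant" as a non-linguistic phrase, while the spans covered by DNP and ADJP are peers.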
In our model, we make use of bilingual phrases as well, which refer to source-target aligned phrase pairs extracted using the same criterion as most phrase-based systems (Och and Ney, 2004).
Similar to the work in Chiang (2005), our translation model can be formulated as a weighted synchronous context free grammar derivation process.
Let D be a derivation that generates a bilingual sentence pair (f, e), in which f is the given source sentence. The statistical model used to predict the translation probability p(e|f) is defined over derivations D as follows:

    p(e|f) ∝ p_lm(e)^λ_lm · ∏_i φ_i(X → ⟨γ, α⟩)^λ_i

where p_lm(e) is the language model, φ_i(X → ⟨γ, α⟩) is a feature function defined over the derivation rule X → ⟨γ, α⟩, and λ_i is its weight.
Although it is theoretically ideal to model translation reordering by constructing a synchronous context-free grammar based on bilingual linguistic parse trees, doing so is generally very difficult in practice.
In this work we propose to use a small synchronous grammar, constructed on the basis of bilingual phrases, to model translation reordering probabilities and constraints by referring to the source syntactic parse trees.
In the grammar, the source and target words serve as terminals, while bilingual phrases and combinations of bilingual phrases are represented by non-terminals.
There are two non-terminals in the grammar besides the start symbol S: Y and Z. The general derivation rules are defined as follows:
a) Derivations from a non-terminal to non-terminals are restricted to binary branching forms;
b) Any non-terminal that derives a list of terminals, or a combination of two non-terminals, is reduced to Y if the resulting source string does not cause any cross-bracketing problem in the source parse tree (i.e., it exactly corresponds to a linguistic phrase in a binary parse tree);
c) Otherwise, it is reduced to Z.
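The Y/Z reduction in rules (b) and (c) hinges on a cross-bracketing test against the source parse tree. The following is a minimal sketch; representing the tree as a flat list of `(start, end)` bracket spans is our own convention, not the paper's.

```python
# A combined source span reduces to Y when it does not cross any bracket
# of the source parse tree; otherwise it reduces to Z.

def crosses(span, brackets):
    """True iff span (i, j) cross-brackets some (s, e) in the tree."""
    i, j = span
    return any(s < i < e < j or i < s < j < e for (s, e) in brackets)

def reduce_label(span, brackets):
    return "Z" if crosses(span, brackets) else "Y"

# Brackets of a toy binary parse over 4 words: [0,4), [0,2), [2,4)
brackets = [(0, 4), (0, 2), (2, 4)]
```

For this toy tree, span `(0, 2)` reduces to Y while span `(1, 3)` crosses the `[0, 2)` bracket and reduces to Z.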
Table 1 shows the complete list of derivation rules in our synchronous context-free grammar.
The first nine grammar rules are used to constrain phrase reordering during phrase combination.
The last two rules are used to represent bilingual phrases.
Rule (10) is the start grammar rule to generate the entire sentence translation.
Table 1: Synchronous grammar rules
Rule (1) and Rule (2) are only applied to two adjacent peer phrases.
Note that, according to the constraints of foreign syntactic structures, only Rule (2) among all rules in Table 1 can be applied to conduct non-monotone phrase reordering in our framework.
This can avoid arbitrary phrase reordering.
For example, as shown in Figure 1, Rule (1) is applied to the monotone combination of the phrases "欧元" and "的", and Rule (2) is applied to the non-monotone combination of the phrases "欧元/的" and "显著/升值".
However, the non-monotone combination of "的" and "显著" is not allowed in our method, since there is no proper rule for it.
Non-linguistic phrases are involved in Rules (3)~(9).
Under these rules we do not allow the non-monotone combination of non-peer phrases, which, as our experiments show, really harms the translation results.
Although these rules violate the syntactic constraints, they not only provide the option to leverage non-linguistic translation knowledge to avoid syntactic errors, but also take advantage of local phrase reordering capabilities.
Rule (3) and Rule (8) are applied to the combination of two adjacent non-linguistic phrases.
Rule (4)~(7) deal with the situation where one is a linguistic phrase and the other is a non-linguistic phrase.
Rule (9) is applied to the combination of two adjacent linguistic phrases but their combination result is not a linguistic phrase.
Rule (11) and Rule (12) are applied to generate bilingual phrases learned from training corpus.
Table 2 demonstrates an example of how these rules are applied to translate the foreign sentence "欧元/的/显著/升值" into the English sentence "the significant appreciation of the Euro".
Table 2: Example of rule application
However, there are always other kinds of bilingual phrases extracted directly from the training corpus, such as <欧元, the Euro> and <的 显著 升值, 's significant appreciation>, which can produce different candidate sentence translations.
Here, the phrase "的 显著 升值" is a non-linguistic phrase.
The above derivation can also be rewritten as <Y1, Y1> → <Y2 Z3, Y2 Z3> → <欧元 Z3, the Euro Z3> → <欧元 的 显著 升值, the Euro 's significant appreciation>, where Rules (10), (4), (12) and (11) are applied respectively.
Similar to the default features in Pharaoh (Koehn, Och and Marcu, 2003), we used the following features to estimate the weights of our grammar rules.
Note
that different rules may have different features in our model.
• The lexical weights p_lex(γ|α) and p_lex(α|γ), estimating how well the words in α translate the words in γ. This feature is only applicable to Rule (11) and Rule (12).
• The phrase translation weights p_phr(γ|α) and p_phr(α|γ), estimating how well the terminal words of α translate the terminal words of γ. This feature is only applicable to Rule (11) and Rule (12).
• A word penalty exp(|α|), where |α| denotes the count of terminal words of α. This feature is only applicable to Rule (11) and Rule (12).
• A rule penalty exp(1), analogous to Pharaoh's phrase penalty, which allows the model to learn a preference for longer or shorter derivations. This feature is applicable to all rules in Table 1.
• The score for applying the current rule. This feature is applicable to all rules in Table 1.
We will explain the score estimation in detail in Section 3.4.
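The features above can be combined Pharaoh-style into a log-linear rule weight. The sketch below shows only the combination step; the feature names and weight values are invented for illustration, not taken from the paper.

```python
import math

# Log-linear combination: weight(rule) = prod_i feature_i ** lambda_i.
def rule_weight(features, lambdas):
    """features: name -> value; lambdas: name -> exponent weight."""
    return math.prod(v ** lambdas[k] for k, v in features.items())

# Hypothetical feature values and weights for one rule application:
feats = {"p_phrase": 0.5, "rule_score": 0.7}
lams = {"p_phrase": 1.0, "rule_score": 2.0}
```

With these toy numbers the rule weight is 0.5 * 0.7^2 = 0.245.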
Based on the syntactic constraints and the non-terminal types involved, we separate the grammar rules into three groups to estimate their application scores, which are also treated as reordering probabilities.
Rule (1) and Rule (2) strictly comply with the syntactic structures.
Given two adjacent peer phrases, there are two competing choices: monotone or non-monotone combination.
Thus, we use a maximum entropy (ME) model to estimate the two reordering probabilities, where the boundary words of the foreign phrases and of the candidate target translation phrases, POS information and dependencies are integrated as features.
As listed in Table 3, twelve categories of features in total are used to train the ME model.
In fact, the probability of Rule (1) is just the complement of the probability of Rule (2), and vice versa.
For Rules (3)~(9), according to the syntactic structures, their application is deterministic, since there is only one way to complete the reordering; this is similar to the "glue rules" in Chiang (2005).
Due to the appearance of non-linguistic phrases, non-monotone phrase reordering is not allowed in these rules.
We just assign these rules a constant score trained using our implementation of
Minimum Error Rate Training (Och, 2003b), which is 0.7 in our system.
For Rules (10)~(12), their application is likewise deterministic, since no other rules compete with them.
A constant score is simply assigned to them as well, which is 1.0 in our system.
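The three-way scoring scheme can be summarized as follows. The ME-estimated probability is passed in as a precomputed value, and the function and constant names are our own illustrative conventions; only the constants 0.7 and 1.0 come from the paper.

```python
# Rule application scores per Section 3.4: ME probabilities for rules
# (1)-(2), a MERT-tuned constant for rules (3)-(9), and 1.0 otherwise.

GLUE_SCORE = 0.7  # tuned with minimum error rate training (Och, 2003b)

def rule_score(rule_id, me_prob=None):
    if rule_id in (1, 2):
        # The two rules compete, so p(rule 1) + p(rule 2) = 1 and a
        # single ME probability suffices for both.
        assert me_prob is not None
        return me_prob if rule_id == 1 else 1.0 - me_prob
    if 3 <= rule_id <= 9:
        return GLUE_SCORE          # deterministic "glue-like" rules
    return 1.0                     # rules (10)-(12): no competitors
```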
Feature categories:
1. First word of the first foreign phrase
2. First word of the second foreign phrase
3. Last word of the first foreign phrase
4. Last word of the second foreign phrase
5. First word of the first target phrase
6. First word of the second target phrase
7. Last word of the first target phrase
8. Last word of the second target phrase
9. POS of the node covering the first foreign phrase
10. POS of the node covering the second foreign phrase
11. POS of the node covering the combination of the foreign phrases
12. Dependency between the nodes covering the two foreign phrases
Table 3: Feature categories used for the ME model
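A sketch of assembling the twelve feature categories for one training event follows; the tuple-based inputs and the feature-name strings are our own conventions, not the paper's.

```python
# Build the twelve ME features for a pair of adjacent phrases:
# boundary words of both foreign and target phrases, node POS tags,
# and the dependency between the two covering nodes.

def me_features(f1, f2, e1, e2, pos1, pos2, pos_comb, dep):
    """f1/f2: foreign phrases, e1/e2: candidate target phrases (word
    lists); pos*: covering node labels; dep: dependency between nodes."""
    return {
        "f1_first": f1[0],  "f2_first": f2[0],
        "f1_last":  f1[-1], "f2_last":  f2[-1],
        "e1_first": e1[0],  "e2_first": e2[0],
        "e1_last":  e1[-1], "e2_last":  e2[-1],
        "pos1": pos1, "pos2": pos2, "pos_comb": pos_comb,
        "dep": dep,
    }
```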
4 The Generalization of Reordering Knowledge
4.1 Enriching Parse Trees
In an n-ary sub-tree such as the one in Figure 2(a), Rule (2) cannot be applied even though its component phrases are peer phrases.
To avoid this conflict with Rule (2), we add extra virtual nodes to the n-ary sub-trees to ensure that only binary sub-trees survive in the modified parse tree.
Figure 2(b) is the modification result of the syntactic tree in Figure 2(a), where two virtual nodes with the new distinguishable POS tag M are added.
In general, we add virtual nodes for each set of continuous peer phrases and let them have the same height.
Thus, for an n-ary sub-tree, n-2 virtual nodes are added, where n > 2.
The phrases exactly covered by the virtual nodes are called virtual peer phrases.
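The tree modification can be sketched as a binarization that inserts virtual nodes tagged M. A left-branching fold is assumed here for illustration; the paper's exact bracketing may differ, and the tuple representation is our own.

```python
# Binarize n-ary nodes (n > 2) by folding children under virtual "M"
# nodes, so that only binary sub-trees survive; n - 2 virtual nodes are
# inserted per n-ary node.

def binarize(node):
    if isinstance(node, str):              # leaves stay as-is
        return node
    label, children = node[0], [binarize(c) for c in node[1:]]
    while len(children) > 2:               # fold the leftmost pair
        children = [("M", children[0], children[1])] + children[2:]
    return (label,) + tuple(children)
```

For a 4-ary node, two virtual M nodes are inserted, matching the two virtual nodes added in Figure 2(b).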
Figure 2: Example of syntax tree modification

4.2 Combination of Parse Trees
It is well known that parse errors are inevitable even when a state-of-the-art parser is used.
Incorrect syntactic knowledge may harm the reordering probability estimation.
To minimize the impact of the parse errors of a single tree, more parse trees are introduced.
To support the combination of parse trees, the synchronous grammar rules derived from each tree are applied independently, but the resulting derivations compete against each other under the other models, such as the language model.
In our system, we combine the parse trees generated respectively by Stanford parser (Klein, 2003) and a dependency parser developed by (Zhou, 2000).
Compared with the Stanford parser, the dependency parser only conducts shallow syntactic analysis.
It is powerful to identify the base NPs and base VPs and their dependencies.
Additionally, the dependency parser runs much faster.
For example, it took about three minutes for the dependency parser to parse one thousand sentences with an average length of 25 words, whereas the Stanford parser needs about one hour to complete the same work.
More importantly, as shown in the experimental results, the dependency parser achieves final translation quality comparable to that of the Stanford parser in our system.
5 The Decoder
We developed a CKY style decoder to complete the sentence translation.
A two-dimensional array CA is constructed to store all local candidate phrase translations; each valid cell CAij in CA corresponds to a foreign phrase, where i is the phrase start position and j is the phrase end position.
The cells in CA are filled in a bottom-up way.
First we fill the smaller cells with the translations of bilingual phrases learned from the corpus.
Then the candidate translations in a larger cell CAij are generated from the contents of smaller adjacent cells CAik and CA(k+1)j by monotone and non-monotone combination, where i <= k < j. To reduce the cost of system resources, well-known pruning methods such as histogram pruning, threshold pruning and recombination are used to keep only the top N candidate translations in each cell.
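The chart-filling procedure can be sketched as follows. Grammar-rule constraints and the real model scores are omitted (scoring is stubbed to a constant), so this only illustrates the cell-combination and pruning skeleton; all names are our own.

```python
# CKY-style chart: CA[i, j] holds candidate translations of source span
# [i, j]; larger cells are built from adjacent smaller ones, monotone or
# swapped, keeping only the top N per cell (histogram pruning).

def decode(n, phrase_table, score=lambda e: 1.0, top_n=20):
    CA = {}
    for i in range(n):                        # length-1 spans
        CA[i, i] = list(phrase_table.get((i, i), []))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cands = list(phrase_table.get((i, j), []))
            for k in range(i, j):             # combine adjacent cells
                for a in CA.get((i, k), []):
                    for b in CA.get((k + 1, j), []):
                        cands.append(a + " " + b)   # monotone
                        cands.append(b + " " + a)   # non-monotone
            cands.sort(key=score, reverse=True)
            CA[i, j] = cands[:top_n]          # histogram pruning
    return CA.get((0, n - 1), [])
```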
6 Training
Similar to most state-of-the-art phrase-based SMT systems, we use the SRI toolkit (Stolcke, 2002) for language model training and Giza++ toolkit (Och and Ney, 2003) for word alignment.
For reordering model training, two kinds of parse trees for each foreign sentence in the training corpus were obtained through the Stanford parser (Klein, 2003) and a dependency parser (Zhou, 2000).
After that, we collected all the foreign linguistic phrases of each sentence according to the syntactic structures.
Based on the word alignment results, if the aligned target words of any two adjacent foreign linguistic phrases can also form two valid adjacent phrases according to the constraints proposed in the phrase extraction algorithm of Och (2003a), they are extracted as a reordering training sample.
Finally, the ME modeling toolkit developed by Zhang (2004) is used to train the reordering model over the extracted samples.
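The sample extraction step can be sketched as follows, with a simplified consistency check standing in for the full phrase-extraction criterion cited above; representing alignments as (source, target) index pairs is our own convention.

```python
# Label a pair of adjacent source linguistic phrases as a monotone or
# swapped reordering sample, according to their aligned target spans.

def target_span(align, i, j):
    """align: set of (src, tgt) links; target span covering src [i, j]."""
    tgt = [t for (s, t) in align if i <= s <= j]
    return (min(tgt), max(tgt)) if tgt else None

def sample_label(align, i, k, j):
    """Label for adjacent source phrases [i, k] and [k+1, j]."""
    a, b = target_span(align, i, k), target_span(align, k + 1, j)
    if a is None or b is None or not (a[1] < b[0] or b[1] < a[0]):
        return None            # unaligned or overlapping: no valid sample
    return "monotone" if a[1] < b[0] else "swap"
```

For the Figure 1 example, the target spans of the two halves swap, so the extracted sample is labeled non-monotone.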
7 Experimental Results and Analysis
We conducted our experiments on the Chinese-to-English translation task of NIST MT-05, on a 3.0GHz system with 4GB of RAM.
The bilingual training data comes from the FBIS corpus.
The Xinhua news in GIGAWORD corpus is used to train a four-gram language model.
The development set used in our system is the NIST MT-02 evaluation test data.
For phrase extraction, we limit the maximum length of foreign and English phrases to 3 and 5 respectively.
But there is no phrase length constraint for reordering sample extraction.
About 1.93M and 1.1M reordering samples are extracted from the FBIS corpus based on the Stanford parser and the dependency parser respectively.
To reduce the search space in decoder, we set the histogram pruning threshold to 20 and relative pruning threshold to 0.1.
In the following experiments, we compared our system performance with that of the other state-of-the-art systems.
Additionally, the effect of some strategies on system performance is investigated as well.
Case-sensitive BLEU-4 score is adopted to evaluate system performance.
7.1 Comparing with Baseline SMT system
Our baseline system is Pharaoh (Koehn, 2004).
Xiong's system (Xiong et al., 2006), which uses an ME model to train the reordering model, is also included as a competitor.
For a fair comparison, we used the same language model and translation model for all three systems.
The experimental results are shown in Table 4.
Table 4: Performance against baseline systems
These three systems are the same in that the final sentence translation results are generated by combining local phrase translations.
Thus, they are capable of local reordering but not global reordering.
The phrase reordering in Pharaoh depends only on distance distortion information, which does not contain any linguistic knowledge.
The experimental results show that both Xiong's system and our system outperform Pharaoh.
This proves that linguistic knowledge can help global reordering probability estimation.
Additionally, our system is superior to Xiong's system, which uses only phrase boundary words to guide global reordering.
This indicates that syntactic knowledge is more powerful than boundary words for guiding global reordering.
It also demonstrates the importance of syntactic constraints in avoiding arbitrary phrase reordering.
7.2 Syntactic Error Analysis
Rules (3)~(9) in Section 3 not only compensate for syntactic errors, but also exploit the capability of capturing local phrase reordering.
However, the non-monotone combination of non-peer phrases is really harmful to system performance.
To verify these claims, we conducted experiments with different constraints.
Table 5: About non-peer phrase combination
The experimental results shown in Table 5 confirm, as claimed in previous work, that the combination of non-linguistic phrases is useful and cannot be abandoned.
On the other hand, if we relax the constraint on non-peer phrase combination (that is, allow non-monotone combination of non-peer phrases), more serious non-syntactic errors are introduced, degrading performance from 0.2737 to 0.2647.
7.3 Effect of Virtual Peer Phrases
As discussed in Section 4, for n-ary nodes (n > 2) in the original syntax trees, the reordering relationship among the sub-trees is not clearly captured.
To give them the chance of free reordering, we add the virtual peer nodes to ensure that the combination of a set of peer phrases can still be a peer phrase.
An experiment was conducted for comparison, in which no virtual peer nodes were added to the n-ary syntax trees.
The BLEU score dropped from 27.37 to 26.20, which shows that the virtual nodes have a great effect on system performance.
7.4 Effect of Different Parse Trees
In this section, we conducted three experiments to investigate the effect of the constituency parse tree and the dependency parse tree.
On the same platform, we tried using only one of them to complete the translation task.
The experimental results are shown in
Table 6.
Surprisingly, there is no significant difference in performance.
The reason may be that both parsers produce approximately equivalent parse results.
However, the combination of syntax trees outperforms either single syntax tree.
This suggests that the N-best syntax parse trees may enhance the quality of reordering model.
Table 6: Performance with different parse trees (dependency parser only, Stanford parser only, and mixed parse trees)
8 Conclusion and Future Work
In this paper, syntactic knowledge is introduced to improve the global reordering capability of an SMT system.
This method not only inherits the local reordering ability of a standard phrase-based SMT system, but also captures global reordering like a syntax-based SMT system.
The experimental results showed the effectiveness of our method.
In future work, we plan to improve the reordering model by introducing N-best syntax trees and exploiting richer syntactic knowledge.
