We show that phrase structures in Penn Treebank style parses are not optimal for syntax-based machine translation.
We exploit a series of binarization methods to restructure Penn Treebank style trees such that syntactified phrases smaller than Penn Treebank constituents can be acquired and exploited in translation.
We find that employing the EM algorithm to choose among a set of alternative binarizations of a parse tree gives us the best translation results.
1 Introduction
Syntax-based translation models (Eisner, 2003; Galley et al., 2006; Marcu et al., 2006) are usually built directly from Penn Treebank (PTB) (Marcus et al., 1993) style parse trees by composing treebank grammar rules.
As a result, often no substructures corresponding to partial PTB constituents are extracted to form translation rules.
Syntax translation models acquired by composing treebank grammar rules assume that long rewrites are not decomposable into smaller steps.
This effectively restricts the generalization power of the induced model.
For example, suppose we have an xRs (Knight and Graehl, 2004) rule R1 in Figure 1 that translates the Chinese phrase russia minister viktor-chernomyrdin into an English NPB tree fragment yielding an English phrase.
Also suppose that we want to translate a Chinese phrase
viktor-chernomyrdin and his colleague
into English.
What we desire is that if we have another rule R2 as shown in Figure 1, we could
somehow compose it with R1 to obtain the desirable translation.
We unfortunately cannot do this because R1 and R2 are not further decomposable and their substructures cannot be re-used.
The requirement that all translation rules have exactly one root node does not enable us to use the translation of viktor-chernomyrdin in any other contexts than those seen in the training corpus.
A solution to this problem is to right-binarize the left-hand side (LHS) (or the English-side) tree of R1 such that we can decompose R1 into R3 and R4 by factoring NNP(viktor) NNP(chernomyrdin) out as R4 according to the word alignments; and to left-binarize the LHS of R2 by introducing a new tree node that collapses the two NNPs, so as to generalize this rule into rules R5 and R6.
We also need to consistently syntactify the root labels of R4 and the new frontier label of R6 such that these two rules can be composed.
Since labeling is not a concern of this paper, we simply label new nodes with X-bar where X here is the parent label.
With all these in place, we now can translate the foreign sentence by composing R6 and R4 in Figure 1.
Binarizing the syntax trees for syntax-based machine translation is similar in spirit to generalizing parsing models via markovization (Collins, 1997; Charniak, 2000).
But in translation modeling, it is unclear how to effectively markovize the translation rules, especially when the rules are complex like those proposed by Galley et al. (2006).
In this paper, we explore the generalization ability of simple binarization methods like left-, right-, and head-binarization, and also their combinations.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 746-754, Prague, June 2007. ©2007 Association for Computational Linguistics

Figure 1: Generalizing translation rules by binarizing trees.

Simple binarization methods binarize syntax trees in a consistent fashion (left-, right-, or head-) and thus cannot guarantee that all the substructures can be factored out.
For example, right binarization on the LHS of R1 makes available R4, but misses R6 on R2.
We then introduce a parallel restructuring method that binarizes both to the left and to the right at the same time, resulting in a binarization forest.
We employ the EM (Dempster et al., 1977) algorithm to learn the binarization bias for each tree node from the parallel alternatives.
The EM binarization yields the best translation performance.
The rest of the paper is organized as follows.
Section 2 describes related research.
Section 3 defines the concepts necessary for describing the binarization methods.
Section 4 describes the tree binarization methods in detail.
Section 5 describes the forest-based rule extraction algorithm, and section 6 explains how we restructure the trees using the EM algorithm.
The last two sections are for experiments and conclusions.
2 Related Research
Several researchers (Melamed et al., 2004; Zhang et al., 2006) have already proposed methods for binarizing synchronous grammars in the context of machine translation.
Grammar binarization usually maintains an equivalence to the original grammar such that binarized grammars generate the same language and assign the same probability to each string as the original grammar does.
Grammar binarization is often employed to make the grammar fit in a CKY parser.
In our work, we are focused on binarization of parse trees.
Tree binarization generalizes the resulting grammar and changes its probability distribution.
In tree binarization, synchronous grammars built from restructured (binarized) training trees still contain non-binary, multi-level rules and thus still require the binarization transformation so as to be employed by a CKY parser.
The translation model we are using in this paper belongs to the xRs formalism (Knight and Graehl, 2004), which has proved successful for machine translation (Galley et al., 2004; Galley et al., 2006; Marcu et al., 2006).
3 Concepts
We focus on tree-to-string (in noisy-channel model sense) translation models.
Translation models of this type are typically trained on tuples of a source-language sentence f, a target-language (e.g., English) parse tree π that yields e and translates from f, and the word alignments a between e and f. Such a tuple is called an alignment graph in (Galley et al., 2004).
The graph (1) in Figure 2 is such an alignment graph.
Figure 2: Left, right, and head binarizations.
Heads are marked with *'s. New nonterminals introduced by binarization are denoted by X-bars.
A tree node in π is admissible if the f string covered by the node is contiguous and non-empty, and if that f string does not align to any e string that is not covered by the node. An xRs rule can be extracted only from an admissible tree node, so that we do not have to deal with discontiguous f spans in decoding (or synchronous parsing).
For example, in tree (2) in Figure 2, node NPB is not admissible because the f string that the node covers also aligns to NNP4, which is not covered by the NPB.
Node NPB in tree (3), on the other hand, is admissible.
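The admissibility test above can be sketched in code; this is a minimal illustration under our own assumptions (a word alignment encoded as a set of (e-position, f-position) pairs, and the helper name `is_admissible`), not the authors' implementation.

```python
def is_admissible(e_indices, align):
    """Check whether a node yielding the e positions in e_indices is
    admissible: its f-span must be non-empty and contiguous, and no f
    position in that span may align to an e word outside the node.
    `align` is a set of (e_position, f_position) pairs (an assumed
    encoding of the word alignment a)."""
    f_span = {f for (e, f) in align if e in e_indices}
    if not f_span:                      # empty f string: not admissible
        return False
    if max(f_span) - min(f_span) + 1 != len(f_span):
        return False                    # f string is discontiguous
    # closure: every alignment link landing inside the f-span must
    # come from an e word covered by this node
    return all(e in e_indices
               for (e, f) in align
               if min(f_span) <= f <= max(f_span))
```

For instance, with alignment {(0,0), (1,0), (2,1)}, a node covering e positions {0, 1} is admissible, while a node covering only {0} is not, since f position 0 also aligns to e position 1 outside the node.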
A set of sibling tree nodes is called factorizable if we can form an admissible new node dominating them.
For example, in tree (1) of Figure 2, sibling nodes NNP2 NNP3 and NNP4 are factorizable because we can factorize them out and form a new node NPB, resulting in tree (3).
Sibling tree nodes NNP1, NNP2, and NNP3 are not factorizable.
In synchronous parse trees, not all sibling nodes are factorizable, and thus not all sub-phrases can be acquired and syntactified.
The main purpose of our paper is to restructure parse trees by factorization such that syntactified sub-phrases can be employed in translation.
4 Binarizing Syntax Trees
We are going to binarize a tree node n that dominates r children n1, …, nr. Restructuring will be performed by introducing new tree nodes to dominate a subset of the children nodes.
To avoid over-generalization, we allow ourselves to form only one new node at a time.
For example, in Figure 2, we can binarize tree (1) into tree (2), but we are not allowed to form two new nodes, one dominating NNP1 NNP2 and the other dominating NNP3 NNP4.
Since labeling is not the concern of this paper, we relabel the newly formed nodes as n̄.
4.1 Simple binarization methods
The left binarization of node n (e.g., NPB in tree (1) of Figure 2) factorizes the leftmost r − 1 children by forming a new node n̄ (i.e., the new NPB node in tree (2)) to dominate them, leaving the last child nr untouched, and then makes the new node n̄ the left child of n. The method then recursively left-binarizes the newly formed node n̄ until two leaves are reached.
In Figure 2, we left-binarize tree (1) into (2) and then into (4).
The right binarization of node n factorizes the rightmost r − 1 children by forming a new node n̄ to dominate them, leaving the first child n1 untouched, and then makes the new node n̄ the right child of n. The method then recursively right-binarizes the newly formed node n̄. In Figure 2, we right-binarize tree (1) into (3) and then into (7).
The head binarization of node n left-binarizes n if the head is the first child, and otherwise right-binarizes n. We prefer right-binarization to left-binarization when both are applicable under the head restriction because our initial motivation was to generalize the NPB-rooted translation rules.
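As a concrete sketch, the three simple binarizations can be written as follows; the tuple tree encoding, the `-BAR` label suffix standing in for X̄, and the `head` callback are our assumptions for illustration, not the paper's implementation.

```python
def bar(label):
    """Bar label for new nodes; we reuse a single bar level (X-bar)."""
    return label if label.endswith("-BAR") else label + "-BAR"

def left_binarize(tree):
    """Factor the leftmost r-1 children under a new bar node, which
    becomes the left child; recursively binarize the bar node.
    Trees are (label, children) tuples; leaves are strings."""
    if isinstance(tree, str):
        return tree
    label, kids = tree
    kids = [left_binarize(k) for k in kids]
    if len(kids) <= 2:
        return (label, kids)
    return (label, [left_binarize((bar(label), kids[:-1])), kids[-1]])

def right_binarize(tree):
    """Mirror image: factor the rightmost r-1 children."""
    if isinstance(tree, str):
        return tree
    label, kids = tree
    kids = [right_binarize(k) for k in kids]
    if len(kids) <= 2:
        return (label, kids)
    return (label, [kids[0], right_binarize((bar(label), kids[1:]))])

def head_binarize(tree, head):
    """Left-binarize when the head child is leftmost, else
    right-binarize, so the bar node always contains the head;
    `head(label, kids)` returns the head child's index."""
    if isinstance(tree, str):
        return tree
    label, kids = tree
    kids = [head_binarize(k, head) for k in kids]
    if len(kids) <= 2:
        return (label, kids)
    if head(label, kids) == 0:
        return (label, [head_binarize((bar(label), kids[:-1]), head),
                        kids[-1]])
    return (label, [kids[0],
                    head_binarize((bar(label), kids[1:]), head)])
```

On the NPB node from Figure 2, `left_binarize(("NPB", ["NNP1", "NNP2", "NNP3", "NNP4"]))` produces the nesting of trees (2) and (4), while `right_binarize` produces that of trees (3) and (7).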
As we will show in the experiments, binarization of other types of phrases contribute to the translation accuracy improvement as well.
Any of these simple binarization methods is easy to implement, but is incapable of giving us all the factorizable sub-phrases.
Binarizing all the way to the left, for example, from tree (1) to tree (2) and to tree (4) in Figure 2, does not enable us to acquire a substructure that yields NNP3 NNP4 and their translational equivalences.
To obtain more factorizable sub-phrases, we need to parallel-binarize in both directions.
4.2 Parallel binarization
Simple binarizations transform a parse tree into another single parse tree.
Parallel binarization will transform a parse tree into a binarization forest, desirably packed to enable dynamic programming when extracting translation rules from it.
Borrowing terms from parsing semirings (Goodman, 1999), a packed forest is composed of additive forest nodes (⊕-nodes) and multiplicative forest nodes (⊗-nodes).
In the binarization forest, a ⊗-node corresponds to a tree node in the unbinarized tree; this ⊗-node composes several ⊕-nodes, forming a one-level substructure that is observed in the unbinarized tree.
A ⊕-node corresponds to the alternative ways of binarizing the same tree node in the unbinarized tree, and it contains one or more ⊗-nodes.
The same ⊕-node can appear in more than one place in the packed forest, enabling sharing.
Figure 3 shows a packed forest obtained by packing trees (4) and (7) in Figure 2 via the following parallel binarization algorithm.
To parallel-binarize a tree node n that has children n1, …, nr, we employ the following steps:
• We recursively parallel-binarize the children nodes n1, …, nr, producing binarization ⊕-nodes ⊕(n1), …, ⊕(nr), respectively.
• We right-binarize n, if any contiguous¹ subset of the children n2, …, nr is factorizable, by introducing an intermediate tree node labeled n̄. We recursively parallel-binarize n̄ to generate a binarization forest node ⊕(n̄). We form a multiplicative forest node ⊗R as the parent of ⊕(n1) and ⊕(n̄).
• We left-binarize n if any contiguous subset of n1, …, nr−1 is factorizable and if this subset contains n1. Similar to the above right-binarization, we introduce an intermediate tree node labeled n̄, recursively parallel-binarize n̄ to generate a binarization forest node ⊕(n̄), and form a multiplicative forest node ⊗L as the parent of ⊕(n̄) and ⊕(nr).
• We form an additive node ⊕(n) as the parent of the two already formed multiplicative nodes ⊗L and ⊗R.
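A simplified version of these steps can be sketched as follows; node sharing (packing) and the head-restricted variant are omitted, `can_factor` stands in for the alignment-based factorizability test on child subsets, and the `'+'`/`'*'` tags marking ⊕- and ⊗-nodes are our own encoding, not the paper's.

```python
def bar(label):
    """Reuse a single bar level for binarization-introduced nodes."""
    return label if label.endswith("-BAR") else label + "-BAR"

def parallel_binarize(tree, can_factor=lambda forests: True):
    """Turn a (label, children) tree into a binarization forest of
    ('+', [alternatives]) additive nodes over ('*', label, children)
    multiplicative nodes; leaves are strings."""
    if isinstance(tree, str):
        return tree
    label, kids = tree
    forests = [parallel_binarize(k, can_factor) for k in kids]
    return binarize_kids(label, forests, can_factor)

def binarize_kids(label, forests, can_factor):
    if len(forests) <= 2:               # already binary: one alternative
        return ('+', [('*', label, forests)])
    alts = []
    if can_factor(forests[1:]):         # right-binarize: bar node
        b = binarize_kids(bar(label), forests[1:], can_factor)
        alts.append(('*', label, [forests[0], b]))   # over children 2..r
    if can_factor(forests[:-1]):        # left-binarize: bar node
        b = binarize_kids(bar(label), forests[:-1], can_factor)
        alts.append(('*', label, [b, forests[-1]]))  # over children 1..r-1
    if not alts:                        # neither side factorizable:
        alts.append(('*', label, forests))           # keep the node flat
    return ('+', alts)
```

With the default `can_factor`, a three-child node yields an additive node with exactly the left and right alternatives, mirroring trees (2) and (3) of Figure 2.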
The (left and right) binarization conditions consider any subset to enable the factorization of small constituents.
For example, in tree (1) of Figure 2, although NNP1 NNP2 NNP3 of NPB are not factorizable, the subset NNP1 NNP2 is factorizable.
The binarization from tree (1) to tree (2) serves as a relaying step for us to factorize NNP1 NNP2 in tree (4).
The left-binarization condition is stricter than the right-binarization condition to avoid spurious binarization, i.e., to avoid the same subconstituent being reached via both binarizations.

¹We factorize only subsets that cover contiguous spans, to avoid introducing discontiguous constituents for practical purposes. In principle, the algorithm works fine without this binarization condition.
We could transform tree (1) directly into tree (4) without bothering to generate tree (3).
However, skipping tree (3) would make it difficult to apply the EM algorithm to choose a better binarization for each tree node, since tree (4) can be classified neither as a left binarization nor as a right binarization of the original tree (1); it is the result of the composition of two left-binarizations.
In parallel binarization, nodes are not always binarizable in both directions.
For example, we do not need to right-binarize tree (2) because NNP2 NNP3 are not factorizable, and thus cannot be used to form sub-phrases.
It is still possible to right-binarize tree (2) without affecting the correctness of the parallel binarization algorithm, but that will spuriously increase the branching factor of the search for the rule extraction, because we will have to expand more tree nodes.
A restricted version of parallel binarization is the headed parallel binarization, where both the left and the right binarization must respect the head propagation property at the same time.
A nice property of parallel binarization is that for any factorizable substructure in the unbinarized tree, we can always find a corresponding admissible ⊕-node in the parallel-binarized packed forest.
A leftmost substructure like the lowest NPB subtree in tree (4) of Figure 2 can be made factorizable by several successive left binarizations, resulting in the ⊕5(NPB) node in the packed forest in Figure 3.
A substructure in the middle can be factorized by the composition of several left- and right-binarizations.
Therefore, after a tree is parallel-binarized, to make the sub-phrases available to the MT system, all we need to do is to extract rules from the admissible nodes in the packed forest.
Rules that can be extracted from the original unrestructured tree can be extracted from the packed forest as well.
Parallel binarization results in parse forests.
Thus translation rules need to be extracted from training data consisting of (e-forest, f, a)-tuples.
5 Extracting translation rules from (e-forest, f, a)-tuples
The algorithm to extract rules from (e-forest, f, a)-tuples is a natural generalization of the (e-parse, f, a)-based rule extraction algorithm in (Galley et al., 2006).
The input to the forest-based algorithm is a (e-forest, f, a)-triple.
The output of the algorithm is a derivation forest (Galley et al., 2006) composed of xRs rules.
The algorithm recursively traverses the e-forest top-down and extracts rules only at admissible forest nodes.
The following procedure transforms the packed e-forest in Figure 3 into a packed synchronous derivation in Figure 4.
Condition 1: Suppose we reach an additive e-forest node, e.g., ⊕1(NPB) in Figure 3.
For each of ⊕1(NPB)'s children e-forest nodes, ⊗2(NPB) and ⊗11(NPB), we go to Condition 2 to recursively extract rules on these two e-forest nodes, generating multiplicative derivation forest nodes, i.e., ⊗(NPB(NPB:x0 NNP3(viktor) NNP4(chernomyrdin)) → x0 V-C) and ⊗(NPB(NNP1:x0 NPB(NNP2:x1 NPB:x2)) → x0 x1 x2) in Figure 4.
We make these new ⊗ nodes children of ⊕(NPB) in the derivation forest.
Condition 2: Suppose we reach a multiplicative parse forest node, e.g., ⊗11(NPB) in Figure 3.
We extract rules rooted at it using the procedure in (Galley et al., 2006), forming multiplicative derivation forest nodes, i.e., ⊗(NPB(NNP1:x0 NPB(NNP2:x1 NPB:x2)) → x0 x1 x2). We then go to Condition 1 to form the derivation forest on the additive frontier e-forest nodes of the newly extracted rules, generating additive derivation forest nodes, i.e., ⊕(NNP1), ⊕(NNP2), and ⊕(NPB).
We make these ⊕ nodes the children of the node ⊗(NPB(NNP1:x0 NPB(NNP2:x1 NPB:x2)) → x0 x1 x2) in the derivation forest.
This algorithm is a natural extension of the extraction algorithm in (Galley et al., 2006) in the sense that we have an extra condition (Condition 1) to relay rule extraction on additive e-forest nodes.
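The two mutually recursive conditions can be skeletonized as follows; the actual rule extraction at a ⊗-node is delegated to a callback standing in for the Galley et al. (2006) procedure, and all names and encodings here are our assumptions, not the paper's code.

```python
def derive_from_or(or_node, extract_at):
    """Condition 1: relay extraction through an additive e-forest node
    ('+', [multiplicative children]); returns an additive derivation
    forest node over the results."""
    return ('+', [derive_from_and(m, extract_at) for m in or_node[1]])

def derive_from_and(and_node, extract_at):
    """Condition 2: extract rules rooted at a multiplicative e-forest
    node. `extract_at` returns (rule, frontier) pairs, where frontier
    lists the additive e-forest nodes on the rule's frontier; we
    recurse on each of them (back to Condition 1)."""
    return ('*', [(rule, [derive_from_or(f, extract_at) for f in frontier])
                  for rule, frontier in extract_at(and_node)])
```

Running `derive_from_or` on the root of the packed e-forest with a real extractor would yield the packed synchronous derivation forest of Figure 4.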
Figure 4: Derivation forest.

It is worthwhile to eliminate the spuriously ambiguous rules that are introduced by the parallel binarization. For example, we may extract the following two rules:
These two rules, however, are not really distinct.
They both converge to the following rules if we delete the auxiliary X̄ nodes.
The forest-based rule extraction algorithm produces much larger grammars than the tree-based one, making it difficult to scale to very large training data.
From a 50M-word Chinese-to-English parallel corpus, we can extract more than 300 million translation rules, while the tree-based rule extraction algorithm gives approximately 100 million.
However, the restructured trees from the simple binarization methods are not guaranteed to give the best trees for syntax-based machine translation.
What we desire is a binarization method that still produces single parse trees, but is able to mix left binarization and right binarization in the same tree.
In the following, we shall use the EM algorithm to learn the desirable binarization on the forest of binarization alternatives proposed by the parallel binarization algorithm.
6 Learning how to binarize via the EM algorithm
The basic idea of applying the EM algorithm to choose a restructuring is as follows.
We perform a set of binarization operations on a parse tree t. Each binarization b is a sequence of binarizations applied to the necessary (i.e., factorizable) nodes in t in pre-order.
Each binarization b results in a restructured tree τb. We extract rules from (τb, f, a), generating a translation model consisting of parameters (i.e., rule probabilities) θ.

Figure 5: Using the EM algorithm to choose restructuring.

Our aim is to obtain the binarization b* that gives the best likelihood of the restructured training data consisting of (τb, f, a)-tuples.
That is,

    b* = argmax_b p(τb, f, a | θ)
In practice, we cannot enumerate the exponentially many binarized trees for a given e-parse.
We therefore use the packed forest to store all the binarizations that operate on an e-parse in a compact way, and then use the inside-outside algorithm (Lari and Young, 1990; Knight and Graehl, 2004) for model estimation.
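A minimal E-step of the inside-outside computation on an (unpacked) derivation forest might look like the sketch below; memoization over shared nodes and the M-step normalization per rule root are omitted, and the forest encoding and function names are our assumptions.

```python
from collections import defaultdict

def inside(node, p):
    """Inside probability of a derivation-forest node: an additive
    node ('+', [alts]) sums over alternatives; a multiplicative node
    ('*', rule, kids) multiplies the rule probability p[rule] by the
    inside probabilities of its children."""
    if node[0] == '+':
        return sum(inside(a, p) for a in node[1])
    _, rule, kids = node
    prob = p[rule]
    for k in kids:
        prob *= inside(k, p)
    return prob

def e_step(root, p):
    """Expected rule counts for one training example: the posterior
    probability of each multiplicative node given the whole forest."""
    z = inside(root, p)
    counts = defaultdict(float)

    def visit(node, outside):
        if node[0] == '+':
            for a in node[1]:
                visit(a, outside)
            return
        _, rule, kids = node
        kid_ins = [inside(k, p) for k in kids]
        total = p[rule]
        for b in kid_ins:
            total *= b
        counts[rule] += outside * total / z
        for b, k in zip(kid_ins, kids):
            # a child's outside mass is this node's outside times
            # everything in the node except that child's inside
            visit(k, outside * total / b if b else 0.0)

    visit(root, 1.0)
    return counts
```

An M-step would renormalize these counts over rules sharing a left-hand-side root; iterating to convergence yields the probabilities from which Viterbi derivations are then read off.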
Since it is well known that applying EM with tree fragments of different sizes causes overfitting (Johnson, 1998), and since it is also known that syntax MT models with larger composed rules in the mix significantly outperform rules that minimally explain the training data (minimal rules) in translation accuracy (Galley et al., 2006), we decompose p(τb, f, a) using minimal rules while running the EM algorithm; after the EM restructuring is finished, we build the final translation model using composed rules for evaluation.
Figure 5 is the actual pipeline that we use for EM binarization.
We first generate a packed e-forest via parallel binarization.
We then extract minimal translation rules from the (e-forest, f, a)-tuples, producing synchronous derivation forests.
We run the inside-outside algorithm on the derivation forests until convergence.
We obtain the Viterbi derivations and project the English parses from the derivations.
Finally, we extract composed rules using Galley et al. (2006)'s (e-tree, f, a)-based rule extraction algorithm.
This procedure corresponds to the path 1→3→4→2 in the pipeline.
7 Experiments
We carried out a series of experiments to compare the performance of different binarization methods in terms of BLEU on Chinese-to-English translation tasks.
7.1 Experimental setup
Our bitext consists of 16M words, all in the mainland-news domain.
Our development set is a subset of the NIST 2002 evaluation set.²

²We removed long sentences from the NIST02 evaluation set to speed up discriminative training.
We used a bottom-up, CKY-style decoder that works with binary xRs rules obtained via a synchronous binarization procedure (Zhang et al., 2006).
The decoder prunes hypotheses using strategies described in (Chiang, 2007).
The parse trees on the English side of the bitexts were generated using a parser (Soricut, 2004) implementing the Collins parsing models (Collins, 1997).
We used the EM procedure described in (Knight and Graehl, 2004) to perform the inside-outside algorithm on synchronous derivation forests and to generate the Viterbi derivation forest.
The new X̄ nodes introduced by binarization are not counted when computing the rule size limit, unless they appear as rule roots.
The motivation is that binarization deepens the parses and increases the number of tree nodes.
In (Galley et al., 2006), a composed rule is extracted only if the number of internal nodes it contains does not exceed a limit (i.e., 4), similar to the phrase length limit in phrase-based systems.
This means that rules extracted from the restructured trees would be smaller than those from the unrestructured trees once the X̄ nodes are deleted.
Ignoring X̄ nodes when computing rule sizes preserves the unrestructured rules in the resulting translation model and adds substructures as bonuses.
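The size computation described above can be sketched as follows; the `-BAR` suffix marking binarization-introduced nodes and the tuple tree encoding are illustrative assumptions on our part.

```python
def rule_size(lhs):
    """Count the internal nodes of a rule's LHS tree toward the size
    limit, skipping binarization-introduced bar nodes unless the bar
    node is the rule root. Trees are (label, children) tuples;
    leaves and frontier variables are strings."""
    def count(node, is_root):
        if isinstance(node, str):       # frontier variable or word
            return 0
        label, kids = node
        own = 0 if label.endswith("-BAR") and not is_root else 1
        return own + sum(count(k, False) for k in kids)
    return count(lhs, True)
```

Under this counting, a rule whose LHS nests a bar node inside NPB has the same size as the corresponding unrestructured rule, so restructuring never shrinks the rules admitted by the limit.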
7.2 Experiment results
Table 1 shows the BLEU scores of mixed-case and detokenized translations of the different systems.
We see that all the binarization methods improve the baseline system that does not apply any binarization algorithm.
The EM-binarization performs the best among all the restructuring methods, leading to 1.0 BLEU point improvement.
We also computed the bootstrap p-values (Riezler and Maxwell, 2005) for the pairwise BLEU comparison between the baseline system and each of the systems trained from binarized trees.
The significance test shows that the EM binarization result is statistically significantly better than the baseline system (p < 0.005), even though the baseline is already quite strong.
To the best of our knowledge, 37.94 is the highest BLEU score on this test set to date.
As also shown in Table 1, the grammars trained from the binarized training trees are almost twice the size of the grammar obtained with no binarization.
The extra rules are substructures factored out by these binarization methods.
How many more substructures (or translation rules) can be acquired is partially determined by how many more admissible nodes each binarization method can factorize, since rules are extractable only from admissible tree nodes.
According to Table 1, binarization methods significantly increase the number of admissible nodes in the training trees.
The EM binarization makes available the largest number of admissible nodes, and thus results in the most rules.

Table 1: Translation performance, grammar size, and number of admissible nodes in training versus binarization algorithm (left, right, head, and EM binarization). BLEU scores are for mixed-case and detokenized translations, as we usually do for NIST MT evaluations.

Table 2: Binarization bias (left- vs. right-binarization) learned by EM, per nonterminal.
The EM binarization factorizes more admissible nodes because it mixes both left and right binarizations in the same tree.
We computed the binarization biases learned by the EM algorithm for each nonterminal from the binarization forests of headed parallel binarizations of the training trees, obtaining the statistics in Table 2.
Of course, the binarization bias chosen by left-/right-binarization methods would be 100% deterministic.
One noticeable message from Table 2 is that most categories are actually biased toward left-binarization, although the motivating example in our introduction is an NPB, which needed right binarization.
The main reason might be that the head sub-constituents of most categories tend to be on the left. However, according to the performance comparison between head binarization and EM binarization, head binarization does not suffice, because we still need to choose between left and right binarization when both are head binarizations.
8 Conclusions
In this paper, we not only studied the impact of simple tree binarization algorithms on the performance of end-to-end syntax-based MT, but also proposed binarization methods that mix more than one simple binarization in the binarization of the same parse tree.
Whether to binarize a tree node to the left or to the right was learned by employing the EM algorithm on a set of alternative binarizations and choosing the Viterbi one.
The EM binarization method is informed by word alignments such that unnecessary new tree nodes will not be "blindly" introduced.
To the best of our knowledge, our research is the first to generalize a syntax-based translation model by restructuring and to achieve significant improvement over a strong baseline.
Our work differs from traditional work on binarization of synchronous grammars in that we are not concerned with the equivalence of the binarized grammar to the original grammar, but intend to generalize the original grammar via restructuring of the training parse trees to improve translation performance.
Acknowledgments
The authors would like to thank David Chiang, Bryant Huang, and the anonymous reviewers for their valuable feedback.
