We take a multi-pass approach to machine translation decoding when using synchronous context-free grammars as the translation model and n-gram language models: the first pass uses a bigram language model, and the resulting parse forest is used in the second pass to guide search with a trigram language model.
The trigram pass closes most of the performance gap between a bigram decoder and a much slower trigram decoder, but takes time that is insignificant in comparison to the bigram pass.
An additional fast decoding pass maximizing the expected count of correct translation hypotheses increases the BLEU score significantly.
1 Introduction
Statistical machine translation systems based on synchronous grammars have recently shown great promise, but one stumbling block to their widespread adoption is that the decoding, or search, problem during translation is more computationally demanding than in phrase-based systems.
This complexity arises from the interaction of the tree-based translation model with an n-gram language model.
Use of longer n-grams improves translation results, but exacerbates this interaction.
In this paper, we present three techniques for attacking this problem in order to obtain fast, high-quality decoders.
First, we present a two-pass decoding algorithm, in which the first pass explores states resulting from an integrated bigram language model, and the second pass expands these states into trigram-based
states.
The general bigram-to-trigram technique is common in speech recognition (Murveit et al., 1993), where lattices from a bigram-based decoder are re-scored with a trigram language model.
We examine the question of whether, given the reordering inherent in the machine translation problem, lower order n-grams will provide as valuable a search heuristic as they do for speech recognition.
Second, we explore heuristics for agenda-based search, and present a heuristic for our second pass that combines precomputed language model information with information derived from the first pass.
With this heuristic, we achieve the same BLEU scores and model cost as a trigram decoder with essentially the same speed as a bigram decoder.
Third, given the significant speedup in the agenda-based trigram decoding pass, we can rescore the trigram forest to maximize the expected count of correct synchronous constituents of the model, using the product of inside and outside probabilities.
Maximizing the expected count ofsynchronous constituents approximately maximizes BLEU.
We find a significant increase in BLEU in the experiments, with minimal additional time.
2 Language Model Integrated Decoding for SCFG
We begin by introducing Synchronous Context Free Grammars and their decoding algorithms when an n-gram language model is integrated into the grammatical search space.
A synchronous CFG (SCFG) is a set of context-free rewriting rules for recursively generating string pairs.
Each synchronous rule is a pair of CFG rules
with the nonterminals on the right hand side of one CFG rule being one-to-one mapped to the otherCFG rule via a permutation n. We adopt the SCFG notation of Satta and Peserico (2005).
Superscript indices in the right-hand side of grammar rules:
indicate that the nonterminals with the same index are linked across the two languages, and will eventually be rewritten by the same rule application.
Each Xi is a variable which can take the value of any nonterminal in the grammar.
In this paper, we focus on binary SCFGs and without loss of generality assume that only the pre-terminal unary rules can generate terminal string pairs.
Thus, we are focusing on Inversion Transduc-tion Grammars (Wu, 1997) which are an important subclass of SCFG.
Formally, the rules in our grammar include preterminal unary rules:
for pairing up words or phrases in the two languages and binary production rules with straight or inverted orders that are responsible for building up upper-level synchronous structures.
They are straight rules written:
X — (YZ).
Most practical non-binary SCFGs can be bina-rized using the synchronous binarization technique by Zhang et al. (2006).
The Hiero-style rules of (Chiang, 2005), which are not strictly binary but binary only on nonterminals:
can be handled similarly through either offline binarization or allowing a fixed maximum number of gap words between the right hand side nonterminals in the decoder.
For these reasons, the parsing problems for more realistic synchronous CFGs such as in Chiang (2005) and Galley et al. (2006) are formally equivalent to ITG.
Therefore, we believe our focus on ITG
for the search efficiency issue is likely to generalize to other SCFG-based methods.
Without an n-gram language model, decoding using SCFG is not much different from CFG parsing.
At each time a CFG rule is applied on the input string, we apply the synchronized CFG rule for the output language.
From a dynamic programming point of view, the DP states are X[i, j], where X ranges over all possible nonterminals and i and j range over 0 to the input string length Each state stores the best translations obtainable.
When we reach the top state S[0, we can get the best translation for the entire sentence.
The algorithm is O(|w|3).
However, when we want to integrate an n-gram language model into the search, our goal is searching for the derivation whose total sum of weights of productions and n-gram log probabilities is maximized.
Now the adjacent span-parameterized states Xand X] can interact with each other by "peeping into" the leading and trailing n — 1 words on the output side for each state.
Different boundary words differentiate the span-parameterized states.
Thus, to preserve the dynamic programming property, we need to refine the states by adding the boundary words into the parameterization.
The LM-integrated states are represented as X[i, j,Since the number of variables involved at each DP step has increased to 3 + 4(n — 1), the decoding algorithm is asymptotically 0(|w|3+4(ra_1)).
Although it is possible to use the "hook" trick of Huang et al. (2005) to factor-ize the DP operations to reduce the complexity to 0(|w|3+3(ra_1)), when n is greater than 2, the complexity is still prohibitive.
3 Multi-pass LM-Integrated Decoding
In this section, we describe a multi-pass progressive decoding technique that gradually augments the LM-integrated states from lower orders to higher orders.
For instance, a bigram-integrated state [X, i, j, u, v] is said to be a coarse-level state of a trigram-integrate state [X, i, j, u, u',v',v], because the latter state refines the previous by specifying more inner words.
eral idea is to use a simple and fast decoding algorithm to constrain the search space of a following more complex and slower technique.
More specifically, a bigram decoding pass is executed forward and backward to figure out the probability of each state.
Then the states can be pruned based on their global score using the product of inside and outside probabilities.
The advanced decoding algorithm will use the constrained space (a lattice in the case of speech recognition) as a grammatical constraint to help it focus on a smaller search space on which more discriminative features are brought in.
The same idea has been applied to forests for parsing.
Charniak and Johnson (2005) use a PCFG to do a pass of inside-outside parsing to reduce the state space of a subsequent lexicalized n-best parsing algorithm to produce parses that are further re-ranked by a MaxEnt model.
We take the same view as in speech recognition that a trigram integrated model is a finer-grained model than bigram model and in general we can do an n — 1 -gram decoding as a predicative pass for the following n-gram pass.
We need to do inside-outside parsing as coarse-to-fine parsers do.
However, we use the outside probability or cost information differently.
We do not combine the inside and outside costs of a simpler model to prune the space for a more complex model.
Instead, for a given finer-gained state, we combine its true inside cost with the outside cost of its coarse-level counter-part to estimate its worthiness of being explored.
The use of the outside cost from a coarser-level as the outside estimate makes our method naturally fall in the framework of A* parsing.
Klein and Manning (2003) describe an A* parsing framework for monolingual parsing and admissible outside estimates that are computed using inside/outside parsing algorithm on simplified PCFGs compared to the original PCFG.
Zhang and Gildea (2006) describe A* for ITG and develop admissible heuristics for both alignment and decoding.
Both have shown the effectiveness of A* in situations where the outside estimate approximates the true cost closely such as when the sentences are short.
For decoding long sentences, it is difficult to come up with good admissible (or inadmissible) heuristics.
If we can afford a bigram decoding pass, the outside cost from a bigram model is conceivably a
very good estimate of the outside cost using a tri-gram model since a bigram language model and a trigram language model must be strongly correlated.
Although we lose the guarantee that the bigram-pass outside estimate is admissible, we expect that it approximates the outside cost very closely, thus very likely to effectively guide the heuristic search.
3.1 Inside-outside Coarse Level Decoding
We describe the coarse level decoding pass in this section.
The decoding algorithms for the coarse level and the fine level do not necessarily have to be the same.
The fine level decoding algorithm is an A* algorithm.
The coarse level decoding algorithm can be CKY or A* or other alternatives.
Conceptually, the algorithm is finding the shortest hyperpath in the hypergraph in which the nodes are states like X[i, j, u1,..,ra_1, v1,..,ra_1], and the hy-peredges are the applications of the synchronous rules to go from right-hand side states to left-hand side states.
The root of the hypergraph is a special node S'[0, (s), (/s)] which means the entire input sentence has been translated to a string starting with the beginning-of-sentence symbol and ending at the end-of-sentence symbol.
Ifwe imagine a starting node that goes to all possible basic translation pairs, i.e., the instances of the terminal translation rules for the input, we are searching the shortest hyper path from the imaginary bottom node to the root.
To help our outside parsing pass, we store the backpointers at each step ofexploration.
The outside parsing pass, however, starts from the root S(s), (/s)] and follows the back-pointers downward to the bottom nodes.
The nodes need to be visited in a topological order so that whenever a node is visited, its parents have been visited and its outside cost is over all possible outside parses.
The algorithm is described in pseudocode in Algorithm 1.
The number of hyperedges to traverse is much fewer than in the inside pass because not every state explored in the bottom up inside pass can finally reach the goal.
As for normal outside parsing, the operations are the reverse of inside parsing.
We propagate the outside cost of the parent to its children by combining with the inside cost of the other children and the interaction cost, i.e., the language model cost between the focused child and the other children.
Since we want to approximate the Viterbi
outside cost, it makes sense to maximize over all possible outside costs for a given node, to be consistent with the maximization ofthe inside pass.
For the nodes that have been explored in the bottom up pass but not in the top-down pass, we set their outside cost to be infinity so that their exploration is preferred only when the viable nodes from the first pass have all been explored in the fine pass.
3.2 Heuristics for Fine-grained Decoding
In this section, we summarize the heuristics for finer level decoding.
The motivation for combining the true inside cost of the fine-grained model and the outside estimate given by the coarse-level parsing is to approximate the true global cost of a fine-grained state as closely as possible.
We can make the approximation even closer by incorporating local higherorder outside n-gram information for a state of X[i, j,ui,„,n_i,vi,„,n_i] into account.
We call this the best-border estimate.
For example, the best-border estimate for trigram states is:
where S(i, j) is the set of candidate target language words outside the span of (i, j). hBB is the product of the upper bounds for the two on-the-border n-grams.
This heuristic function was one of the admissible heuristics used by Zhang and Gildea (2006).
The benefit of including the best-border estimate is to refine the outside estimate with respect to the inner words which refine the bigram states into the trigram states.
If we do not take the inner words into consideration when computing the outside cost, all states that map to the same coarse level state would have the same outside cost.
When the simple best-border estimate is combined with the coarse-level outside estimate, it can further boost the search as will be shown in the experiments.
To summarize, our recipe
for faster decoding is that using
where // is the Viterbi inside cost and a is the Viterbi outside cost, to globally prioritize the n-gram integrated states on the agenda for exploration.
3.3 Alternative Efficient Decoding Algorithms
The complexity of n-gram integrated decoding for SCFG has been tackled using other methods.
The hook trick of Huang et al. (2005) factor-izes the dynamic programming steps and lowers the asymptotic complexity of the n-gram integrated decoding, but has not been implemented in large-scale systems where massive pruning is present.
The cube-pruning by Chiang (2007) and the lazy cube-pruning of Huang and Chiang (2007) turn the computation ofbeam pruning ofCYK decoders into a top-k selection problem given two columns of translation hypotheses that need to be combined.
The insight for doing the expansion top-down lazily is that there is no need to uniformly explore every cell.
The algorithm starts with requesting the first best hypothesis from the root.
The request translates into requests for the k-bests of some of its children and grandchildren and so on, because re-ranking at each node is needed to get the top ones.
Venugopal et al. (2007) also take a two-pass decoding approach, with the first pass leaving the language model boundary words out of the dynamic programming state, such that only one hypothesis is retained for each span and grammar symbol.
4 Decoding to Maximize BLEU
The ultimate goal of efficient decoding to find the translation that has a highest evaluation score using the least time possible.
Section 3 talks about utilizing the outside cost of a lower-order model to estimate the outside cost of a higher-order model, boosting the search for the higher-order model.
By doing so, we hope the intrinsic metric of our model agrees with the extrinsic metric of evaluation so that fast search for the model is equivalent to efficient decoding.
But the mismatch between the two is evident, as we will see in the experiments.
In this section,
Algorithm 1 OutsideCoarseParsing()
end if end for end for
we deal with the mismatch by introducing another decoding pass that maximizes the expected count of synchronous constituents in the tree corresponding to the translation returned.
BLEU is based on n-gram precision, and since each synchronous constituent in the tree adds a new 4-gram to the translation at the point where its children are concatenated, the additional pass approximately maximizes
BLEU.
Kumar and Byrne (2004) proposed the framework of Minimum Bayesian Risk (MBR) decoding that minimizes the expected loss given a loss function.
Their MBR decoding is a reranking pass over an n-best list of translations returned by the decoder.
Our algorithm is another dynamic programming decoding pass on the trigram forest, and is similar to the parsing algorithm for maximizing expected labelled recall presented by Goodman (1996).
4.1 Maximizing the expected count of correct synchronous constituents
We introduce an algorithm that maximizes the expected count of correct synchronous constituents.
Given a synchronous constituent specified by the state [X, i, j, u, u', v', v], its probability of being correct in the model is
where a is the outside probability and / is the inside probability.
We approximate / and a using the Viterbi probabilities.
Since decoding from bottom up in the trigram pass already gives us the inside Viterbi scores, we only have to visit the nodes in the reverse order once we reach the root to compute the Viterbi outside scores.
The outside-pass Algorithm 1 for bigram decoding can be generalized to the trigram case.
We want to maximize over all translations (synchronous trees) T in the forest after the trigram decoding pass according to
max EC ([X, i, j, u, u', v', v]).
The expression can be factorized and computed using dynamic programming on the forest.
5 Experiments
We did our decoding experiments on the LDC 2002 MT evaluation data set for translation of Chinese newswire sentences into English.
The evaluation data set has 10 human translation references for each sentence.
There are a total of371 Chinese sentences of no more than 20 words in the data set.
These sentences are the test set for our different versions of language-model-integrated ITG decoders.
We evaluate the translation results by comparing them against the reference translations using the BLEU metric.
The word-to-word translation probabilities are from the translation model of IBM Model 4 trained on a 160-million-word English-Chinese parallel corpus using GIZA++.
The phrase-to-phrase translation probabilities are trained on 833K parallel sentences.
758K of this was data made available by ISI, and another 75K was FBIS data.
The language model is trained on a 30-million-word English corpus.
The rule probabilities for ITG are trained using EM on a corpus of 18,773 sentence pairs with a total of 276,113 Chinese words and 315,415 English words.
5.1 Bigram-pass Outside Cost as Trigram-pass Outside Estimate
We first fix the beam for the bigram pass, and change the outside heuristics for the trigram pass to show the difference before and after using the first-pass outside cost estimate and the border estimate.
We choose the beam size for the CYK bigram pass to be 10 on the log scale.
The first row of Table 1 shows the number of explored hyperedges for the bigram pass and its BLEU score.
In the rows below, we compare the additional numbers of hyperedges that need to be explored in the trigram pass using different outside heuristics.
It takes too long to finish using uniform outside estimate; we have to use a tight beam to control the agenda-based exploration.
Using the bigram outside cost estimate makes a huge difference.
Furthermore, using Equation 1, adding the additional heuristics on the best trigrams that can appear on the borders of the current hypothesis, on average we only need to explore 2700 additional hy-peredges per sentence to boost the BLEU score from 21.77 to 23.46.
The boost is so significant that overall the dominant part of search time is no longer the second pass but the first bigram pass (inside pass actually) which provides a constrained space and outside heuristics for the second pass.
5.2 Two-pass decoding versus One-pass decoding
By varying the beam size for the first pass, we can plot graphs of model scores versus search time and BLEU scores versus search time as shown in Figure 1.
We use a very large beam for the second pass due to the reason that the outside estimate for the second pass is discriminative enough to guide the
Decoding Method Avg.
Hyperedges BLEU
Bigram Pass
Trigram Pass
Trigram One-pass, with Beam
Table 1: Speed and BLEU scores for two-pass decoding.
UNI stands for the uniform (zero) outside estimate.
BO standsforthebigramoutsidecostestimate.
BBstandsfor the best border estimate, which is added to BO.
Decoder Time BLEU Model Score
Table 2: Summary of different trigram decoding strategies, using about the same time (10 seconds per sentence).
search.
We sum up the total number of seconds for both passes to compare with the baseline systems.
On average, less than 5% of time is spent in the second pass.
In Figure 1, we have four competing decoders. bitrijcyk is our two-pass decoder, using CYK as the first pass decoding algorithm and using agenda-based decoding in the second pass which is guided by the first pass. agenda is our trigram-integrated agenda-based decoder.
The other two systems are also one-pass. cyk is our trigram-integrated CYK decoder. lazy kbest is our top-down k-best-style de-coder.1
Figure 1 (left) compares the search efficiencies of the four systems. bitrijcyk at the top ranks first. cyk follows it.
The curves of lazy kbest and agenda cross
1In our implementation of the lazy-cube-pruning based ITG decoder, we vary the re-ranking buffer size and the the top-fc list size which are the two controlling parameters for the search space.
But we did not use any LM estimate to achieve early stopping as suggested by Huang and Chiang (2007).
Also, we did not have a translation-model-only pruning pass.
So the results shown in this paper for the lazy cube pruning method is not of its best performance.
and are both below the curves of bitrijcyk and cyk.
This figure indicates the advantage of the two-pass decoding strategy in producing translations with a high model score in less time.
However, model scores do not directly translate into BLEU scores.
In Figure 1(right), bitrijcyk is better than CYK only in a certain time window when the beam is neither too small nor too large.
But the window is actually where we are interested - it ranges from 5 seconds per sentence to 20 seconds per sentence.
Table 2 summarizes the performance of the four decoders when the decoding speed is at 10 seconds per sentence.
We have many choices in implementing the bigram decoding pass.
We can do either CYK or agenda-based decoding.
We can also use the dynamic programming hook trick.
We are particularly interested in the effect of the hook trick in a large-scale system with aggressive pruning.
Figure 2 compares the four possible combinations of the decoding choices for the first pass: bitrijcyk, bitri.agenda, bitrijcykJiook and bitri agenda Jiook. bitrijcyk which simply uses CYK as the first pass decoding algorithm is the best in terms of performance and time trade-off.
The hook-based decoders do not show an advantage in our experiments.
Only bitri jagendaJiook gets slightly better than bi-trijagenda when the beam size increases.
So, it is very likely the overhead of building hooks offsets its benefit when we massively prune the hypotheses.
The bitrijcyk decoder spends little time in the agenda-based trigram pass, quickly reaching the goal item starting from the bottom of the chart.
In order to maximize BLEU score using the algorithm described in Section 4, we need a sizable trigram forest as a starting point.
Therefore, we keep popping off more items from the agenda after the goal is reached.
Simply by exploring more (200 times the log beam) after-goal items, we can optimize the Viterbi synchronous parse significantly, shown in Figure 3(left) in terms of model score versus search time.
However, the mismatch between model score and BLEU score persists.
So, we try our algorithm
of maximizing expected count of synchronous constituents on the trigram forest.
We find significant improvement in BLEU, as shown in Figure 3 (right) by the curve of bitri jcykjepassme jcons. bitri jcykjepass me jcons beats both bitrijcyk and cyk in terms of BLEU versus time if using more than 1.5 seconds on average to decode each sentence.
At each time point, the difference in BLEU between bitri jcykjepass me jcons and the highest of bitrijcyk and cyk is around .
5 points consistently as we vary the beam size for the first pass.
We achieve the record-high BLEU score 24.34 using on average 21 seconds per sentence, compared to the next-highest score of 23.92 achieved by cyk using on average 78 seconds per sentence.
6 Conclusion
We present a multi-pass method to speed up n-gram integrated decoding for SCFG.
We use an inside/outside parsing algorithm to get the Viterbi outside cost of bigram integrated states which is used as an outside estimate for trigram integrated states.
The coarse-level outside cost plus the simple estimate for border trigrams speeds up the trigram decoding pass hundreds of times compared to using no outside estimate.
Maximizing the probability of the synchronous derivation is not equivalent to maximizing BLEU.
We use a rescoring decoding pass that maximizes the expected count of synchronous constituents.
This technique, together with the progressive search at previous stages, gives a decoder that produces the highest BLEU score we have obtained on the data in a very reasonable amount of time.
As future work, new metrics for the final pass may be able to better approximate BLEU.
As the bigram decoding pass currently takes the bulk of the decoding time, better heuristics for this phase may speed up the system further.
Acknowledgments This work was supported by
NSF ITR-0428020 and NSF IIS-0546554.
