We achieved state-of-the-art performance in statistical machine translation by using a large number of features with an online large-margin training algorithm.
Millions of parameters were tuned only on a small development set consisting of fewer than 1K sentences.
Experiments on Arabic-to-English translation indicated that a model trained with sparse binary features outperformed a conventional SMT system with a small number of features.
1 Introduction
The recent advances in statistical machine translation have been achieved by discriminatively training a small number of real-valued features based either on (hierarchical) phrase-based translation (Och and Ney, 2004; Koehn et al., 2003; Chiang, 2005) or syntax-based translation (Galley et al., 2006).
However, this approach does not scale well to a large number of features, of the order of millions.
Tillmann and Zhang (2006), Liang et al. (2006) and Bangalore et al. (2006) introduced sparse binary features for statistical machine translation trained on a large training corpus.
In this framework, the problem of translation is regarded as a sequential labeling problem, in the same way as part-of-speech tagging, chunking or shallow parsing.
However, the use of a large number of features did not provide any significant improvements over a conventional small feature set.
Bangalore et al. (2006) trained the lexical choice model by using Conditional Random Fields (CRFs) realized on a WFST.
Their model was reduced to a Maximum Entropy Markov Model (MEMM) to handle a large number of features, which, in turn, faced the label bias problem (Lafferty et al., 2001).
Tillmann and Zhang (2006) trained their feature set using an online discriminative algorithm.
Since the decoding is still expensive, their online training approach is approximated by enlarging a merged k-best list one-by-one with a 1-best output.
Liang et al. (2006) introduced an averaged perceptron algorithm, but employed only 1-best translation.
In Watanabe et al. (2006a), binary features were trained only on a small development set using a variant of voted perceptron for reranking k-best translations.
Thus, the improvement is merely relative to the baseline translation system, namely whether or not there is a good translation in their k-best.
We present a method to estimate a large number of parameters — of the order of millions — using an online training algorithm.
Although it was intuitively considered to be prone to overfitting, training on a small development set of fewer than 1K sentences was sufficient to achieve improved performance.
In this method, each training sentence is decoded and weights are updated at every iteration (Liang et al., 2006).
When updating model parameters, we employ a memorization-variant of a local updating strategy (Liang et al., 2006) in which parameters are optimized toward a set of good translations found in the k-best list across iterations.
The objective function is an approximated BLEU (Watanabe et al., 2006a) that scales the loss of a sentence BLEU to a document-wise loss.
The parameters are trained using the Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2006).
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 764-773, Prague, June 2007. ©2007 Association for Computational Linguistics
MIRA has been successfully employed in dependency parsing (McDonald et al., 2005) and the joint-labeling/chunking task (Shimizu and Haas, 2006).
Experiments were carried out on an Arabic-to-English translation task, and we achieved significant improvements over conventional minimum error training with a small number of features.
This paper is organized as follows: First, Section 2 introduces the framework of statistical machine translation.
As a baseline SMT system, we use the hierarchical phrase-based translation with an efficient left-to-right generation (Watanabe et al., 2006b) originally proposed by Chiang (2005).
In Section 3, a set of sparse binary features is defined, together with the numeric features of our baseline system.
Section 4 introduces an online large-margin training algorithm using MIRA with our key components.
The experiments are presented in Section 5 followed by discussion in Section 6.
2 Statistical Machine Translation
We use a log-linear approach (Och, 2003) in which a foreign language sentence f is translated into another language, for example English, e, by seeking a maximum solution:

ê = argmax_e wᵀ · h(f, e)   (1)

where h(f, e) is a large-dimension feature vector and w is a weight vector that scales the contribution from each feature.
Each feature can take any real value, such as the log of the n-gram language model to represent fluency, or a lexicon model to capture the word or phrase-wise correspondence.
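As a concrete illustration, the scoring rule in Eq. 1 can be sketched with sparse feature dicts; all feature names and weights below are invented for illustration, not taken from the paper.

```python
# A sketch of the log-linear scoring rule: among candidates e for a source
# sentence f, choose the one maximizing w . h(f, e). Feature vectors are
# sparse dicts; names and weights here are illustrative only.

def score(w, h):
    """Dot product of a weight dict and a sparse feature dict."""
    return sum(w.get(feat, 0.0) * val for feat, val in h.items())

def decode(w, candidates):
    """candidates: list of (translation, sparse_feature_dict) pairs."""
    return max(candidates, key=lambda c: score(w, c[1]))[0]

w = {"lm": 0.5, "wp:violate|tnthk": 1.2, "length": -0.1}
candidates = [
    ("they violate the accord", {"lm": -2.0, "wp:violate|tnthk": 1.0, "length": 4}),
    ("they break the accord",   {"lm": -1.5, "length": 4}),
]
print(decode(w, candidates))
```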
2.1 Hierarchical Phrase-based SMT
Chiang (2005) introduced the hierarchical phrase-based translation approach, in which non-terminals are embedded in each phrase.
A translation is generated by hierarchically combining phrases using the non-terminals.
Such a quasi-syntactic structure can naturally capture the reordering of phrases that is not directly modeled by a conventional phrase-based approach (Koehn et al., 2003).
The non-terminal embedded phrases are learned from a bilingual corpus without a linguistically motivated syntactic structure.
Based on hierarchical phrase-based modeling, we adopted the left-to-right target generation method (Watanabe et al., 2006b).
This method is able to generate translations efficiently, first, by simplifying the grammar so that the target side takes a phrase-prefixed form, namely a target normalized form.
Second, a translation is generated in a left-to-right manner, similar to the phrase-based approach using Earley-style top-down parsing on the source side.
Coupled with the target normalized form, n-gram language models are efficiently integrated during the search even with a higher order of n.
2.2 Target Normalized Form
In Chiang (2005), each production rule is restricted to a rank-2 or binarized form in which each rule contains at most two non-terminals.
The target normalized form (Watanabe et al., 2006b) further imposes a constraint whereby the target side of the aligned right-hand side is restricted to a Greibach Normal Form-like structure:

X → ⟨γ, b β, ∼⟩

where X is a non-terminal, γ is a source side string of arbitrary terminals and/or non-terminals, and b β is the corresponding target side, in which b is a string of terminals, or a phrase, and β is a (possibly empty) string of non-terminals. ∼ defines a one-to-one mapping between the non-terminals in γ and β. The use of the phrase b as a prefix maintains the strength of the phrase-based framework.
A contiguous English side with a (possibly) discontiguous foreign language side preserves phrase-bounded local word reordering.
At the same time, the target normalized framework still combines phrases hierarchically in a restricted manner.
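The target normalized constraint can be sketched as a simple check over a rule's target side; the "X"-prefixed non-terminal convention (and allowing an empty phrase prefix, for simplicity) are assumptions of this sketch, not the paper's implementation.

```python
# A sketch of the target normalized form: the target side must be a terminal
# phrase b (possibly empty here, for simplicity) followed only by
# non-terminals beta. Non-terminals are marked with an "X" prefix.

def is_nonterminal(sym):
    return sym.startswith("X")

def is_target_normalized(target_side):
    """True if target_side matches b beta: terminals first, then non-terminals."""
    seen_nonterminal = False
    for sym in target_side:
        if is_nonterminal(sym):
            seen_nonterminal = True
        elif seen_nonterminal:      # a terminal after a non-terminal breaks the form
            return False
    return True

print(is_target_normalized(["the", "accord", "X1"]))   # phrase prefix, then X1
print(is_target_normalized(["X1", "of", "X2"]))        # a hole before terminals
```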
2.3 Left-to-Right Target Generation
Decoding is performed by parsing on the source side and by combining the projected target side.
We applied an Earley-style top-down parsing approach (Wu and Wong, 1998; Watanabe et al., 2006b; Zollmann and Venugopal, 2006).
The basic idea is to perform top-down parsing so that the projected target side is generated in a left-to-right manner.
The search is guided with a push-down automaton, which keeps track of the span of uncovered source
word positions.
Combined with the rest-cost estimation aggregated in a bottom-up way, our decoder efficiently searches for the most likely translation.
The use of a target normalized form further simplifies the decoding procedure.
Since the rule form does not allow any holes for the target side, the integration with an n-gram language model is straightforward: the prefixed phrases are simply concatenated and intersected with n-gram.
3 Features
3.1 Baseline Features
The hierarchical phrase-based translation system employs standard numeric value features:
• n-gram language model to capture the fluency of the target side.
• Hierarchical phrase translation probabilities in both directions, h(γ|bβ) and h(bβ|γ), estimated by relative counts count(γ, bβ).
• Word-based lexically weighted models h_lex(γ|bβ) and h_lex(bβ|γ) using lexical translation models.
• Word-based insertion/deletion penalties that penalize through the low probabilities of the lexical translation models (Bender et al., 2004).
• Word/hierarchical-phrase length penalties.
• Backtrack-based penalties inspired by the distortion penalties in phrase-based modeling
(Watanabe et al., 2006b).
3.2 Sparse Features
In addition to the baseline features, a large number of binary features are integrated in our MT system.
We may use any binary features, such as

h_k(f, e) = 1 if the English word "violate" and the Arabic word "tnthk" appear in e and f, and 0 otherwise.
The features are designed by considering the decoding efficiency and are based on the word alignment structure preserved in hierarchical phrase translation pairs (Zens and Ney, 2006).
When hierarchical phrases are extracted, the word alignment is preserved.
If multiple word alignments are observed
Figure 1: An example of sparse features for a phrase translation.
with the same source and target sides, only the frequently observed word alignment is kept to reduce the grammar size.
Word pair features reflect the word correspondence in a hierarchical phrase.
Figure 1 illustrates an example of sparse features for a phrase translation pair f_j^{j+2} and e_i^{i+3}.[1]
From the word alignment encoded in this phrase, we can extract the word pair features (e_i, f_{j+1}), (e_{i+2}, f_{j+2}) and (e_{i+3}, f_j).
The bigrams of word pairs are also used to capture the contextual dependency.
We assume that the word pairs follow the target side ordering.
For instance, we define ((e_{i-1}, f_{j-1}), (e_i, f_{j+1})), ((e_i, f_{j+1}), (e_{i+2}, f_{j+2})) and ((e_{i+2}, f_{j+2}), (e_{i+3}, f_j)), indicated by the arrows in Figure 1.
Extracting bigram word pair features following the target side ordering implies that the corresponding source side is reordered according to the target side.
The reordering of hierarchical phrases is represented by using contextually dependent word pairs across their boundaries, as with the feature ((e_{i-1}, f_{j-1}), (e_i, f_{j+1})) in Figure 1.
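A minimal sketch of the word pair and bigram word pair features described above, assuming alignment links are given as (target index, source index) pairs; the feature naming scheme is invented here for illustration.

```python
# A sketch of word pair and bigram word pair features inside an aligned
# phrase pair. Alignment links are (target_index, source_index) pairs and
# feature names are illustrative only.

def word_pair_features(src, tgt, alignment):
    feats = []
    for i, j in alignment:                    # one feature per alignment link
        feats.append(("wp", tgt[i], src[j]))
    # bigrams of word pairs, following the target side ordering
    ordered = sorted(alignment)               # sort links by target position
    for (i1, j1), (i2, j2) in zip(ordered, ordered[1:]):
        feats.append(("wp2", tgt[i1], src[j1], tgt[i2], src[j2]))
    return feats

src = ["f0", "f1", "f2"]
tgt = ["e0", "e1", "e2"]
alignment = [(0, 1), (1, 2), (2, 0)]          # e0-f1, e1-f2, e2-f0
for feat in word_pair_features(src, tgt, alignment):
    print(feat)
```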
The above features are insufficient to capture the translation because spurious words are sometimes inserted in the target side.
Therefore, insertion features are integrated in which no word alignment is associated in the target.
The inserted words are associated with all the words in the source sentence, such as (e_{i+1}, f_1), ..., (e_{i+1}, f_J) for the non-aligned word e_{i+1} with the source sentence f_1^J in Figure 1.[1] In the same way, we could include deletion features in which a non-aligned source word is associated with the target sentence. However, this would lead to complex decoding in which all the translated words are memorized for each hypothesis, and these features are thus not integrated in our feature set.

[1] For simplicity, we show an example of phrase translation pairs, but it is trivial to define the features over hierarchical phrases.

Figure 2: Example hierarchical features.
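The insertion features can be sketched similarly; the link representation and feature names below are assumptions of this sketch, not the paper's implementation.

```python
# A sketch of insertion features: each target word with no alignment link
# is paired with every word of the source sentence.

def insertion_features(src_sentence, tgt, alignment):
    aligned_tgt = {i for i, _ in alignment}   # target positions with a link
    feats = []
    for i, e in enumerate(tgt):
        if i not in aligned_tgt:              # spurious (inserted) target word
            for f in src_sentence:
                feats.append(("ins", e, f))
    return feats

src_sentence = ["f0", "f1", "f2", "f3"]
tgt = ["e0", "e1"]
alignment = [(0, 2)]                          # e1 has no alignment link
print(insertion_features(src_sentence, tgt, alignment))
```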
Target side bigram features are also included to directly capture the fluency as in the n-gram language model (Roark et al., 2004).
For instance, the bigram features (e_{i-1}, e_i), (e_i, e_{i+1}), (e_{i+1}, e_{i+2}), ... are observed in Figure 1.
In addition to the phrase motivated features, we included features inspired by the hierarchical structure.
Figure 2 shows an example of hierarchical phrases in the source side.
Hierarchical features capture the dependency of the source words in a parent phrase on the source words in its child phrases, such as (f_{j-1}, f_j), (f_{j-1}, f_{j+1}), (f_{j+3}, f_j), (f_{j+3}, f_{j+1}), (f_j, f_{j+2}) and (f_{j+1}, f_{j+2}), as indicated by the arrows in Figure 2.
The hierarchical features are extracted only for those source words that are aligned with the target side to limit the feature size.
3.3 Normalization
In order to achieve better generalization capability, the following normalized tokens are introduced for each surface form:

• Word class or POS.
• 4-letter prefix and suffix: "violate" is normalized to "viol+" and "+late" by taking the prefix and suffix, respectively.
• Digit sequences, with each digit replaced by "@", such as "@@@@/@/@@".

We consider all possible combinations of those token types. For example, the word pair feature (violate, tnthk) is normalized and expanded to (viol+, tnthk), (viol+, tnth+), (violate, tnth+), etc. using the 4-letter prefix token type.

Algorithm 1: Online Training Algorithm
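The normalization and expansion step might look as follows; the word-class map, the "@" masking, and the feature names are illustrative stand-ins for the automatically induced classes and the real feature templates.

```python
# A sketch of the token normalization in Section 3.3: each surface form is
# expanded into normalized variants (4-letter prefix/suffix, digit masking,
# word class), and a word pair feature is expanded over all combinations.
# The word-class map is a stand-in for automatically induced classes.
import itertools
import re

WORD_CLASS = {"violate": "C17", "tnthk": "C42"}   # illustrative classes

def variants(word):
    v = [word]
    if len(word) > 4:
        v.append(word[:4] + "+")                  # 4-letter prefix
        v.append("+" + word[-4:])                 # 4-letter suffix
    masked = re.sub(r"\d", "@", word)
    if masked != word:
        v.append(masked)                          # digit sequences masked
    if word in WORD_CLASS:
        v.append(WORD_CLASS[word])
    return v

def expand_pair(e, f):
    return [("wp", ev, fv) for ev, fv in itertools.product(variants(e), variants(f))]

feats = expand_pair("violate", "tnthk")
print(len(feats))
```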
4 Online Large-Margin Training
Algorithm 1 is our generic online training algorithm.
The algorithm is slightly different from other online training algorithms (Tillmann and Zhang, 2006; Liang et al., 2006) in that we keep and update oracle translations, a set of good translations reachable by a decoder according to a metric, i.e. BLEU (Papineni et al., 2002).
In line 3, a k-best list C_t is generated by best_k(·) using the current weight vector w^i for the training instance (f_t, e_t).
Each training instance has multiple (or possibly one) reference translations e_t for the source sentence f_t. Using the k-best list, the m-best oracle translations O_t are updated by oracle_m(·) at every iteration (line 4).
Usually, a decoder cannot generate translations that exactly match the reference translations due to its beam search pruning and OOV.
Thus, we cannot always assign scores for each reference translation.
Therefore, possible oracle translations are maintained according to an objective function,
i.e. BLEU.
Tillmann and Zhang (2006) avoided the problem by precomputing the oracle translations in advance.
Liang et al. (2006) presented a similar updating strategy in which parameters were updated toward an oracle translation found in C_t, but ignored potentially better translations discovered in past iterations.
A new weight vector w^{i+1} is computed using the k-best list C_t with respect to the oracle translations O_t (line 5).
After N iterations, the algorithm returns an averaged weight vector to avoid overfitting (line 9).
The key to this online training algorithm is the selection of the updating scheme in line 5.
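Algorithm 1 can be sketched as follows, with the decoder, the sentence-level scoring metric, and the update rule passed in as stubs; all signatures here are assumptions of this sketch, not the paper's code.

```python
# A sketch of Algorithm 1: decode each training sentence, refresh the m-best
# oracle set by BLEU across iterations, update the weights, and return the
# averaged weight vector. decode_kbest, bleu and update are caller-supplied.

def online_train(data, decode_kbest, bleu, update, epochs=3, m=10):
    """data: list of (f, refs) pairs.
    decode_kbest(w, f) -> list of (translation, feature_dict).
    bleu(translation, refs) -> score used to rank oracle candidates.
    update(w, oracles, kbest) -> new weight dict (e.g. a MIRA step)."""
    w = {}
    w_sum = {}                                   # running sum for averaging
    oracles = [[] for _ in data]                 # m-best oracles per sentence
    n_updates = 0
    for _ in range(epochs):
        for t, (f, refs) in enumerate(data):
            kbest = decode_kbest(w, f)           # line 3: k-best under current w
            # line 4: merge old oracles with the new k-best, keep m best by BLEU
            merged = dict(oracles[t] + kbest)
            ranked = sorted(merged.items(), key=lambda h: -bleu(h[0], refs))
            oracles[t] = ranked[:m]
            w = update(w, oracles[t], kbest)     # line 5: weight update
            for feat, val in w.items():
                w_sum[feat] = w_sum.get(feat, 0.0) + val
            n_updates += 1
    # line 9: return the averaged weight vector to avoid overfitting
    return {feat: val / n_updates for feat, val in w_sum.items()}
```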
4.1 Margin Infused Relaxed Algorithm
The Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2006) is an online version of the large-margin training algorithm for structured classification (Taskar et al., 2004) that has been successfully used for dependency parsing (McDonald et al., 2005) and joint-labeling/chunking (Shimizu and Haas, 2006).
The basic idea is to keep the norm of the updates to the weight vector as small as possible, considering a margin at least as large as the loss of the incorrect classification.
Line 5, the weight vector update procedure in Algorithm 1, is replaced by the solution of:

w^{i+1} = argmin_w ||w − w^i|| + C Σ_{e*, e'} ξ(e*, e')   (3)

subject to

s(f_t, e*) − s(f_t, e') + ξ(e*, e') ≥ L(e*, e'; e_t)
ξ(e*, e') ≥ 0
for all e* ∈ O_t and e' ∈ C_t,

where s(f_t, e) = wᵀ · h(f_t, e), ξ(·) is a non-negative slack variable, and C > 0 is a constant to control the influence on the objective function.
A larger C implies larger updates to the weight vector.
L(·) is a loss function, for instance a difference of BLEU scores, that measures the difference between e* and e' according to the reference translations e_t.
In this update, a margin is created for each correct and incorrect translation at least as large as the loss of the incorrect translation.
A larger error means a larger distance between the scores of the correct and incorrect translations.
Following McDonald et al. (2005), only k-best translations are used to form the margins in order to reduce the number of constraints in Eq. 3.
In the translation task, multiple translations are acceptable.
Thus, margins for the m-oracle translations are created, which amount to m × k large-margin constraints.
In this online training, only active features constrained by Eq. 3 are kept and updated, unlike offline training, in which all possible features have to be extracted and selected in advance.
The Lagrange dual form of Eq. 3 is:

max_{α(·,·) ≥ 0} − (1/2) || Σ_{e*, e'} α(e*, e') (h(f_t, e*) − h(f_t, e')) ||²
                 + Σ_{e*, e'} α(e*, e') L(e*, e'; e_t)
subject to Σ_{e*, e'} α(e*, e') ≤ C   (4)

with the weight vector update:

w^{i+1} = w^i + Σ_{e*, e'} α(e*, e') (h(f_t, e*) − h(f_t, e'))

Equation 4 is solved using a QP-solver, such as a coordinate ascent algorithm, by heuristically selecting (e*, e') and updating α(e*, e') iteratively:

α(e*, e') ← max(0, min(C, α(e*, e') + [L(e*, e'; e_t) − (s(f_t, e*) − s(f_t, e'))] / ||h(f_t, e*) − h(f_t, e')||²))

C is used to clip the amount of the updates.
A single oracle with 1-best translation is analytically solved without a QP-solver and is represented as the following perceptron-like update (Shimizu and Haas, 2006):

w^{i+1} = w^i + α (h(f_t, e*) − h(f_t, e'))
α = max(0, min(C, [L(e*, e'; e_t) − (s(f_t, e*) − s(f_t, e'))] / ||h(f_t, e*) − h(f_t, e')||²))
Intuitively, the update amount is controlled by the margin and the loss between the correct and incorrect translations and by the closeness of two translations in terms of feature vectors.
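The clipped single-oracle update can be sketched with sparse weight dicts; this is an illustration of the clipped step described above, not the paper's exact implementation.

```python
# A sketch of the clipped, perceptron-like update for a single oracle e* and
# the 1-best hypothesis e'. Weight and feature vectors are sparse dicts.

def mira_update(w, h_oracle, h_best, loss, C=0.01):
    """One step: w += alpha * (h(f,e*) - h(f,e')), where alpha is the
    loss-minus-margin step, normalized and clipped into [0, C]."""
    # feature difference h(f, e*) - h(f, e')
    diff = dict(h_oracle)
    for feat, val in h_best.items():
        diff[feat] = diff.get(feat, 0.0) - val
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return dict(w)                        # identical features: no update
    margin = sum(w.get(f, 0.0) * v for f, v in diff.items())   # s(e*) - s(e')
    alpha = max(0.0, min(C, (loss - margin) / norm_sq))        # clipped step
    new_w = dict(w)
    for feat, val in diff.items():
        new_w[feat] = new_w.get(feat, 0.0) + alpha * val
    return new_w
```

Once the margin between the oracle and the 1-best reaches the loss, the step size becomes zero and the weights stay put, which matches the intuition above.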
Indeed, Liang et al. (2006) employed an averaged perceptron algorithm in which the update amount α was always set to one.
Tillmann and Zhang (2006) used a different update style based on a convex loss function, with a learning rate η > 0 controlling the convergence.

Table 1: Experimental results obtained by varying normalized tokens used with surface form (rows: surface form; w/ prefix/suffix; w/ word class; w/ digits; all token types).
4.2 Approximated BLEU

The BLEU score (Papineni et al., 2002) is computed as:

BLEU(E; Ē) = exp( (1/4) Σ_{n=1}^{4} log p_n(E; Ē) ) · BP(E; Ē)

where p_n(·) is the n-gram precision of the hypothesized translations E given the reference translations Ē, and BP(·) is a brevity penalty. BLEU is computed for a set of sentences, not for a single sentence.
Our algorithm requires frequent updates on the weight vector, which implies higher cost in computing the document-wise BLEU.
Tillmann and Zhang (2006) and Liang et al. (2006) solved the problem by introducing a sentence-wise BLEU.
However, the use of the sentence-wise scoring does not translate directly into the document-wise score because of the n-gram precision statistics and the brevity penalty statistics aggregated for a sentence set.
Thus, we use an approximated BLEU score that basically computes BLEU for a sentence set, but accumulates the difference for a particular sentence
(Watanabe et al., 2006a).
The approximated BLEU is computed as follows: Given the oracle translations O for the training set, we maintain the best oracle translations Ô_T = {ê_1, ..., ê_T}.
The approximated BLEU for a hypothesized translation e' of the training instance (f_t, e_t) is computed over Ô_T, except that ê_t is replaced by e':

BLEU({ê_1, ..., ê_{t-1}, e', ê_{t+1}, ..., ê_T}; Ē)
The loss computed by the approximated BLEU measures the document-wise loss of substituting the correct translation ê_t with an incorrect translation e'.
The score can be regarded as a normalization which scales a sentence-wise score into a document-wise score.
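A sketch of the approximated BLEU computation: corpus-level BLEU is taken over the maintained oracle set with the t-th entry substituted by the hypothesis. The simple unsmoothed corpus BLEU here (up to 4-grams, single reference) is an assumption for illustration, not the paper's exact scorer.

```python
# A sketch of the approximated BLEU loss: corpus BLEU over the best oracle
# translations, with the t-th oracle replaced by the hypothesis under
# consideration. Sentences are token lists.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Unsmoothed corpus-level BLEU with a single reference per sentence."""
    log_prec = 0.0
    hyp_len = sum(len(h) for h in hyps)
    ref_len = sum(len(r) for r in refs)
    for n in range(1, max_n + 1):
        match = total = 0
        for h, r in zip(hyps, refs):
            h_ngr, r_ngr = ngrams(h, n), ngrams(r, n)
            match += sum(min(c, r_ngr[g]) for g, c in h_ngr.items())
            total += max(0, len(h) - n + 1)
        if match == 0:
            return 0.0
        log_prec += math.log(match / total) / max_n
    bp = min(1.0, math.exp(1.0 - ref_len / hyp_len))   # brevity penalty
    return math.exp(log_prec) * bp

def approx_bleu(oracles, refs, t, hyp):
    """Corpus BLEU over the oracle set with the t-th oracle replaced by hyp."""
    substituted = oracles[:t] + [hyp] + oracles[t + 1:]
    return corpus_bleu(substituted, refs)
```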
5 Experiments
We employed our online large-margin training procedure for an Arabic-to-English translation task.
The training data were extracted from the Arabic/English news/UN bilingual corpora supplied by LDC.
The data amount to nearly 3.8M sentences.
The Arabic part of the bilingual data is tokenized by isolating Arabic scripts and punctuation marks.
The development set comes from the MT2003 Arabic-English NIST evaluation test set consisting of 663 sentences in the news domain with four reference translations.
The performance is evaluated by the news domain MT2004/MT2005 test set consisting of 707 and 1,056 sentences, respectively.
The hierarchical phrase translation pairs are extracted in a standard way (Chiang, 2005): First, the bilingual data are word alignment annotated by running GIZA++ (Och and Ney, 2003) in two directions.
Second, the word alignment is refined by a grow-diag-final heuristic (Koehn et al., 2003).
Third, phrase translation pairs are extracted together with hierarchical phrases by considering holes.
In the last step, the hierarchical phrases are constrained so that they follow the target normalized form constraint.
A 5-gram language model is trained on the English side of the bilingual data combined with the English Gigaword from LDC.
First, the use of normalized token types in Section 3.3 is evaluated in Table 1.
In this setting, all the structural features in Section 3.2 are used, but differentiated by the normalized tokens combined with surface forms.
Our online large-margin training algorithm performed 50 iterations constrained by 10-oracle and a 10-best list.

Table 2: Experimental results obtained by incrementally adding structural features (rows: word pairs; + target bigram; + insertion; + hierarchical).

Table 3: Experimental results for varying k-best and m-oracle translations (rows: 1-oracle; 10-oracle; sentence-BLEU; with the number of active features).
When decoding, a 1000-best list is generated to achieve better oracle translations.
The training took nearly 1 day using 8 cores of Opteron.
The translation quality is evaluated by case-sensitive NIST (Doddington, 2002) and BLEU (Papineni et al., 2002).
The table also shows the number of active features, i.e. those assigned nonzero weights.
The addition of prefix/suffix tokens greatly increased the number of active features.
The setting severely overfit to the development data, and therefore resulted in worse results in open tests.
The word class[3] with the surface form avoided the overfitting problem.
The digit sequence normalization provides a similar generalization capability despite the moderate increase in the active feature size.
By including all token types, we achieved better NIST/BLEU scores for the 2004 and 2005 test sets.
This set of experiments indicates that token normalization is especially useful when training on a small data set.
Second, we used all the normalized token types, but incrementally added structural features in Table 2.
Target bigram features account for only the fluency of the target side without considering the source/target correspondence.
Therefore, the inclusion of target bigram features clearly overfit to the development data.

[3] We induced 50 classes each for English and Arabic.
The problem is resolved by adding insertion features which can take into account an agreement with the source side that is not directly captured by word pair features.
Hierarchical features are somewhat effective in the 2005 test set by considering the dependency structure of the source side.
Finally, we compared our online training algorithm with sparse features with a baseline system in Table 3.
The baseline hierarchical phrase-based system is trained using standard max-BLEU training (MERT) without sparse features (Och, 2003).
Table 3 shows the results obtained by varying the m-oracle and k-best sizes (k, m = 1, 10) using all structural features and all token types.
We also experimented with sentence-wise BLEU as an objective function, constrained by 10-oracle and a 10-best list.
Even the 1-oracle 1-best configuration achieved significant improvements over the baseline system.
The use of a larger k-best list further optimizes to the development set, but at the cost of degraded translation quality in the 2004 test set.
The larger m-oracle size seems to be harmful if coupled with the 1-best list.
As indicated by the reduced active feature size, 1-best translation seems to be updated toward worse translations in 10-oracles that are "close" in terms of features.
We achieved significant improvements when the k-best list size was also increased.

Table 4: Two-fold cross validation experiments (closed test and open test; NIST and BLEU; rows include the baseline).
The use of sentence-wise BLEU as an objective provides almost no improvement in the 2005 test set, but is comparable for the 2004 test set.
We performed a two-fold cross validation by splitting the data into two sets to observe the effect of optimization, as shown in Table 4.[4]
The MERT baseline system performed similarly both in closed and open tests.
Our online large-margin training with 10-oracle and 10-best constraints and the approximated BLEU loss function significantly outperformed the baseline system in the open test.
The development data is almost doubled in this setting.
The MERT approach seems to be confused by the slightly larger data and the mixed domains from different epochs.
6 Discussion
In this work, a translation model consisting of millions of features is successfully integrated.
In order to avoid poor overfitting, features are limited to word-based features, but are designed to reflect the structures inside hierarchical phrases.
One of the benefits of MIRA is its flexibility.
We may include as many constraints as possible, like m-oracle constraints in our experiments.
Although we described experiments on the hierarchical phrase-based translation, the online training algorithm is applicable to any translation systems, such as phrase-based translations and syntax-based translations.
Online discriminative training has already been studied by Tillmann and Zhang (2006) and Liang et al. (2006).
In their approach, training was performed on a large corpus using sparse features such as phrase translation pairs, target n-grams and/or bag-of-word pairs inside phrases.
In Tillmann and Zhang (2006), k-best list generation is approximated by a step-by-step one-best merging method that separates the decoding and training steps.

[4] We split data by document, not by sentence.
The weight vector update scheme is very similar to MIRA but based on a convex loss function.
Our method directly employs the k-best list generated by the fast decoding method (Watanabe et al., 2006b) at every iteration.
One of the benefits is that we avoid the rather expensive cost of merging the k-best list especially when handling millions of features.
Liang et al. (2006) employed an averaged perceptron algorithm.
They decoded each training instance and performed a perceptron update to the weight vector.
An incorrect translation was updated toward an oracle translation found in a k-best list, but discarded potentially better translations in the past iterations.
An experiment has been undertaken using a small development set together with sparse features for the reranking of a k-best translation (Watanabe et al., 2006a).
They relied on a variant of a voted perceptron, and achieved significant improvements.
However, their work was limited to reranking, thus the improvement was relative to the performance of the baseline system, whether or not there was a good translation in a list.
In our work, the sparse features are directly integrated into the DP-based search.
The design of the sparse features was inspired by Zens and Ney (2006).
They exploited the word alignment structure inside the phrase translation pairs for discriminatively training a reordering model in their phrase-based translation.
The reordering model simply classifies whether to perform monotone decoding or not.
The trained model is treated as a single feature function integrated into the log-linear model of Section 2.
Our approach differs in that each sparse feature is individually integrated into the model.
7 Conclusion
We exploited a large number of binary features for statistical machine translation.
The model was trained on a small development set.
The optimization was carried out by MIRA, which is an online version of the large-margin training algorithm.
Millions of sparse features are intuitively considered prone to overfitting, especially when trained on a small development set.
However, our algorithm with
millions of features achieved very significant improvements over a conventional method with a small number of features.
This result indicates that we can easily experiment with many alternative features even with a small data set, and we believe that our approach can scale well to a larger data set for further improved performance.
Future work involves scaling up to larger data and more features.
Acknowledgements
We would like to thank the reviewers and our colleagues for useful comments and discussion.
