Word alignment is the problem of annotating parallel text with translational correspondence.
Previous generative word alignment models have made structural assumptions such as the 1-to-1, 1-to-N, or phrase-based consecutive word assumptions, while previous discriminative models have either made such an assumption directly or used features derived from a generative model making one of these assumptions.
We present a new generative alignment model which avoids these structural limitations, and show that it is effective when trained using both unsupervised and semi-supervised training methods.
1 Introduction
Several generative models and a large number of discriminatively trained models have been proposed in the literature to solve the problem of automatic word alignment of bitexts.
The generative proposals have required unrealistic assumptions about the structure of the word alignments.
Two assumptions are particularly common.
The first is the 1-to-N assumption, meaning that each source word generates zero or more target words, which requires heuristic techniques in order to obtain alignments suitable for training an SMT system.
The second is the consecutive word-based "phrasal SMT" assumption.
This does not allow gaps, which can be used to particular advantage by SMT models which model hierarchical structure.
Previous discriminative models have either made such assumptions directly or used features from a generative model making such an assumption.
Our objective is to automatically produce alignments which can be used to build high quality machine translation systems.
These are presumably close to the alignments that trained bilingual speakers produce.
Human annotated alignments often contain M-to-N alignments, where several source words are aligned to several target words and the resulting unit cannot be further decomposed.
Source or target words in a single unit are sometimes non-consecutive.
In this paper, we describe a new generative model which directly models M-to-N non-consecutive word alignments.
The rest of the paper is organized as follows.
The generative story is presented, followed by the mathematical formulation.
Details of the unsupervised training procedure are described.
The generative model is then decomposed into feature functions used in a log-linear model which is trained using a semi-supervised algorithm.
Experiments show improvements in word alignment accuracy, and using the generated alignments in hierarchical and phrasal SMT systems results in an increased BLEU score.
Previous work is discussed and this is followed by the conclusion.
2 LEAF: a generative word alignment model
We introduce a new generative story which enables the capture of non-consecutive M-to-N alignment structure.
We have attempted to use the same labels as the generative story for Model 4 (Brown et al., 1993), which we are extending.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 51-60, Prague, June 2007.
©2007 Association for Computational Linguistics
Our generative story describes the stochastic generation of a target string f (sometimes referred to as the French string, or foreign string) from a source string e (sometimes referred to as the English string), consisting of l words.
The variable m is the length of f. We generally use the index i to refer to source words (e_i is the English word at position i), and j to refer to target words.
Our generative story makes the distinction between different types of source words.
There are head words, non-head words, and deleted words.
Similarly, for target words, there are head words, non-head words, and spurious words.
A head word is linked to zero or more non-head words; each non-head word is linked to from exactly one head word.
The purpose of head words is to try to provide a robust representation of the semantic features necessary to determine translational correspondence.
This is similar to the use of syntactic head words in statistical parsers to provide a robust representation of the syntactic features of a parse sub-tree.
A minimal translational correspondence consists of a linkage between a source head word and a target head word (and by implication, the non-head words linked to them).
Deleted source words are not involved in a minimal translational correspondence, as they were "deleted" by the translation process.
Spurious target words are also not involved in a minimal translational correspondence, as they spontaneously appeared during the generation of other target words.
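The word types and links described above can be pictured as a small data structure. The following is a minimal sketch, not the implementation used in this work: the names `cepts`, `minimal_correspondences`, and `UNLINKED` (standing for a deleted source word or a spurious target word) are hypothetical, positions are 0-based, and the toy sentence pair and its links are assumptions chosen only for illustration.

```python
HEAD, NON_HEAD, UNLINKED = "head", "non-head", "unlinked"


def cepts(word_types, head_of):
    """Group positions into cepts: {head position: sorted member positions}."""
    groups = {i: [i] for i, t in enumerate(word_types) if t == HEAD}
    for i, t in enumerate(word_types):
        if t == NON_HEAD:
            head = head_of[i]  # each non-head word is linked from exactly one head
            if head not in groups:
                raise ValueError(f"position {i} is linked to a word that is not a head")
            groups[head].append(i)
    return {h: sorted(members) for h, members in groups.items()}


def minimal_correspondences(src_types, src_head_of, tgt_types, tgt_head_of, head_links):
    """Pair each source cept with the target cept its head word is linked to.

    head_links maps a source head position to a target head position.
    """
    src = cepts(src_types, src_head_of)
    tgt = cepts(tgt_types, tgt_head_of)
    return [(src[i], tgt[j]) for i, j in sorted(head_links.items())]


# Toy example (assumed links): "they do not want" / "ils ne desirent pas"
src_types = [HEAD, NON_HEAD, HEAD, HEAD]   # "do" is a non-head word of "not"
tgt_types = [HEAD, NON_HEAD, HEAD, HEAD]   # "ne" is a non-head word of "pas"
units = minimal_correspondences(
    src_types, {1: 2}, tgt_types, {1: 3}, {0: 0, 2: 3, 3: 2}
)
print(units)
```

In this toy case the unit pairing source positions [1, 2] with target positions [1, 3] is a non-consecutive M-to-N correspondence of the kind the model is meant to capture.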
Figure 1 shows a simple example of the stochastic generation of a French sentence from an English sentence, annotated with the step number in the generative story.
1. Choose the source word type.
2. Choose the identity of the head word for each non-head word.
3. Choose the identity of the generated target head word for each source head word.
4. Choose the number of words in a target cept conditioned on the identity of the source head word and the source cept size (γ_i is 1 if the cept size is 1, and 2 if the cept size is greater).
5. Choose the number of spurious words: choose ψ_0 according to the distribution s_0(ψ_0 | Σ_i ψ_i).
6. Choose the identity of the spurious words.
7. Choose the identity of the target non-head words linked to each target head word.
8. Choose the position of the target head and non-head words. If any position was chosen twice, return "failure".
9. Choose the position of the spuriously generated words.
Figure 1: Generative story example, (number) indicates step number [figure not reproduced; the English source sentence is "absolutely [comma] they do not want to spend that money"]
We note that the steps which return "failure" are required because the model is deficient.
Deficiency means that a portion of the probability mass in the model is allocated towards generative stories which would result in infeasible alignment structures.
Our model has deficiency in the non-spurious target word placement, just as Model 4 does.
It has additional deficiency in the source word linking decisions.
(Och and Ney, 2003) presented results suggesting that the additional parameters required to ensure that a model is not deficient result in inferior performance, but we plan to study whether this is the case for our generative model in future work.
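To make the notion of a "failure" concrete, the placement check can be sketched as follows. This is an illustrative sketch only: the function name `place_words` and the representation of a generative story as a list of (word, position) choices are assumptions, and the sketch covers only the placement collision, not the source word linking decisions.

```python
def place_words(chosen_positions):
    """Place words at their chosen target positions.

    chosen_positions is a list of (word, position) pairs drawn one at a time
    without excluding positions that are already occupied. Returns a
    {position: word} map, or None ("failure") if a position is chosen twice.
    """
    placed = {}
    for word, position in chosen_positions:
        if position in placed:
            return None  # infeasible structure: probability mass is wasted here
        placed[position] = word
    return placed


print(place_words([("ils", 0), ("ne", 1), ("desirent", 2)]))
print(place_words([("ils", 0), ("ne", 0)]))
```

The second call returns None: the model still assigns probability to that sequence of choices, which is exactly the mass lost to deficiency.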
Given e, f and a candidate alignment a, which represents both the links between source and target head-words and the head-word connections of the non-head words, we would like to calculate p(f, a | e).
The formula for this is:
δ(i, i′) is the Kronecker delta function which is equal to 1 if i = i′ and 0 otherwise.
ρ_i is the position of the closest English head word to the left of the word at i or 0 if there is no such word.
class_e(e_i) is the word class of the English word at position i, class_f(f_j) is the word class of the French word at position j, class_h(f_j) is the word class of the French head word at position j.
p_0 and p_1 are parameters describing the probability of not generating and of generating a target spurious word from each non-spurious target word, p_0 + p_1 = 1.
The alignment structure used in many other models can be modeled using special cases of this framework.
We can express the 1-to-N structure of models like Model 4 by disallowing χ_i = −1 (non-head source words), while for 1-to-1 structure we both disallow χ_i = −1 and deterministically set ψ_i = χ_i (the target cept size equals the source word type). We can also specialize our generative story to the consecutive word M-to-N alignments used in "phrase-based" models, though in this case the conditioning of the generation decisions would be quite different.
This involves adding checks on source and target connection geometry to the generative story which, if violated, would return "failure"; naturally this is at the cost of additional deficiency.
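The 1-to-N and 1-to-1 special cases amount to simple constraints on the source word types and target cept sizes. A minimal sketch, assuming the source word type is encoded as −1 (non-head), 0 (deleted) or 1 (head) and that the target cept size of each source word is given as a parallel list; the function names and this list representation are hypothetical.

```python
NON_HEAD, DELETED, HEAD = -1, 0, 1


def is_one_to_n(word_types):
    """1-to-N structure: no source word may be a non-head word."""
    return all(t != NON_HEAD for t in word_types)


def is_one_to_one(word_types, cept_sizes):
    """1-to-1 structure: additionally, each target cept size equals the
    source word type (1 for a head word, 0 for a deleted word)."""
    return is_one_to_n(word_types) and all(
        size == t for t, size in zip(word_types, cept_sizes)
    )


print(is_one_to_n([HEAD, DELETED, HEAD]))
print(is_one_to_one([HEAD, DELETED, HEAD], [2, 0, 1]))
```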
2.2 Unsupervised Parameter Estimation
We can perform maximum likelihood estimation of the parameters of this model in a similar fashion
to that of Model 4 (Brown et al., 1993), described thoroughly in (Och and Ney, 2003).
We use Viterbi training (Brown et al., 1993) but neighborhood estimation (Al-Onaizan et al., 1999; Och and Ney, 2003) or "pegging" (Brown et al., 1993) could also be used.
To initialize the parameters of the generative model for the first iteration, we use bootstrapping from a 1-to-N and an M-to-1 alignment.
We use the intersection of the 1-to-N and M-to-1 alignments to establish the head word relationship, the 1-to-N alignment to delineate the target word cepts, and the M-to-1 alignment to delineate the source word cepts.
In bootstrapping, a problem arises when we encounter infeasible alignment structure where, for instance, a source word generates target words but no link between any of the target words and the source word appears in the intersection, so it is not clear which target word is the target head word.
To address this, we consider each of the N generated target words as the target head word in turn and assign this configuration 1/N of the counts.
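The treatment of target head words during bootstrapping can be sketched as follows. This is a simplified illustration rather than the actual training code: alignments are assumed to be sets of (source position, target position) links, the function name `bootstrap_target_heads` is hypothetical, and only the target head word decision is shown.

```python
from collections import defaultdict


def bootstrap_target_heads(one_to_n, m_to_one):
    """Return {source position: {target position: fractional count}}.

    If a link for the source word survives in the intersection, that target
    word is the head. Otherwise each of the N generated target words is
    considered as the head in turn, with 1/N of the counts.
    """
    intersection = one_to_n & m_to_one
    generated = defaultdict(list)
    for i, j in sorted(one_to_n):
        generated[i].append(j)  # target words generated by source word i

    counts = {}
    for i, targets in generated.items():
        heads = [j for j in targets if (i, j) in intersection]
        candidates = heads if heads else targets
        counts[i] = {j: 1.0 / len(candidates) for j in candidates}
    return counts


one_to_n = {(0, 0), (1, 1), (1, 2)}   # source word 1 generates targets 1 and 2
m_to_one = {(0, 0), (1, 3)}           # but is linked to target 3 in the other direction
head_counts = bootstrap_target_heads(one_to_n, m_to_one)
print(head_counts)
```

Here source word 0 keeps its intersection link with a full count, while source word 1 has no link in the intersection, so each of its two generated target words receives half of the counts.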
For each iteration of training we search for the Viterbi solution for millions of sentences.
Evidence that inference over the space of all possible alignments is intractable has been presented, for a similar problem, in (Knight, 1999).
Unlike phrase-based SMT, left-to-right hypothesis extension using a beam decoder is unlikely to be effective because in word alignment reordering is not limited to a small local window and so the necessary beam would be very large.
We are not aware of admissible or inadmissible search heuristics which have been shown to be effective when used in conjunction with a search algorithm similar to A* search for a model predicting over a structure like ours.
Therefore we use a simple local search algorithm which operates on complete hypotheses.
(Brown et al., 1993) defined two local search operations for their 1-to-N alignment models 3, 4 and 5.
All alignments which are reachable via these operations from the starting alignment are considered.
One operation is to change the generation decision for a French word to a different English word (move), and the other is to swap the generation decision for two French words (swap).
All possible operations are tried and the best is chosen.
This is repeated.
The search is terminated when no operation results in an improvement.
(Och and Ney, 2003) discussed efficient implementation.
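The move/swap search for 1-to-N models can be sketched as a plain hill climber. This is a naive illustration, not the efficient implementation referred to above: the alignment is assumed to be a list `a` in which `a[j]` is the English position that generated French word j, and `score` is a hypothetical stand-in for the model probability of a complete alignment.

```python
def neighbors(a, num_source):
    """Yield every alignment one move or one swap away from a."""
    for j in range(len(a)):
        for i in range(num_source):
            if i != a[j]:
                yield a[:j] + [i] + a[j + 1:]   # move: French word j to English word i
    for j1 in range(len(a)):
        for j2 in range(j1 + 1, len(a)):
            if a[j1] != a[j2]:
                b = list(a)
                b[j1], b[j2] = b[j2], b[j1]     # swap: exchange two generation decisions
                yield b


def hill_climb(a, num_source, score):
    """Try all operations, apply the best, repeat until nothing improves."""
    current, current_score = list(a), score(a)
    while True:
        best, best_score = None, current_score
        for candidate in neighbors(current, num_source):
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
        if best is None:
            return current, current_score
        current, current_score = best, best_score


# Toy scoring function (assumption): number of decisions matching a reference.
reference = [1, 0, 2]
toy_score = lambda a: sum(x == y for x, y in zip(a, reference))
result = hill_climb([0, 0, 0], 3, toy_score)
print(result)
```

Because only complete hypotheses are compared, the procedure needs no search heuristic, but it can stop in a local optimum, which is why restarts from different starting alignments are a natural companion.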
In our model, because the alignment structure is richer, we define the following operations: move French non-head word to new head, move English non-head word to new head, swap heads of two French non-head words, swap heads of two English non-head words, swap English head word links of two French head words, link English word to French word making new head words, unlink English and French head words.
We use multiple restarts to try to reduce search errors.
(Germann et al., 2004; Marcu and Wong, 2002) have some similar operations without the head word distinction.
3 Semi-supervised parameter estimation
Equation 6 defines a log-linear model.
Each feature function h_m has an associated weight λ_m.
Given a vector of these weights λ, the alignment search problem, i.e. the search to return the best alignment â of the sentences e and f according to the model, is specified by Equation 7.
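The search criterion can be illustrated with a small sketch. It assumes the usual log-linear form, in which the model score is an exponentiated weighted sum of feature function values divided by a normalizer; under that assumption the normalizer does not affect which alignment is best, so the search reduces to maximizing the weighted sum. The names `loglinear_score` and `best_alignment`, and the explicit candidate list, are hypothetical; a real search cannot enumerate all alignments.

```python
def loglinear_score(weights, features):
    """Weighted sum of feature function values (the unnormalized log score)."""
    return sum(w * h for w, h in zip(weights, features))


def best_alignment(candidates, feature_fn, weights):
    """Return the candidate alignment with the highest weighted feature sum."""
    return max(candidates, key=lambda a: loglinear_score(weights, feature_fn(a)))


# Toy feature values (assumed log probabilities) for two candidate alignments.
toy_features = {"a1": [-1.0, -2.0], "a2": [-1.5, -0.5]}
print(best_alignment(["a1", "a2"], toy_features.get, [1.0, 1.0]))
print(best_alignment(["a1", "a2"], toy_features.get, [1.0, 0.0]))
```

Changing the weights changes the winner, which is what the discriminative step exploits when it tunes the weights against gold standard alignments.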
We decompose the new generative model presented in Section 2 in both translation directions to provide the initial feature functions for our log-linear model, features 1 to 10 and 16 to 25 in Table 1. […] 30 in Table 1).
We use the semi-supervised EMD algorithm (Fraser and Marcu, 2006b) to train the model.
The initial M-step bootstraps parameters as described in Section 2.2 from an M-to-1 and a 1-to-N alignment.
We then perform the D-step following (Fraser and Marcu, 2006b).
Figure 2: Two alignments with the same translational correspondence
Given the feature function parameters estimated in the M-step and the feature function weights λ determined in the D-step, the E-step searches for the Viterbi alignment for the full training corpus.
We use 1 − F-Measure as our error criterion.
(Fraser and Marcu, 2006a) established that it is important to tune α (the trade-off between Precision and Recall) to maximize performance.
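For reference, the error criterion can be computed as in the sketch below. It assumes the weighted harmonic-mean form of F-Measure, 1 / (alpha / Precision + (1 − alpha) / Recall), with alignments represented as sets of (source position, target position) links; the function name `f_measure` and the toy link sets are illustrative.

```python
def f_measure(hypothesis, gold, alpha):
    """F-Measure with a Precision/Recall trade-off parameter alpha in [0, 1]."""
    if not hypothesis or not gold:
        return 0.0
    correct = len(hypothesis & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(hypothesis)
    recall = correct / len(gold)
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


hypothesis = {(0, 0), (1, 1), (1, 2)}
gold = {(0, 0), (1, 1)}
balanced = f_measure(hypothesis, gold, 0.5)      # Precision 2/3, Recall 1
recall_heavy = f_measure(hypothesis, gold, 0.1)  # small alpha weights Recall more
error = 1.0 - recall_heavy                       # the quantity minimized in training
print(balanced, recall_heavy, error)
```

With a small alpha the extra, incorrect link is penalized less, so the same hypothesis scores higher than under the balanced setting.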
In working with LEAF, we discovered a methodological problem with our baseline systems, which is that two alignments which have the same translational correspondence can have different F-Measures.
An example is shown in Figure 2.
To overcome this problem we fully interlinked the transitive closure of the undirected bigraph formed by each alignment hypothesized by our baseline alignment systems.¹
This operation maps the alignment shown to the left in Figure 2 to the alignment shown to the right.
This operation does not change the collection of phrases or rules extracted from a hypothesized alignment, see, for instance, (Koehn et al., 2003).
Working with this fully interlinked representation we found that the best settings of α were α = 0.1 for the Arabic/English task and α = 0.4 for the French/English task.
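The full interlinking operation can be sketched with a union-find over the bipartite alignment graph: every connected component is replaced by the complete set of links between its source and target words. This is an illustrative sketch, assuming links are (source position, target position) pairs; the function name `fully_interlink` is hypothetical.

```python
def fully_interlink(links):
    """Fully link each connected component of the alignment graph."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Source and target positions are tagged so that they never collide.
    for i, j in links:
        parent[find(("s", i))] = find(("t", j))

    components = {}
    for i, j in links:
        sources, targets = components.setdefault(find(("s", i)), (set(), set()))
        sources.add(i)
        targets.add(j)

    return {(i, j) for sources, targets in components.values()
            for i in sources for j in targets}


sparse = {(0, 0), (0, 1), (1, 1), (2, 2)}
dense = fully_interlink(sparse)
print(sorted(dense))
```

In the toy input, source words 0 and 1 and target words 0 and 1 form one component, so the missing link (1, 0) is added, while the isolated link (2, 2) is unchanged.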
We perform experiments on two large alignment tasks, for Arabic/English and French/English data sets.
Statistics for these sets are shown in Table 2.
All of the data used is available from the Linguistic Data Consortium except for the French/English gold standard alignments, which are available from the authors.
¹ All of the gold standard alignments were fully interlinked as distributed. We did not modify the gold standard alignments.
t(f_j | e_i)          translation without dependency on word-type
t(f_j | e_i)          translation table from final HMM iteration
s(ψ_i | γ_i)          target cept size without dependency on source head word e
s_0(ψ_0 | Σ_i ψ_i)    number of unaligned target words
t_0(f_j)              identity of unaligned target words
s(ψ_i | e_i)          target cept size without dependency on γ_i
(no symbol)           target spurious word penalty
(same features, other direction)
Table 1: Feature functions [only the entries above are recoverable; feature numbers are lost]
To build all alignment systems, we start with 5 iterations of Model 1 followed by 4 iterations of HMM (Vogel et al., 1996), as implemented in GIZA++ (Och and Ney, 2003).
For all non-LEAF systems, we take the best performing of the "union", "refined" and "intersection" symmetrization heuristics (Och and Ney, 2003) to combine the 1-to-N and M-to-1 directions resulting in an M-to-N alignment.
Because these systems do not output fully linked alignments, we fully link the resulting alignments as described at the end of Section 3.
The reader should recall that this does not change the set of rules or phrases that can be extracted using the alignment.
Our main comparison is between semi-supervised systems, since these are the systems we use to produce alignments for SMT.
We compare semi-supervised LEAF with a previous state of the art semi-supervised system (Fraser and Marcu, 2006b).
We performed translation experiments on the alignments generated using semi-supervised training to verify that the improvements in F-Measure result in increases in BLEU.
We also compare the unsupervised LEAF system with GIZA++ Model 4 to give some idea of the performance of the unsupervised model.
We made an effort to optimize the free parameters of GIZA++, while for unsupervised LEAF there are no free parameters to optimize.
A single iteration of unsupervised LEAF² is compared with heuristic symmetrization of GIZA++'s extension of Model 4 (which was run for four iterations).
LEAF was bootstrapped as described in Section 2.2 from the HMM Viterbi alignments.
Results for the experiments on the French/English data set are shown in Table 3.
We ran GIZA++ for four iterations of Model 4 and used the "refined" heuristic (line 1).
We ran the baseline semi-supervised system for two iterations (line 2), and in contrast with (Fraser and Marcu, 2006b) we found that the best symmetrization heuristic for this system was "union", which is most likely due to our use of fully linked alignments which was discussed at the end of Section 3.
We observe that LEAF unsupervised (line 3) is competitive with GIZA++ (line 1), and is in fact competitive with the baseline semi-supervised result (line 2).
We ran the LEAF semi-supervised system for two iterations (line 4).
The best result is the LEAF semi-supervised system, with a gain of 1.8 F-Measure over the LEAF unsupervised system.
For French/English translation we use a state of the art phrase-based MT system similar to (Och and Ney, 2004; Koehn et al., 2003).
The translation test data is described in Table 2.
We use two trigram language models, one built using the English portion of the training data and the other built using additional English news data.
The BLEU scores reported in this work are calculated using lowercased and tokenized data.
For semi-supervised LEAF the gain of 0.46 BLEU over the semi-supervised baseline is not statistically significant (a gain of 0.78 BLEU would be required), but LEAF semi-supervised compared with GIZA++ is significant, with a gain of 1.23 BLEU.
We note that this shows a large gain in translation quality over that obtained using GIZA++ because BLEU is calculated using only a single reference for the French/English task.
² […] while setting λ_m = 0 for other values of m.
[Table 2: data set statistics. Only the labels survive: Training, Singletons, Align Discr., Align Test, Words, Links, Dev, Trans. Test, Sents, Words.]
Results for the Arabic/English data set are also shown in Table 3.
We used a large gold standard word alignment set available from the LDC.
We ran GIZA++ for four iterations of Model 4 and used the "union" heuristic.
We compare GIZA++ (line 1) with one iteration of the unsupervised LEAF model (line 2).
The unsupervised LEAF system is worse than four iterations of GIZA++ Model 4.
We believe that the features in LEAF are too high dimensional to use for the Arabic/English task without the backoffs available in the semi-supervised models.
The baseline semi-supervised system (line 3) was run for three iterations and the resulting alignments were combined with the "union" heuristic.
We ran the LEAF semi-supervised system for two iterations.
The best result is the LEAF semi-supervised system (line 4), with a gain of 5.4 F-Measure over the baseline semi-supervised system.
For Arabic/English translation we train a state of the art hierarchical model similar to (Chiang, 2005) using our Viterbi alignments.
The translation test data used is described in Table 2.
We use two trigram language models, one built using the English portion of the training data and the other built using additional English news data.
The test set is from the NIST 2005 translation task.
LEAF had the best performance, scoring 1.43 BLEU better than the baseline semi-supervised system, which is statistically significant.
5 Previous Work
The LEAF model is inspired by the literature on generative modeling for statistical word alignment and particularly by Model 4 (Brown et al., 1993).
Much of the additional work on generative modeling of 1-to-N word alignments is based on the HMM model (Vogel et al., 1996).
(Toutanova et al., 2002) and (Lopez and Resnik, 2005) presented a variety of refinements of the HMM model particularly effective for low data conditions.
(Deng and Byrne, 2005) described work on extending the HMM model using a bigram formulation to generate 1-to-N alignment structure.
The common thread connecting these works is their reliance on the 1-to-N approximation, while we have defined a generative model which does not require use of this approximation, at the cost of having to rely on local search.
There has also been work on generative models for other alignment structures.
(Wang and Waibel, 1998) introduced a generative story based on extension of the generative story of Model 4.
The alignment structure modeled was "consecutive M to non-consecutive N".
(Marcu and Wong, 2002) defined the Joint model, which modeled consecutive word M-to-N alignments.
(Matusov et al., 2004) presented a model capable of modeling 1-to-N and M-to-1 alignments (but not arbitrary M-to-N alignments) which was bootstrapped from Model 4.
LEAF directly models non-consecutive M-to-N alignments.
One important aspect of LEAF is its symmetry.
(Och and Ney, 2003) invented heuristic symmetrization of the output of a 1-to-N model and an M-to-1 model resulting in an M-to-N alignment; this was extended in (Koehn et al., 2003).
Table 3: Experimental Results [values not recoverable; surviving labels: French/English, Arabic/English, LEAF unsupervised, LEAF semi-supervised]
We have used insights from these works to help determine the structure of our generative model.
(Zens et al., 2004) introduced a model featuring a symmetrized lexicon.
(Liang et al., 2006) showed how to train two HMM models, a 1-to-N model and an M-to-1 model, to agree in predicting all of the links generated, resulting in a 1-to-1 alignment with occasional rare 1-to-N or M-to-1 links.
We improve on these works by choosing a new structure for our generative model, the head word link structure, which is both symmetric and a robust structure for modeling of non-consecutive M-to-N alignments.
In designing LEAF, we were also inspired by dependency-based alignment models (Wu, 1997; Alshawi et al., 2000; Yamada and Knight, 2001; Cherry and Lin, 2003; Zhang and Gildea, 2004).
In contrast with their approaches, we have a very flat, one-level notion of dependency, which is bilingually motivated and learned automatically from the parallel corpus.
This idea of dependency has some similarity with hierarchical SMT models such as (Chiang, 2005).
The discriminative component of our work is based on a plethora of recent literature.
This literature generally views the discriminative modeling problem as a supervised problem involving the combination of heuristically derived feature functions.
These feature functions generally include the prediction of some type of generative model, such as the HMM model or Model 4.
A discriminatively trained 1-to-N model with feature functions specifically designed for Arabic was presented in (Ittycheriah and Roukos, 2005).
(Lacoste-Julien et al., 2006) created a discriminative model able to model 1-to-1, 1-to-2 and 2-to-1 alignments for which the best results were obtained using features based on symmetric HMMs trained to agree (Liang et al., 2006) and intersected Model 4.
(Ayan and Dorr, 2006) defined a discriminative model which learns how to combine the predictions of several alignment algorithms.
The experiments performed included Model 4 and the HMM extensions of (Lopez and Resnik, 2005).
(Moore et al., 2006) introduced a discriminative model of 1-to-N and M-to-1 alignments, and similarly to (Lacoste-Julien et al., 2006) the best results were obtained using HMMs trained to agree and intersected Model 4.
LEAF is not bound by the structural restrictions present either directly in these models, or in the features derived from the generative models used.
We also iterate the generative/discriminative process, which allows the discriminative predictions to influence the generative model.
Our work is most similar to work using discriminative log-linear models for alignment, which is similar to discriminative log-linear models used for the SMT decoding (translation) problem (Och and […] a log-linear model combining IBM Model 3 trained in both directions with heuristic features which resulted in a 1-to-1 alignment.
(Fraser and Marcu, 2006b) described symmetrized training of a 1-to-N log-linear model and an M-to-1 log-linear model.
These models took advantage of features derived from both training directions, similar to the symmetrized lexicons of (Zens et al., 2004), including features derived from the HMM model and Model 4.
However, despite the symmetric lexicons, these models were only able to optimize the performance of the 1-to-N model and the M-to-1 model separately, and the predictions of the two models required combination with symmetrization heuristics.
We have overcome the limitations of that work by defining new feature functions, based on the LEAF generative model, which score non-consecutive M-to-N alignments so that the final performance criterion can be optimized directly.
6 Conclusion
We have found a new structure over which we can robustly predict, and which directly models translational correspondence in a manner commensurate with how it is used in hierarchical SMT systems.
Our new generative model, LEAF, is able to model alignments which consist of M-to-N non-consecutive translational correspondences.
Unsupervised LEAF is comparable with a strong baseline.
When coupled with a discriminative training procedure, the model leads to increases of between 3 and 9 F-score points in alignment accuracy and between 1.2 and 2.8 BLEU points in translation accuracy over strong French/English and Arabic/English baselines.
7 Acknowledgments
This work was partially supported under the GALE program of the Defense Advanced Research Projects Agency, Contract No. HR0011-06-C-0022.
We would like to thank the USC Center for High Performance Computing and Communications.
