In adding syntax to statistical MT, there is a tradeoff between taking advantage of linguistic analysis, versus allowing the model to exploit linguistically unmotivated mappings learned from parallel training data.
A number of previous efforts have tackled this tradeoff by starting with a commitment to linguistically motivated analyses and then finding appropriate ways to soften that commitment.
We present an approach that explores the tradeoff from the other direction, starting with a context-free translation model learned directly from aligned parallel text, and then adding soft constituent-level constraints based on parses of the source language.
We obtain substantial improvements in performance for translation from Chinese and Arabic to English.
1 Introduction
The statistical revolution in machine translation, beginning with (Brown et al., 1993) in the early 1990s, replaced an earlier era of detailed language analysis with automatic learning of shallow source-target mappings from large parallel corpora.
Over the last several years, however, the pendulum has begun to swing back in the other direction, with researchers exploring a variety of statistical models that take advantage of source- and particularly target-language syntactic analysis (e.g. (Cowan et al., 2006; Zollmann and Venugopal, 2006; Marcu et al., 2006; Galley et al., 2006) and numerous others).
Chiang (2005) distinguishes statistical MT approaches that are "syntactic" in a formal sense, going beyond the finite-state underpinnings of phrase-based models, from approaches that are syntactic in a linguistic sense, i.e. taking advantage of a priori language knowledge in the form of annotations derived from human linguistic analysis or treebanking.1 The two forms of syntactic modeling are doubly dissociable: current research frameworks include systems that are finite state but informed by linguistic annotation prior to training (e.g., (Koehn and Hoang, 2007; Birch et al., 2007; Hassan et al., 2007)), and also include systems employing context-free models trained on parallel text without benefit of any prior linguistic analysis (e.g. (Chiang, 2005; Chiang, 2007; Wu, 1997)).
Over time, however, there has been increasing movement in the direction of systems that are syntactic in both the formal and linguistic senses.
In any such system, there is a natural tension between taking advantage of the linguistic analysis and allowing the model to use linguistically unmotivated mappings learned from parallel training data.
The tradeoff often involves starting with a system that exploits rich linguistic representations and relaxing some part of it.
For example, DeNeefe et al. (2007) begin with a tree-to-string model, using treebank-based target language analysis, and find it useful to modify it in order to accommodate useful "phrasal" chunks that are present in parallel training data but not licensed by linguistically motivated parses of the target language.
Similarly, Cowan et al. (2006) focus on using syntactically rich representations of source and target parse trees, but they resort to phrase-based translation for modifiers within clauses.
1 See (Lopez, to appear) for a comprehensive survey.
Finding the right way to balance linguistic analysis with unconstrained data-driven modeling is clearly a key challenge.
In this paper we address this challenge from a less explored direction.
Rather than starting with a system based on linguistically motivated parse trees, we begin with a model that is syntactic only in the formal sense.
We then introduce soft constraints that take source-language parses into account to a limited extent.
Introducing syntactic constraints in this restricted way allows us to take maximal advantage of what can be learned from parallel training data, while effectively factoring in key aspects of linguistically motivated analysis.
As a result, we obtain substantial improvements in performance for both Chinese-English and Arabic-English translation.
In Section 2, we briefly review the Hiero framework, upon which this work builds, and we discuss Chiang's initial effort to incorporate soft source-language constituency constraints for Chinese-English translation.
In Section 3, we suggest that an insufficiently fine-grained view of constituency constraints was responsible for Chiang's lack of strong results, and introduce finer grained constraints into the model.
Section 4 demonstrates the value of these constraints via substantial improvements in Chinese-English translation performance, and extends the approach to Arabic-English.
Section 5 discusses the results, and Section 6 considers related work.
Finally we conclude in Section 7 with a summary and potential directions for future work.
2 Hierarchical Phrase-based Translation
2.1 Hiero
Hiero (Chiang, 2005; Chiang, 2007) is a hierarchical phrase-based statistical MT framework that generalizes phrase-based models by permitting phrases with gaps.
Formally, Hiero's translation model is a weighted synchronous context-free grammar.
Hiero employs a generalization of the standard non-hierarchical phrase extraction approach in order to acquire the synchronous rules of the grammar directly from word-aligned parallel text.
Rules have the form $X \rightarrow \langle e, f \rangle$, where $e$ and $f$ are phrases containing terminal symbols (words) and possibly co-indexed instances of the nonterminal symbol X.2 Associated with each rule is a set of translation model features, $\phi_i(f, e)$; for example, one intuitively natural feature of a rule is the phrase translation (log-)probability $\phi(f, e) = \log p(e \mid f)$, directly analogous to the corresponding feature in non-hierarchical phrase-based models like Pharaoh (Koehn et al., 2003).
In addition to this phrase translation probability feature, Hiero's feature set includes the inverse phrase translation probability $\log p(f \mid e)$, lexical weights $\mathrm{lexwt}(f \mid e)$ and $\mathrm{lexwt}(e \mid f)$, which are estimates of translation quality based on word-level correspondences (Koehn et al., 2003), and a rule penalty allowing the model to learn a preference for longer or shorter derivations; see (Chiang, 2007) for details.
These features are combined using a log-linear model, with each synchronous rule contributing
$$\sum_i \lambda_i \phi_i(f, e) \qquad (1)$$
to the total log-probability of a derived hypothesis.
Each $\lambda_i$ is a weight associated with feature $\phi_i$, and these weights are typically optimized using minimum error rate training (Och, 2003).
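As a minimal illustration of the log-linear combination described above, the following sketch scores a derivation by summing weighted feature values over the rules it uses. The feature names, weights, and values are hypothetical placeholders chosen for the example; they are not Hiero's actual parameters, and in practice the weights would come from minimum error rate training.

```python
from typing import Dict, List

# Hypothetical feature weights (lambda_i), set by hand for illustration only.
WEIGHTS: Dict[str, float] = {
    "log_p_e_given_f": 1.0,
    "log_p_f_given_e": 0.5,
    "rule_penalty": -0.2,
}


def rule_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of one rule's feature values: sum_i lambda_i * phi_i."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def derivation_score(rules: List[Dict[str, float]], weights: Dict[str, float]) -> float:
    """Each rule used in the derivation contributes its weighted feature sum."""
    return sum(rule_score(rule, weights) for rule in rules)


# Two made-up rule applications forming one derivation.
derivation = [
    {"log_p_e_given_f": -1.0, "log_p_f_given_e": -2.0, "rule_penalty": 1.0},
    {"log_p_e_given_f": -0.5, "log_p_f_given_e": -1.0, "rule_penalty": 1.0},
]
total = derivation_score(derivation, WEIGHTS)
```

A decoder would compare such totals across competing hypotheses; the sketch leaves out the language model and any search procedure.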
2.2 Soft Syntactic Constraints
When looking at Hiero rules, which are acquired automatically by the model from parallel text, it is easy to find many cases that seem to respect linguistically motivated boundaries.
For example, a rule pairing jingtian and a nonterminal X1 on the source side with X1 this year on the target side seems to capture the use of jingtian/this year as a temporal modifier when building linguistic constituents such as noun phrases (the election this year) or verb phrases (voted in the primary this year).
However, it is important to observe that nothing in the Hiero framework actually requires nonterminal symbols to cover linguistically sensible constituents, and in practice they frequently do not.3
2 This is slightly simplified: Chiang's original formulation of Hiero, which we use, has two nonterminal symbols, X and S. The latter is used only in two special "glue" rules that permit complete trees to be constructed via concatenation of subtrees when there is no better way to combine them.
3 For example, this rule could just as well be applied with X1 covering the "phrase" submitted and to produce the non-constituent substring submitted and this year in a hypothesis like The budget was submitted and this year cuts are likely.
Chiang (2005) conjectured that there might be value in allowing the Hiero model to favor hypotheses for which the synchronous derivation respects linguistically motivated source-language constituency boundaries, as identified using a parser.
He tested this conjecture by adding a soft constraint in the form of a "constituency feature": if a synchronous rule $X \rightarrow \langle e, f \rangle$ is used in a derivation, and the span of $f$ is a constituent in the source-language parse, then a term $\lambda_c$ is added to the model score in expression (1).4
Unlike a hard constraint, which would simply prevent the application of rules violating syntactic boundaries, using the feature to introduce a soft constraint allows the model to boost the "goodness" for a rule if it is consistent with the source language constituency analysis, and to leave its score unchanged otherwise.
The weight $\lambda_c$, like all other $\lambda_i$, is set via minimum error rate training, and that optimization process determines empirically the extent to which the constituency feature should be trusted.
Figure 1 illustrates the way the constituency feature worked, treating English as the source language for the sake of readability.
In this example, $\lambda_c$ would be added to the hypothesis score for any rule used in the hypothesis whose source side spanned the minister, a speech, yesterday, gave a speech yesterday, or the minister gave a speech yesterday.
A rule translating, say, minister gave a as a unit would receive no such boost.
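The behavior just described can be sketched in a few lines. This is only an illustration under stated assumptions: spans are half-open token index pairs, the constituent set is the one listed in the text for the example sentence, and the function names and the weight value are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # half-open token span [start, end)

TOKENS = "the minister gave a speech yesterday".split()

# Constituent spans listed in the text for the example sentence.
CONSTITUENTS: Set[Span] = {
    (0, 2),  # the minister
    (3, 5),  # a speech
    (5, 6),  # yesterday
    (2, 6),  # gave a speech yesterday
    (0, 6),  # the minister gave a speech yesterday
}


def constituency_feature(span: Span, constituents: Set[Span]) -> int:
    """Binary feature: 1 if the rule's source span is exactly a constituent."""
    return 1 if span in constituents else 0


def score_with_constituency(base_score: float, span: Span,
                            constituents: Set[Span], lambda_c: float) -> float:
    """Add lambda_c when the feature fires; otherwise leave the score unchanged."""
    return base_score + lambda_c * constituency_feature(span, constituents)


# "gave a speech yesterday" is a constituent, "minister gave a" is not.
boosted = score_with_constituency(-3.0, (2, 6), CONSTITUENTS, lambda_c=0.5)
unchanged = score_with_constituency(-3.0, (1, 4), CONSTITUENTS, lambda_c=0.5)
```

Because the feature is binary and only ever adds a term, a non-matching rule is neither blocked nor penalized, which is what makes the constraint soft.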
Chiang tested the constituency feature for Chinese-English translation, and obtained no significant improvement on the test set.
The idea then seems essentially to have been abandoned; it does not appear in later discussions (Chiang, 2007).
3 Soft Syntactic Constraints, Revisited
On the face of it, there are any number of possible reasons Chiang's (2005) soft constraint did not work, including, for example, practical issues like the quality of the Chinese parses.5 However, we focus here on two conceptual issues underlying his use of source language syntactic constituents.
4 Formally, $\phi_c(f, e)$ is defined as a binary feature, with value 1 if $f$ spans a source constituent and 0 otherwise.
In the latter case $\lambda_c \phi_c(f, e) = 0$ and the score in expression (1) is unaffected.
5 In fact, this turns out not to be the issue; see Section 4.
Figure 1: Illustration of Chiang's (2005) syntactic constituency feature, which does not distinguish among constituent types.
First, the constituency feature treats all syntactic constituent types equally, making no distinction among them.
For any given language pair, however, there might be some source constituents that tend to map naturally to the target language as units, and others that do not (Fox, 2002; Eisner, 2003).
Moreover, a parser may tend to be more accurate for some constituents than for others.
Second, the Chiang (2005) constituency feature gives a rule additional credit when the rule's source side overlaps exactly with a source-side syntactic constituent.
Logically, however, it might make sense not just to give a rule $X \rightarrow \langle e, f \rangle$ extra credit when $f$ matches a constituent, but to incur a cost when $f$ violates a constituent boundary.
Using the example in Figure 1, we might want to penalize hypotheses containing rules where $f$ is the minister gave a (and other cases, such as minister gave, minister gave a, and so forth).
These observations suggest a finer-grained approach to the constituency feature idea, retaining the idea of soft constraints, but applying them using various soft-constraint constituency features.
Our first observation argues for distinguishing among constituent types (NP, VP, etc.).
Our second observation argues for distinguishing the benefit of matching constituents from the cost of crossing constituent boundaries.
We therefore define a space of new features as the cross product
$$\{\mathrm{CP}, \mathrm{IP}, \mathrm{NP}, \mathrm{VP}, \ldots\} \times \{=, +\},$$
where = and + signify matching and crossing boundaries, respectively.6
6 This accomplishes coverage of the logically complete set of possibilities, which include not only $f$ matching a constituent exactly or crossing its boundaries, but also $f$ being properly contained within the constituent span, properly containing it, or being outside it entirely.
Whenever these latter possibilities occur, $f$ will exactly match or cross the boundaries of some other constituent.
For example, $\phi_{\mathrm{NP}=}$ would denote a binary feature that matches whenever the span of $f$ exactly covers an NP in the source-side parse tree, resulting in $\lambda_{\mathrm{NP}=}$ being added to the hypothesis score (expression (1)).
Similarly, $\phi_{\mathrm{VP}+}$ would denote a binary feature that matches whenever the span of $f$ crosses a VP boundary in the parse tree, resulting in $\lambda_{\mathrm{VP}+}$ being subtracted from the hypothesis score.7 For readability from this point forward, we will omit $\phi$ from the notation and refer to features such as NP= (which one could read as "NP match"), VP+ (which one could read as "VP crossing"), etc.
In addition to these individual features, we define three more variants:
• For each constituent type, e.g. NP, we define a feature NP_ that ties the weights of NP= and NP+.
If NP= matches a rule, the model score is incremented by $\lambda_{\mathrm{NP\_}}$, and if NP+ matches, the model score is decremented by the same quantity.
• For each constituent type, e.g. NP, we define a version of the model, NP2, in which NP= and NP+ are both included as features, with separate weights $\lambda_{\mathrm{NP}=}$ and $\lambda_{\mathrm{NP}+}$.
• We define XP= as a version of the undifferentiated constituency feature that is sensitive only to "standard" XP labels; the definitions of XP+, XP_, and XP2 are analogous.8
• Similarly, since Chiang's original constituency feature can be viewed as a disjunctive "all-labels=" feature, we also defined "all-labels+", "all-labels2", and "all-labels_" analogously.
7 Formally, $\lambda_{\mathrm{VP}+}$ simply contributes to the sum in expression (1), as with all features in the model, but weight optimization using minimum error rate training should, and does, automatically assign this feature a negative weight.
8 We map SBAR and S labels in Arabic parses to CP and IP, respectively, consistent with the Chinese parses.
We map Chinese DP labels to NP.
DNP and LCP appear only in Chinese.
We ran no ADJP experiment in Chinese, because this label virtually always spans only one token in the Chinese parses.
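The match and crossing features defined in this section can be sketched as follows. This is an illustrative sketch, not the system's implementation: the labeled parse is a hypothetical analysis of the English example sentence, spans are half-open token index pairs, and "crossing" is taken to mean partial overlap without nesting.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]              # half-open token span [start, end)
Constituent = Tuple[str, int, int]  # (label, start, end)

# Hypothetical labeled parse of "the minister gave a speech yesterday";
# the labels are assumed for illustration only.
PARSE: List[Constituent] = [
    ("NP", 0, 2), ("NP", 3, 5), ("VP", 2, 6), ("IP", 0, 6),
]


def crosses(span: Span, start: int, end: int) -> bool:
    """True if the span partially overlaps [start, end) without nesting."""
    s, e = span
    return (s < start < e < end) or (start < s < end < e)


def fine_grained_features(span: Span, parse: List[Constituent]) -> Dict[str, int]:
    """Binary features such as 'NP=' (exact match) and 'VP+' (crossing)."""
    feats: Dict[str, int] = {}
    for label, start, end in parse:
        if span == (start, end):
            feats[label + "="] = 1
        elif crosses(span, start, end):
            feats[label + "+"] = 1
    return feats


def tied_score(feats: Dict[str, int], label: str, weight: float) -> float:
    """Tied variant (e.g. NP_): add the weight on a match, subtract it on a crossing."""
    return weight * (feats.get(label + "=", 0) - feats.get(label + "+", 0))


match_feats = fine_grained_features((0, 2), PARSE)  # "the minister"
cross_feats = fine_grained_features((1, 4), PARSE)  # "minister gave a"
```

In the "2" variants (e.g. NP2) the two features would simply keep separate weights, whereas `tied_score` shows the single shared weight of the "_" variants.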
4 Experiments
We carried out MT experiments for translation from Chinese to English and from Arabic to English, using a descendant of Chiang's Hiero system.
Language models were built using the SRI Language Modeling Toolkit (Stolcke, 2002) with modified Kneser-Ney smoothing (Chen and Goodman, 1998).
Word-level alignments were obtained using GIZA++ (Och and Ney, 2000).
The baseline model in both languages used the feature set described in Section 2; for the Chinese baseline we also included a rule-based number translation feature (Chiang, 2007).
In addition to the baseline condition, and baseline plus Chiang's (2005) original constituency feature, experimental conditions augmented the baseline with additional features as described in Section 3.
All models were optimized and tested using the BLEU metric (Papineni et al., 2002) with the NIST-implemented ("shortest") effective reference length, on lowercased, tokenized outputs/references.
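To make the "shortest" effective reference length concrete, the sketch below computes it for a small set of segments and applies the standard BLEU brevity penalty. The function names and toy references are hypothetical, whitespace tokenization is assumed, and the n-gram precision part of BLEU is omitted.

```python
import math
from typing import List


def shortest_effective_ref_length(references: List[List[str]]) -> int:
    """Sum, over segments, of the length of the shortest reference translation."""
    return sum(min(len(ref.split()) for ref in refs) for refs in references)


def brevity_penalty(candidate_length: int, reference_length: int) -> float:
    """Standard BLEU brevity penalty: 1 if c >= r, else exp(1 - r / c)."""
    if candidate_length == 0:
        return 0.0
    if candidate_length >= reference_length:
        return 1.0
    return math.exp(1.0 - reference_length / candidate_length)


# Two segments, each with two toy reference translations.
refs = [["a b c", "a b c d e"], ["x y", "x"]]
r = shortest_effective_ref_length(refs)  # 3 + 1
bp = brevity_penalty(2, r)
```

Using the shortest reference per segment makes the penalty more lenient toward short outputs than using the closest reference length would.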
Statistical significance of difference from the baseline BLEU score was measured by using paired bootstrap re-sampling (Koehn, 2004).
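The idea behind paired bootstrap resampling can be sketched as below. This is a simplified illustration, not the evaluation code used here: it assumes per-sentence scores that can be summed, whereas BLEU would be recomputed from aggregated n-gram statistics on each resampled set; the function name, sample count, and seed are hypothetical.

```python
import random
from typing import List


def paired_bootstrap(sys_scores: List[float], base_scores: List[float],
                     n_samples: int = 1000, seed: int = 0) -> float:
    """Fraction of resampled test sets on which the system beats the baseline."""
    assert len(sys_scores) == len(base_scores)
    rng = random.Random(seed)
    n = len(sys_scores)
    wins = 0
    for _ in range(n_samples):
        # Draw sentence indices with replacement; both systems are scored
        # on the same resampled set, which is what makes the test paired.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(sys_scores[i] for i in idx) > sum(base_scores[i] for i in idx):
            wins += 1
    return wins / n_samples


base = [0.2, 0.3, 0.25, 0.4, 0.1]
system = [0.3, 0.4, 0.35, 0.5, 0.2]
win_rate = paired_bootstrap(system, base, n_samples=200)
```

Under this scheme, a win rate of at least 0.95 would correspond roughly to significance at the .05 level.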
For the Chinese-English translation experiments, we trained the translation model on the corpora in Table 1, totalling approximately 2.1 million sentence pairs after GIZA++ filtering for length ratio.
Chinese text was segmented using the Stanford segmenter (Tseng et al., 2005).
9 Whenever we use the word "significant", we mean "statistically significant" (at p < .05 unless specified otherwise).
Xinhua Ch/Eng Par News V1 beta
Ch/En Treebank Par Corpus
Ch/En News Mag Par Txt (Sinorama)
FBIS Multilanguage Txts
Ch News Translation Txt Pt 1
HK Par Text (only HKNews)
Table 1: Training corpora for Chinese-English translation
We trained a 5-gram language model using the English (target) side of the training set, pruning 4-gram and 5-gram singletons.
For minimum error rate training and development we used the NIST MTeval MT03 set.
Table 2 presents our results.
We first evaluated translation performance using the NIST MT06 (nisttext) set.
Like Chiang (2005), we find that the original, undifferentiated constituency feature (Chiang-05) introduces a negligible, statistically insignificant improvement over the baseline.
However, we find that several of the finer-grained constraints (IP=, [...]) yield significant improvements over baseline (up to .74 BLEU), and the latter three also improve significantly on the undifferentiated constituency feature.
By combining multiple finer-grained syntactic features, we obtain significant improvements of up to 1.65 BLEU points (NP_, VP2, IP2, all-labels_, and XP+).
We also obtained further gains using combinations of features that had performed well; e.g., condition IP2.VP2.NP_ augments the baseline features with IP2, VP2, and NP_ (tying weights of NP= and NP+; see Section 3).
Since component features in those combinations were informed by individual-feature performance on the test set, we tested the best performing conditions from MT06 on a new test set, NIST MT08.
NP= and VP+ yielded significant improvements of up to 1.53 BLEU.
Combination conditions replicated the pattern of results from MT06, including the same increasing order of gains, with improvements up to 1.11 BLEU.
Description
Ar News Trans Txt Pt 1
Ar/En Par News Pt 1
Ar/En Treebank En Translation
eTIRR Ar/En News Txt
Table 3: Training corpora for Arabic-English translation
[...] sentence pairs after GIZA++ length-ratio filtering.
We trained a trigram language model using the English side of this training set, plus the English Gigaword v2 AFP and Gigaword v1 Xinhua corpora.
Development and minimum error rate training were done using the NIST MT02 set.
Table 4 presents our results.
We first tested on the NIST MT03 and MT06 (nist-text) sets.
On MT03, the original, undifferentiated constituency feature did not improve over baseline.
Two individual finer-grained features (PP+ and AdvP=) yielded statistically significant gains up to .42 BLEU points, and feature combinations AP2, XP2 and all-labels2 yielded significant gains up to 1.03 BLEU points.
XP2 and all-labels2 also improved significantly on the undifferentiated constituency feature, by .72 and 1.11 BLEU points, respectively.
For MT06, Chiang's original feature improved the baseline significantly (this is a new result using his feature, since he did not experiment with Arabic), as did our IP=, PP=, and VP= conditions.
Adding individual features PP+ and AdvP= yielded significant improvements up to 1.4 BLEU points over baseline, and in fact the improvement for individual feature AdvP= over Chiang's undifferentiated constituency feature approaches significance (p < .075).
More important, several conditions combining features achieved statistically significant improvements over baseline of up to 1.94 BLEU points: XP2, [...] AdvP2.
Of these, AdvP2 is also a significant improvement over the undifferentiated constituency feature (Chiang-05), with p < .01.
As we did for Chinese, we tested the best-performing models on a new test set, NIST MT08.
Consistent patterns reappeared: improvements over the baseline up to [...] lead (also outperforming the undifferentiated constituency feature, p < .05).
5 Discussion
The results in Section 4 demonstrate, to our knowledge for the first time, that significant and sometimes substantial gains over baseline can be obtained by incorporating soft syntactic constraints into Hiero's translation model.
Within language, we also see considerable consistency across multiple test sets, in terms of which constraints tend to help most.
Furthermore, our results provide some insight into why the original approach may have failed to yield a positive outcome.
For Chinese, we found that when we defined finer-grained versions of the exact-match features, there was value for some constituency types in biasing the model to favor matching the source language parse.
Moreover, we found that there was significant value in allowing the model to be sensitive to violations (crossing boundaries) of source parses.
These results confirm that parser quality was not the limitation in the original work (or at least not the only limitation), since in our experiments the parser was held constant.
Looking at combinations of new features, some "double-feature" combinations (VP2, IP2) achieved large gains, although note that more is not necessarily better: combinations of more features did not yield better scores, and some did not yield any gain at all.
No conflated feature reached significance, but it is not the case that all conflated features are worse than their same-constituent "double-feature" counterparts.
We found no simple correlation between finer-grained feature scores (and/or boundary condition type) and combination or conflation scores.
Since some combinations seem to cancel individual contributions, we can conclude that the higher the number of participant features (of the kinds described here), the more likely a cancellation effect is; therefore, a "double-feature" combination is more likely to yield higher gains than a combination containing more features.
We also investigated whether non-canonical linguistic constituency labels such as PRN, FRAG, UCP and VSB introduce "noise", by means of the XP features — the XP= feature is, in fact, simply the undifferentiated constituency feature, but sensitive only to "standard" XPs.
Although the performance of XP=, XP2 and all-labels+ was similar to that of the undifferentiated constituency feature, XP+ achieved the highest gain.
Intuitively, this seems plausible: the feature says, at least for Chinese, that a translation hypothesis should incur a penalty if it is translating a substring as a unit when that substring is not a canonical source constituent.
Having obtained positive results with Chinese, we explored the extent to which the approach might improve translation using a very different source language.
Applying the approach to Arabic-English translation yielded large BLEU gains over baseline, as well as significant improvements over the undifferentiated constituency feature.
Comparing the two sets of experiments, we see that there are definitely language-specific variations in the value of syntactic constraints; for example, AdvP, the top performer in Arabic, cannot possibly perform well for Chinese, since in our parses the AdvP constituents rarely include more than a single word.
At the same time, some IP and VP variants seem to do generally well in both languages.
This makes sense, since — at least for these language pairs and perhaps more generally — clauses and verb phrases seem to correspond often on the source and target side.
We found it more surprising that no NP variant yielded much gain in Arabic; this question will be taken up in future work.
6 Related Work
Space limitations preclude a thorough review of work attempting to navigate the tradeoff between using language analyzers and exploiting unconstrained data-driven modeling, although the recent literature is full of variety and promising approaches.
We limit ourselves here to several approaches that seem most closely related.
Among approaches using parser-based syntactic models, several researchers have attempted to reduce the strictness of syntactic constraints in order to better exploit shallow correspondences in parallel training data.
Our introduction has already briefly noted Cowan et al. (2006), who relax parse-tree-based alignment to permit alignment of non-constituent subphrases on the source side, and translate modifiers using a separate phrase-based model, and DeNeefe et al. (2007), who modify syntax-based extraction and binarize trees (following (Wang et al., 2007b)) to improve phrasal coverage.
Similarly, Marcu et al. (2006) relax their syntax-based system by rewriting target-side parse trees on the fly in order to avoid the loss of "non-syntactifiable" phrase pairs.
Setiawan et al. (2007) employ a "function-word centered syntax-based approach", with synchronous CFG and extended ITG models for reordering phrases, and relax syntactic constraints by only using a small number of function words (approximated by high-frequency words) to guide the phrase-order inversion.
Zollmann and Venugopal (2006) start with a target language parser and use it to provide constraints on the extraction of hierarchical phrase pairs.
Unlike Hiero, their translation model uses a full range of named nonterminal symbols in the synchronous grammar.
As an alternative way to relax strict parser-based constituency requirements, they explore the use of phrases spanning generalized, categorial-style constituents in the parse tree, e.g. type NP/NN denotes a phrase like the great that lacks only a head noun (say, wall) in order to comprise an NP.
In addition, various researchers have explored the use of hard linguistic constraints on the source side, e.g. via "chunking" noun phrases and translating them separately (Owczarzak et al., 2006), or by performing hard reorderings of source parse trees in order to more closely approximate target-language word order (Wang et al., 2007a; Collins et al., 2005).
Finally, another soft-constraint approach that can also be viewed as coming from the data-driven side, adding syntax, is taken by Riezler and Maxwell (2006).
They use LFG dependency trees on both source and target sides, and relax syntactic constraints by adding a "fragment grammar" for unparsable chunks.
They decode using Pharaoh, augmented with their own log-linear features (such as $p(e_{\mathrm{snippet}} \mid f_{\mathrm{snippet}})$ and its converse), side by side with "traditional" lexical weights.
Riezler and Maxwell (2006) do not achieve higher BLEU scores, but do score better according to human grammaticality judgments for in-coverage cases.
7 Conclusion
When hierarchical phrase-based translation was introduced by Chiang (2005), it represented a new and successful way to incorporate syntax into statistical MT, allowing the model to exploit non-local dependencies and lexically sensitive reordering without requiring linguistically motivated parsing of either the source or target language.
An approach to incorporating parser-based constituents in the model was explored briefly, treating syntactic constituency as a soft constraint, with negative results.
In this paper, we returned to the idea of linguistically motivated soft constraints, and we demonstrated that they can, in fact, lead to substantial improvements in translation performance when integrated into the Hiero framework.
We accomplished this using constraints that not only distinguish among constituent types, but also distinguish between the benefit of matching the source parse bracketing and the cost of using phrases that cross relevant bracketing boundaries.
We demonstrated improvements for Chinese-English translation, and succeeded in obtaining substantial gains for Arabic-English translation as well.
Our results contribute to a growing body of work on combining monolingually based, linguistically motivated syntactic analysis with translation models that are closely tied to observable parallel training data.
Consistent with other researchers, we find that "syntactic constituency" may be too coarse a notion by itself; rather, there is value in taking a finer-grained approach, and in allowing the model to decide how far to trust each element of the syntactic analysis as part of the system's optimization process.
Acknowledgments
This work was supported in part by DARPA prime agreement HR0011-06-2-0001.
The authors would like to thank David Chiang and Adam Lopez for making their source code available; the Stanford Parser team and Mary Harper for making their parsers available; David Chiang, Amy Weinberg, and CLIP Laboratory colleagues, particularly Chris Dyer, Adam Lopez, and Smaranda Muresan, for discussion and invaluable assistance.
