We show for the first time that incorporating the predictions of a word sense disambiguation system within a typical phrase-based statistical machine translation (SMT) model consistently improves translation quality across all three different IWSLT Chinese-English test sets, as well as producing statistically significant improvements on the larger NIST Chinese-English MT task, and moreover never hurts performance on any test set, according not only to BLEU but to all eight most commonly used automatic evaluation metrics.
Recent work has challenged the assumption that word sense disambiguation (WSD) systems are useful for SMT.
Yet SMT translation quality still obviously suffers from inaccurate lexical choice.
In this paper, we address this problem by investigating a new strategy for integrating WSD into an SMT system that performs fully phrasal multi-word disambiguation.
Instead of directly incorporating a Senseval-style WSD system, we redefine the WSD task to match the exact same phrasal translation disambiguation task faced by phrase-based SMT systems.
Our results provide the first known empirical evidence that lexical semantics are indeed useful for SMT, despite claims to the contrary.
*This material is based upon work supported in part by the Defense Advanced Research Projects Agency (DARPA) under GALE Contract No. HR0011-06-C-0023, and by the Hong Kong Research Grants Council (RGC) research grants RGC6083/99E, RGC6256/00E, and DAG03/04.EG09. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 61-72, Prague, June 2007.
©2007 Association for Computational Linguistics
1 Introduction
Common assumptions about the role and usefulness of word sense disambiguation (WSD) models in full-scale statistical machine translation (SMT) systems have recently been challenged.
On the one hand, in previous work (Carpuat and Wu, 2005b) we obtained disappointing results when using the predictions of a Senseval WSD system in conjunction with a standard word-based SMT system: we reported slightly lower BLEU scores despite trying to incorporate WSD using a number of apparently sensible methods.
These results cast doubt on the assumption that sophisticated dedicated WSD systems that were developed independently from any particular NLP application can easily be integrated into an SMT system so as to improve translation quality through stronger models of context and rich linguistic information.
Rather, it has been argued, SMT systems have managed to achieve significant improvements in translation quality without directly addressing translation disambiguation as a WSD task.
Instead, translation disambiguation decisions are made indirectly, typically using only word surface forms and very local contextual information, forgoing the much richer linguistic information that WSD systems typically take advantage of.
On the other hand, error analysis reveals that the performance of SMT systems still suffers from inaccurate lexical choice.
In subsequent empirical studies, we have shown that SMT systems perform much worse than dedicated WSD models, both supervised and unsupervised, on a Senseval WSD task (Carpuat and Wu, 2005a), and therefore suggest that WSD should have a role to play in state-of-the-art SMT systems.
In addition to the Senseval shared tasks, which have provided standard sense inventories and data sets, WSD research has also turned increasingly to designing specific models for a particular application.
For instance, Vickrey et al. (2005) and Specia (2006) proposed WSD systems designed for French to English, and Portuguese to English translation respectively, and present a more optimistic outlook for the use of WSD in MT, although these WSD systems have not yet been integrated nor evaluated in full-scale machine translation systems.
Taken together, these seemingly contradictory results suggest that improving SMT lexical choice accuracy remains a key challenge for improving current SMT quality, and that it is still unclear what the most appropriate integration framework for WSD models in SMT is.
In this paper, we present first results with a new architecture that integrates a state-of-the-art WSD model into phrase-based SMT so as to perform multi-word phrasal lexical disambiguation, and show that this new WSD approach not only produces gains across all available Chinese-English IWSLT06 test sets for all eight commonly used automated MT evaluation metrics, but also produces statistically significant gains on the much larger NIST Chinese-English task.
The main difference between this approach and several of our earlier approaches as described in Carpuat and Wu (2005b) and subsequently Carpuat et al. (2006) lies in the fact that we focus on repurposing the WSD system for multi-word phrase-based SMT.
Rather than using a generic Senseval WSD model as we did in Carpuat and Wu (2005b), here both the WSD training and the WSD predictions are integrated into the phrase-based SMT framework.
Furthermore, rather than using a single word based WSD approach to augment a phrase-based SMT model as we did in Carpuat et al. (2006) to improve BLEU and NIST scores, here the WSD training and predictions operate on full multi-word phrasal units, resulting in significantly more reliable and consistent gains as evaluated by many other translation accuracy metrics as well.
Specifically:
• Instead of using a Senseval system, we redefine the WSD task to be exactly the same as the multi-word phrasal translation disambiguation task faced by the phrase-based SMT system.
• Instead of using predefined senses drawn from manually constructed sense inventories such as HowNet (Dong, 1998), our WSD for SMT system directly disambiguates between all phrasal translation candidates seen during SMT training.
• Instead of learning from manually annotated training data, our WSD system is trained on the same corpora as the SMT system.
However, despite these adaptations to the SMT task, the core sense disambiguation task remains pure WSD:
• The rich context features are typical of WSD and almost never used in SMT.
• The dynamic integration of context-sensitive translation probabilities is not typical of SMT.
• Although it is embedded in a real SMT system, the WSD task is exactly the same as in recent and coming Senseval Multilingual Lexical Sample tasks (e.g., Chklovski et al. (2004)), where sense inventories represent the semantic distinctions made by another language.
We begin by presenting the WSD module and the SMT integration technique.
We then show that incorporating it into a standard phrase-based SMT baseline system consistently improves translation quality across all three different test sets from the Chinese-English IWSLT text translation evaluation, as well as on the larger NIST Chinese-English translation task.
Depending on the metric, the individual gains are sometimes modest, but remarkably, incorporating WSD never hurts, and helps enough to always make it a worthwhile additional component in an SMT system.
Finally, we analyze the reasons for the improvement.
2 Problems in context-sensitive lexical choice for SMT
To the best of our knowledge, there has been no previous attempt at integrating a state-of-the-art WSD system for fully phrasal multi-word lexical choice into phrase-based SMT, with evaluation of the resulting system on a translation task.
While there are many evaluations of WSD quality, in particular the Senseval series of shared tasks (Kilgarriff and Rosenzweig (1999), Kilgarriff (2001), Mihalcea et al. (2004)), very little work has been done to address the actual integration of WSD in realistic SMT applications.
To fully integrate WSD into phrase-based SMT, it is necessary to perform lexical disambiguation on multi-word phrasal lexical units; in contrast, the model reported in Cabezas and Resnik (2005) can only perform lexical disambiguation on single words.
Like the model proposed in this paper, Cabezas and Resnik attempted to integrate phrase-based WSD models into decoding.
However, although they reported that incorporating these predictions via the Pharaoh XML markup scheme yielded a small improvement in BLEU score over a Pharaoh baseline on a single Spanish-English translation data set, we have determined empirically that applying their single-word based model to several Chinese-English datasets does not yield systematic improvements on most MT evaluation metrics (Carpuat and Wu, 2007).
The single-word model has the disadvantage of forcing the decoder to choose between the baseline phrasal translation probabilities versus the WSD model predictions for single words.
In addition, the single-word model does not generalize to WSD for phrasal lexical choice, as overlapping spans cannot be specified with the XML markup scheme.
Providing WSD predictions for phrases would require committing to a phrase segmentation of the input sentence before decoding, which is likely to hurt translation quality.
It is also necessary to focus directly on translation accuracy rather than other measures such as alignment error rate, which may not actually lead to improved translation quality; in contrast, for example, Garcia-Varea et al. (2001) and Garcia-Varea et al. (2002) show improved alignment error rate with a maximum entropy based context-dependent lexical choice model, but not improved translation accuracy.
In contrast, our evaluation in this paper is conducted on the actual decoding task, rather than intermediate tasks such as word alignment.
Moreover, in the present work, all commonly available automated MT evaluation metrics are used, rather than only BLEU score, so as to maintain a more balanced perspective.
Another problem with the context-sensitive lexical choice models of Garcia-Varea et al. is that their feature set is insufficiently rich to make much better predictions than the SMT model itself.
In contrast, our WSD-based lexical choice models are designed to directly model the lexical choice in the actual translation direction, and take full advantage of not residing strictly within the Bayesian source-channel model in order to benefit from the much richer Senseval-style feature set this facilitates.
Garcia-Varea et al. found that the best results are obtained when the training of the context-dependent translation model is fully incorporated with the EM training of the SMT system.
As described below, the training of our new WSD model, though not incorporated within the EM training, is also far more closely tied to the SMT model than is the case with traditional standalone WSD models.
In contrast with Brown et al. (1991), our approach incorporates the predictions of state-of-the-art WSD models that use rich contextual features for any phrase in the input vocabulary.
In Brown et al.'s early study of the impact of WSD on SMT performance, the authors reported improved translation quality on a French to English task, by choosing an English translation for a French word based on the single contextual feature that is most reliably discriminative.
However, this was a pilot study, which is limited to words with exactly two translation candidates, and it is not clear that the conclusions would generalize to more recent SMT architectures.
3 Problems in translation-oriented WSD
The close relationship between WSD and SMT has been emphasized since the emergence of WSD as an independent task.
However, most previous research has focused on using multilingual resources typically used in SMT systems to improve WSD accuracy, e.g., Dagan and Itai (1994), Li and Li (2002), Diab (2004).
In contrast, this paper focuses on the converse goal of using WSD models to improve actual translation quality.
Recently, several researchers have focused on designing WSD systems for the specific purpose of translation.
Vickrey et al. (2005) train a logistic regression WSD model on data extracted from automatically word aligned parallel corpora, but evaluate on a blank filling task, which is essentially an evaluation of WSD accuracy.
Specia (2006) describes an inductive logic programming-based WSD system, which was specifically designed for the purpose of Portuguese to English translation, but this system was also only evaluated on WSD accuracy, and not integrated in a full-scale machine translation system.
Ng et al. (2003) show that it is possible to use automatically word aligned parallel corpora to train accurate supervised WSD models.
The purpose of the study was to lower the annotation cost for supervised WSD, as suggested earlier by Resnik and Yarowsky (1999).
However this result is also encouraging for the integration of WSD in SMT, since it suggests that accurate WSD can be achieved using training data of the kind needed for SMT.
4 Building WSD models for phrase-based SMT
4.1 WSD models for every phrase in the input vocabulary
Just like for the baseline phrase translation model, WSD models are defined for every phrase in the input vocabulary.
Lexical choice in SMT is naturally framed as a WSD problem, so the first step of integration consists of defining a WSD model for every phrase in the SMT input vocabulary.
This differs from traditional WSD tasks, where the WSD target is a single content word.
Senseval, for instance, has either lexical sample or all-words tasks.
The target words for both categories of Senseval WSD tasks are typically only content words (primarily nouns, verbs, and adjectives), while in the context of SMT, we need to translate entire sentences, and therefore have a WSD model not only for every word in the input sentences, regardless of their POS tag, but for every phrase, including tokens such as articles, prepositions and even punctuation.
Further empirical studies have suggested that including WSD predictions for those longer phrases is a key factor to help the decoder produce better translations (Carpuat and Wu, 2007).
4.2 WSD uses the same sense definitions as the SMT system
Instead of using pre-defined sense inventories, the WSD models disambiguate between the SMT translation candidates.
In order to closely integrate WSD predictions into the SMT system, we need to formulate WSD models so that they produce features that can directly be used in translation decisions taken by the SMT system.
It is therefore necessary for the WSD and SMT systems to consider exactly the same translation candidates for a given word in the input language.
Assuming a standard phrase-based SMT system (e.g., Koehn et al. (2003)), WSD senses are thus either words or phrases, as learned in the SMT phrasal translation lexicon.
Those "sense" candidates are very different from those typically used even in dedicated WSD tasks, even in the multilingual Senseval tasks.
Each candidate is a phrase that is not necessarily a syntactic noun or verb phrase as in manually compiled dictionaries.
It is quite possible that distinct "senses" in our WSD for SMT system could be considered synonyms in a traditional WSD framework, especially in monolingual WSD.
In addition to the consistency requirements for integration, this requirement is also motivated by empirical studies, which show that predefined translations derived from sense distinctions defined in monolingual ontologies do not match the translation distinctions made by human translators (Specia et al., 2006).
SMT system
WSD training does not require any other resources than SMT training, nor any manual sense annotation.
We employ supervised WSD systems, since Senseval results have amply demonstrated that supervised models significantly outperform unsupervised approaches (see for instance the English lexical sample tasks results described by Mihalcea et al. (2004)).
Training examples are annotated using the phrase alignments learned during SMT training.
Every input language phrase is sense-tagged with its aligned output language phrase in the parallel corpus.
The phrase alignment method used to extract the WSD training data therefore depends on the one used by the SMT system.
This presents the advantage of training WSD and SMT models on exactly the same data, thus eliminating domain mismatches between Senseval data and parallel corpora.
But most importantly, this allows WSD training data to be generated entirely automatically, since the parallel corpus is automatically phrase-aligned in order to learn the SMT phrase bilexicon.
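The automatic annotation step described above can be illustrated with a minimal sketch. It assumes phrase alignments are available as pairs of end-exclusive token spans; that input format, the function names, and the romanized toy tokens are all hypothetical choices for this illustration and are not taken from the system described here.

```python
from collections import defaultdict

def extract_wsd_examples(src_tokens, tgt_tokens, phrase_pairs):
    """Turn one phrase-aligned sentence pair into sense-tagged WSD examples.

    phrase_pairs: list of ((src_start, src_end), (tgt_start, tgt_end)) with
    end-exclusive token spans (a hypothetical format chosen for this sketch).
    """
    examples = []
    for (s0, s1), (t0, t1) in phrase_pairs:
        examples.append({
            "phrase": " ".join(src_tokens[s0:s1]),
            # The aligned output-language phrase serves as the "sense" label.
            "sense": " ".join(tgt_tokens[t0:t1]),
            "span": (s0, s1),
            # The full source sentence is kept so context features can be computed.
            "sentence": src_tokens,
        })
    return examples

def group_by_phrase(examples):
    """Collect one training set per source phrase (one WSD model per phrase)."""
    by_phrase = defaultdict(list)
    for ex in examples:
        by_phrase[ex["phrase"]].append(ex)
    return dict(by_phrase)

# Toy, invented sentence pair and phrase alignment.
src = ["qing", "huan", "dao", "zhongyang", "xian"]
tgt = ["please", "transfer", "to", "central", "line"]
pairs = [((1, 2), (1, 2)), ((3, 5), (3, 5))]
examples = extract_wsd_examples(src, tgt, pairs)
```

Because the labels come straight from the phrase alignment, no manual sense annotation is involved at any point in this sketch.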
The word sense disambiguation subsystem is modeled after the best performing WSD system in the Chinese lexical sample task at Senseval-3 (Carpuat et al., 2004).
The features employed are typical of WSD and are therefore far richer than those used in most SMT systems.
The feature set consists of position-sensitive, syntactic, and local collocational features, since these features yielded the best results when combined in a naïve Bayes model on several Senseval-2 lexical sample tasks (Yarowsky and Florian, 2002).
These features scale easily to the bigger vocabulary and sense candidates to be considered in an SMT task.
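As a rough illustration of what such a feature set can look like, the sketch below extracts bag-of-words, position-sensitive, and local collocational features for a phrase span. The feature names, the window size, and the boundary markers are assumptions made for this example; syntactic features are left out because they would require a tagger or parser, and the exact feature templates of the system are not given in the text.

```python
def context_features(tokens, start, end, window=2):
    """Sparse binary features for the phrase tokens[start:end] (end-exclusive)."""
    feats = set()
    # Bag-of-words over the sentence, excluding the target phrase itself.
    for i, tok in enumerate(tokens):
        if i < start or i >= end:
            feats.add("bow=" + tok)
    # Position-sensitive features: the word found at each offset in the window.
    for off in range(1, window + 1):
        left = tokens[start - off] if start - off >= 0 else "<s>"
        right = tokens[end - 1 + off] if end - 1 + off < len(tokens) else "</s>"
        feats.add("w[-%d]=%s" % (off, left))
        feats.add("w[+%d]=%s" % (off, right))
    # Local collocations: word pairs immediately surrounding the phrase.
    l1 = tokens[start - 1] if start >= 1 else "<s>"
    l2 = tokens[start - 2] if start >= 2 else "<s>"
    r1 = tokens[end] if end < len(tokens) else "</s>"
    r2 = tokens[end + 1] if end + 1 < len(tokens) else "</s>"
    feats.add("colloc[-2,-1]=%s_%s" % (l2, l1))
    feats.add("colloc[-1,+1]=%s_%s" % (l1, r1))
    feats.add("colloc[+1,+2]=%s_%s" % (r1, r2))
    return feats
```

The same function applies unchanged to single words and to multi-word phrases, which is the property the paragraph above relies on when it says the features scale to the SMT vocabulary.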
The Senseval system consists of an ensemble of four combined WSD models:
The first model is a naïve Bayes model, since Yarowsky and Florian (2002) found this model to be the most accurate classifier in a comparative study on a subset of Senseval-2 English lexical sample data.
The second model is a maximum entropy model (Jaynes, 1978), since Klein and Manning (2002) found that this model yielded higher accuracy than naïve Bayes in a subsequent comparison of WSD performance.
The third model is a boosting model (Freund and Schapire, 1997), since boosting has consistently turned in very competitive scores on related tasks such as named entity classification.
We also use the AdaBoost.MH algorithm.
The fourth model is a Kernel PCA-based model (Wu et al., 2004).
Kernel Principal Component Analysis or KPCA is a nonlinear kernel method for extracting nonlinear principal components from vector sets where, conceptually, the n-dimensional input vectors are nonlinearly mapped from their original space R^n to a high-dimensional feature space F where linear PCA is performed, yielding a transform by which the input vectors can be mapped nonlinearly to a new set of vectors (Schölkopf et al., 1998).
WSD can be performed by a Nearest Neighbor Classifier in the high-dimensional KPCA feature space.
All these classifiers have the ability to handle large numbers of sparse features, many of which may be irrelevant.
Moreover, the maximum entropy and boosting models are known to be well suited to handling features that are highly interdependent.
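To make the first component of the ensemble concrete, here is a minimal naïve Bayes classifier over sparse binary features for a single source phrase. This is a generic textbook formulation with add-one smoothing, offered only as a sketch; the class name and interface are hypothetical, and the text does not specify the smoothing scheme or how the four models' outputs are combined.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Naive Bayes over sparse binary features, with add-one smoothing."""

    def fit(self, examples):
        # examples: iterable of (feature_set, sense_label) for ONE source phrase
        self.sense_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, sense in examples:
            self.sense_counts[sense] += 1
            for f in feats:
                self.feat_counts[sense][f] += 1
                self.vocab.add(f)
        return self

    def predict_proba(self, feats):
        total = sum(self.sense_counts.values())
        log_scores = {}
        for sense, n in self.sense_counts.items():
            denom = sum(self.feat_counts[sense].values()) + len(self.vocab)
            score = math.log(n / total)
            for f in feats:
                if f in self.vocab:  # features unseen in training are ignored
                    score += math.log((self.feat_counts[sense][f] + 1) / denom)
            log_scores[sense] = score
        # Normalize in log space for numerical stability.
        m = max(log_scores.values())
        exp = {s: math.exp(v - m) for s, v in log_scores.items()}
        z = sum(exp.values())
        return {s: v / z for s, v in exp.items()}
```

The output is a probability distribution over translation candidates, which is the form needed if the prediction is later to be used as a feature alongside the baseline translation probabilities.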
4.5 Integrating WSD predictions in phrase-based SMT architectures
It is non-trivial to incorporate WSD into an existing phrase-based architecture such as Pharaoh (Koehn, 2004), since the decoder is not set up to easily accept multiple translation probabilities that are dynamically computed in context-sensitive fashion.
For every phrase in a given SMT input sentence, the WSD probabilities can be used as an additional feature in a loglinear translation model, in combination with typical context-independent SMT bilexicon probabilities.
We overcome this obstacle by devising a calling architecture that reinitializes the decoder with dynamically generated lexicons on a per-sentence basis.
Unlike an n-best reranking approach, which is limited by the lexical choices made by the decoder using only the baseline context-independent translation probabilities, our method allows the system to make full use of WSD information for all competing phrases at all decoding stages.
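The idea of a per-sentence lexicon can be sketched as follows. The data structures, the function names, the maximum phrase length, and the small floor probability for candidates the WSD model did not score are all assumptions of this sketch; it does not reproduce the actual calling architecture around Pharaoh, only the principle of attaching a context-dependent WSD score to each phrase-table entry before decoding one sentence.

```python
import math

def build_sentence_lexicon(tokens, phrase_table, wsd_predict, max_len=3):
    """Build a phrase lexicon for ONE input sentence.

    phrase_table: {src_phrase: {tgt_phrase: baseline_prob}}  (context-independent)
    wsd_predict:  callable(tokens, start, end) -> {tgt_phrase: prob}
    Returns {(start, end): {tgt_phrase: [baseline_prob, wsd_prob]}}.
    """
    floor = 1e-6  # assumed back-off for candidates the WSD model did not score
    lexicon = {}
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            src = " ".join(tokens[start:end])
            if src not in phrase_table:
                continue
            wsd = wsd_predict(tokens, start, end)
            lexicon[(start, end)] = {
                tgt: [p, wsd.get(tgt, floor)]
                for tgt, p in phrase_table[src].items()
            }
    return lexicon

def loglinear_score(features, weights):
    """Weighted sum of log feature values, as in a log-linear model."""
    return sum(w * math.log(f) for f, w in zip(features, weights))
```

Entries are keyed by span rather than by phrase string in this sketch, so that the same source phrase occurring twice in a sentence can receive different WSD scores in its two contexts.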
5 Experimental setup
The evaluation is conducted on two standard Chinese to English translation tasks.
We follow standard machine translation evaluation procedure using automatic evaluation metrics.
Since our goal is to evaluate translation quality, we use standard MT evaluation methodology and do not evaluate the accuracy of the WSD model independently.
Table 1: Evaluation results on the IWSLT06 dataset: integrating the WSD translation predictions improves BLEU, NIST, METEOR, WER, PER, CDER and TER across all 3 different available test sets.
Table 2: Evaluation results on the NIST test set: integrating the WSD translation predictions improves
Preliminary experiments are conducted using training and evaluation data drawn from the multilingual BTEC corpus, which contains sentences used in conversations in the travel domain, and their translations in several languages.
A subset of this data was made available for the IWSLT06 evaluation campaign (Paul, 2006); the training set consists of 40000 sentence pairs, and each test set contains around 500 sentences.
We used only the pure text data, and not the speech transcriptions, so that speech-specific issues would not interfere with our primary goal of understanding the effect of integrating WSD in a full-scale phrase-based model.
A larger scale evaluation is conducted on the standard NIST Chinese-English test set (MT-04), which contains 1788 sentences drawn from newswire corpora, and is therefore of a much wider domain than the IWSLT data set.
The training set consists of about 1 million sentence pairs in the news domain.
Basic preprocessing was applied to the corpus.
The English side was simply tokenized and case-normalized.
The Chinese side was word segmented using the LDC segmenter.
Since our focus is not on a specific SMT architecture, we use the off-the-shelf phrase-based decoder Pharaoh (Koehn, 2004) trained on the IWSLT training set.
Pharaoh implements a beam search decoder for phrase-based statistical models, and presents the advantages of being freely available and widely used.
The phrase bilexicon is derived from the intersection of bidirectional IBM Model 4 alignments, obtained with GIZA++ (Och and Ney, 2003), augmented to improve recall using the grow-diag-final heuristic.
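The symmetrization step can be illustrated with a simplified sketch: start from the intersection of the two directional word alignments and grow it with neighboring links from their union. This covers only a grow-diag style expansion; the "final" step of grow-diag-final is deliberately omitted, and the function name and the representation of links as index pairs are assumptions of this example rather than the GIZA++ or Pharaoh training interface.

```python
def symmetrize(src2tgt, tgt2src):
    """Intersect two directional word alignments, then grow toward their union.

    Both inputs are sets of (src_index, tgt_index) links. Simplified
    grow-diag only; the 'final' step of grow-diag-final is omitted.
    """
    union = src2tgt | tgt2src
    alignment = src2tgt & tgt2src
    neighbors = [(-1, 0), (1, 0), (0, -1), (0, 1),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:
        added = False
        for (i, j) in sorted(alignment):  # sorted() copies, so adding is safe
            for di, dj in neighbors:
                cand = (i + di, j + dj)
                if cand not in union or cand in alignment:
                    continue
                src_free = all(a != cand[0] for a, _ in alignment)
                tgt_free = all(b != cand[1] for _, b in alignment)
                # Add a union link only if it attaches a still-unaligned word.
                if src_free or tgt_free:
                    alignment.add(cand)
                    added = True
    return alignment
```

Starting from the intersection keeps precision high, while the growing step recovers links for words that only one alignment direction covered, which is the recall improvement the paragraph above refers to.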
The language model is trained on the English side of the corpus using the SRI language modeling toolkit (Stolcke, 2002).
The loglinear model weights are learned using Chiang's implementation of the maximum BLEU training algorithm (Och, 2003), both for the baseline and for the WSD-augmented system.
Due to time constraints, this optimization was only conducted on the IWSLT task.
The weights used in the WSD-augmented NIST model are based on the best IWSLT model.
Given that the two tasks are quite different, we expect further improvements on the WSD-augmented system after running maximum BLEU optimization for the NIST task.
6 Results and discussion
Using WSD predictions in SMT yields better translation quality on all test sets, as measured by all eight commonly used automatic evaluation metrics.
Table 3: Translation examples with and without WSD for SMT, drawn from IWSLT data sets.
Please transfer to the Chuo train line.
Please turn to the Central Line.
Please transfer to Central Line.
Please get on the bus?
I need a reservation?
Do I need a reservation?
I want to reconfirm this ticket.
I would like to reconfirm a flight for this ticket.
I would like to reconfirm my reservation for this ticket.
Is there on foot?
I have an appointment for a, so please hurry.
I have another appointment, so please hurry.
Excuse me.
Could you tell me the way to Broadway?
I am sorry.
Excuse me, could you tell me the way to Broadway?
Ref.
Excuse me, I want to open an account.
Excuse me, I would like to have an account.
Excuse me, I would like to open an account.
The results are shown in Table 1 for IWSLT and Table 2 for the NIST task.
Paired bootstrap resampling shows that the improvements on the NIST test set are statistically significant at the 95% level.
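For readers unfamiliar with the test, the following is a minimal sketch of paired bootstrap resampling. It assumes per-sentence scores that can simply be summed, which is a simplification: for a corpus-level metric such as BLEU, the metric would instead be recomputed on each resampled test set. The function name, the number of samples, and the fixed seed are choices made for this illustration, not details of the evaluation reported here.

```python
import random

def paired_bootstrap(base_scores, sys_scores, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which the system beats the baseline.

    base_scores / sys_scores: per-sentence scores for the SAME test sentences.
    Summing sentence-level scores is a simplifying assumption of this sketch.
    """
    assert len(base_scores) == len(sys_scores)
    rng = random.Random(seed)
    n = len(base_scores)
    wins = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement; use them for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        base = sum(base_scores[i] for i in idx)
        sys_total = sum(sys_scores[i] for i in idx)
        if sys_total > base:
            wins += 1
    return wins / n_samples
```

Under this procedure, a returned fraction of at least 0.95 is what would be read as an improvement significant at the 95% level.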
Remarkably, integrating WSD predictions helps all the very different metrics.
In addition to the widely used BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) scores, we also evaluate translation quality with the recently proposed Meteor (Banerjee and Lavie, 2005) and four edit-distance style metrics, Word Error Rate (WER), Position-independent word Error Rate (PER) (Tillmann et al., 1997), CDER, which allows block reordering (Leusch et al., 2006), and Translation Edit Rate (TER) (Snover et al., 2006).
Note that we report Meteor scores computed both with and without using WordNet synonyms to match translation candidates and references, showing that the improvement is not due to context-independent synonym matches at evaluation time.
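Two of the edit-distance style metrics are simple enough to sketch directly. The code below computes WER as word-level Levenshtein distance normalized by reference length, and PER from bag-of-words overlap, ignoring word order. These are common single-reference formulations written for illustration; they are not the evaluation scripts used for the reported results, and details such as multi-reference handling are not covered.

```python
from collections import Counter

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(prev[j] + 1,              # extra hypothesis word
                           cur[j - 1] + 1,           # missing reference word
                           prev[j - 1] + (h != r)))  # match or substitution
        prev = cur
    return prev[-1]

def wer(hyp, ref):
    """Word error rate: edit distance normalized by reference length."""
    return edit_distance(hyp, ref) / len(ref)

def per(hyp, ref):
    """Position-independent error rate: compares bags of words, ignoring order."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)
```

Because PER ignores order, a hypothesis containing exactly the reference words in a scrambled order scores a PER of zero while its WER stays positive, which is one reason the two metrics can disagree about the same output.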
Table 4: Translation examples with and without WSD for SMT, drawn from the NIST test set.
Without any congressmen voted against him.
No congressmen voted against him.
Russia's policy in Chechnya and CIS neighbors attitude is even more worried that the United States.
Russia's policy in Chechnya and its attitude toward its CIS neighbors cause the United States still more anxiety.
As for the U.S. human rights conditions?
As for the human rights situation in the U.S.?
The purpose of my visit to Japan is pray for peace and prosperity.
The purpose of my visit is to pray for peace and prosperity for Japan.
In order to prevent terrorist activities Los Angeles, the police have taken unprecedented tight security measures.
In order to prevent terrorist activities Los Angeles, the police to an unprecedented tight security measures.
2 and 3, and 95.74% of the NIST test set.
Tables 3 and 4 show examples of translations drawn from the IWSLT and NIST test sets respectively.
A more detailed analysis reveals WSD predictions give better rankings and are more discriminative than baseline translation probabilities, which helps the final translation in three different ways.
• The rich context features help rank the correct translation first with WSD while it is ranked lower according to baseline translation probability scores.
• Even when WSD and baseline translation probabilities agree on the top translation candidate, the stronger WSD scores help override wrong language model predictions.
• The strong WSD scores for phrases help the decoder pick longer phrase translations, whereas a decoder using only baseline translation probabilities often translates those phrases in smaller chunks that include a frequent (and incorrect) translation candidate.
For instance, the top 4 Chinese sentences in Table 4 are better translated by the WSD-augmented system because the WSD scores help the decoder to choose longer phrases.
In the first example, the phrase m f£ U" is correctly translated as a whole as "No" by the WSD-augmented system, while the baseline translates each word separately yielding an incorrect translation.
In the following three examples, the WSD system encourages the decoder to translate the long phrases " m / jj H a
mS $ ac" as single units, while the baseline introduces errors by breaking them down into shorter phrases.
The last sentence in the table shows an example where the WSD predictions do not help the baseline system.
The translation quality is actually much worse, since the verb IjJ" is incorrectly translated as "to", despite the fact that the top candidate predicted by the WSD system alone is the much better translation "has taken", but with a relatively low probability of 0.509.
7 Conclusion
We have shown for the first time that integrating multi-word phrasal WSD models into phrase-based SMT consistently helps on all commonly available automated translation quality evaluation metrics on all three different test sets from the Chinese-English IWSLT06 text translation task, and yields statistically significant gains on the larger NIST Chinese-English task.
It is important to note that the WSD models never hurt translation quality, and always yield individual gains of a level that makes their integration worthwhile.
We have proposed to consistently integrate WSD models both during training, where sense definitions and sense-annotated data are automatically extracted from the word-aligned parallel corpora from SMT training, and during testing, where the phrasal WSD probabilities are used by the SMT system just like all the other lexical choice features.
Context features are derived from state-of-the-art WSD models, and the evaluation is conducted on the actual translation task, rather than intermediate tasks such as word alignment.
It is to be emphasized that this approach does not merely consist of adding a source sentence feature in the log linear model for translation.
On the contrary, it remains a real WSD task, defined just as in the Senseval Multilingual Lexical Sample tasks (e.g., Chklovski et al. (2004)).
Our model makes use of typical WSD features that are almost never used in SMT systems, and requires a dynamically created translation lexicon on a per-sentence basis.
To our knowledge this constitutes the first attempt at fully integrating state-of-the-art WSD with conventional phrase-based SMT.
Unlike previous approaches, the WSD targets are not only single words, but multi-word phrases, just as in the SMT system.
This means that, unusually, WSD senses are predicted not only for a limited set of single words or very short phrases, but for all phrases of arbitrary length that are in the SMT translation lexicon.
The single word approach, as we reported in Carpuat et al. (2006), improved BLEU and NIST scores for phrase-based SMT, but subsequent detailed empirical studies we have performed since then suggest that single word WSD approaches are less successful when evaluated under all other MT metrics. WSD predictions for longer phrases, as reported in this paper, are particularly important to improve translation quality.
The results reported in this paper cast new light on the WSD vs. SMT debate, suggesting that a close integration of WSD and SMT decisions should be incorporated in an SMT model that successfully uses WSD predictions.
Our objective here is to demonstrate that this technique works for the widest possible class of models, so we have chosen as the baseline the most widely used phrase-based SMT model.
Our positive results suggest that our experiments could be tried on other current statistical MT models, especially the growing family of tree-structured SMT models employing stochastic transduction grammars of various sorts (Wu and Chiang, 2007).
For instance, incorporating WSD predictions into an MT decoder based on inversion transduction grammars (Wu, 1997)—such as the Bracketing ITG based models of Wu (1996), Zens et al. (2004), or Cherry and Lin (2007)—would present an intriguing comparison with the present work.
It would also be interesting to assess whether a more grammatically structured statistical MT model that is less reliant on an n-gram language model, such as the syntactic ITG based "grammatical channel" translation model of Wu and Wong (1998), could make more effective use of WSD predictions.
