It is possible to reduce the bulk of phrase-tables for Statistical Machine Translation using a technique based on the significance testing of phrase pair co-occurrence in the parallel corpus.
The savings can be quite substantial (up to 90%) and cause no reduction in BLEU score.
In some cases, an improvement in BLEU is obtained at the same time although the effect is less pronounced if state-of-the-art phrasetable smoothing is employed.
1 Introduction
An important part of the process of Statistical Machine Translation (SMT) involves inferring a large table of phrase pairs that are translations of each other from a large corpus of aligned sentences.
These phrase pairs together with estimates of conditional probabilities and useful feature weights, called collectively a phrasetable, are used to match a source sentence to produce candidate translations.
The choice of the best translation is made based on the combination of the probabilities and feature weights, and much discussion has been made of how to make the estimates of probabilities, how to smooth these estimates, and what features are most useful for discriminating among the translations.
However, a cursory glance at phrasetables produced often suggests that many of the translations are wrong or will never be used in any translation.
On the other hand, most obvious ways of reducing the bulk usually lead to a reduction in translation
quality as measured by BLEU score.
This has led to an impression that these pairs must contribute something in the grand scheme of things and, certainly, more data is better than less.
Nonetheless, this bulk comes at a cost.
Large tables lead to large data structures that require more resources and more time to process and, more importantly, effort directed in handling large tables could likely be more usefully employed in more features or more sophisticated search.
In this paper, we show that it is possible to prune phrasetables using a straightforward approach based on significance testing, that this approach does not adversely affect the quality of translation as measured by BLEU score, and that savings in terms of number of discarded phrase pairs can be quite substantial.
Even more surprising, pruning can actually raise the BLEU score although this phenomenon is less prominent if state of the art smoothing of phrasetable probabilities is employed.
Section 2 reviews the basic ideas of Statistical Machine Translation as well as those of testing significance of associations in two by two contingency tables departing from independence.
From this, a filtering algorithm will be described that keeps only phrase pairs that pass a significance test.
Section 3 outlines a number of experiments that demonstrate the phenomenon and measure its magnitude.
Section 4 presents the results of these experiments.
The paper concludes with a summary of what has been learned and a discussion of continuing work that builds on these ideas.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 961-915, Prague, June 2007.
©2007 Association for Computational Linguistics
2 Background Theory
2.1 Our Approach to Statistical Machine Translation
We define a phrasetable as a set of source phrases (n-grams) s and their translations (m-grams) t, along with associated translation probabilities p(s|t) and p(t|s).
These conditional distributions are derived from the joint frequencies c(s, t) of source / target n, m-grams observed in a word-aligned parallel corpus.
These joint counts are estimated using the phrase induction algorithm described in (Koehn et al., 2003), with symmetrized word alignments generated using IBM model 2 (Brown et al., 1993).
Phrases are limited to 8 tokens in length (n, m ≤ 8).
Given a source sentence s, our phrase-based SMT system tries to find the target sentence t that is the most likely translation of s. To make search more efficient, we use the Viterbi approximation and seek the most likely combination of t and its alignment a with s, rather than just the most likely t:
$$\hat{t} = \arg\max_{t} p(t|s) \approx \arg\max_{t,a} p(t, a|s)$$
where a = (s_1, t_1, j_1), ..., (s_K, t_K, j_K); t_k are target phrases such that t = t_1 ... t_K; s_k are source phrases such that s = s_{j_1} ... s_{j_K}; and s_k is the translation of the kth target phrase t_k.
2002).
Phrase translation model probabilities are features of the form:
$$\log p(s|t, a) \approx \sum_{k=1}^{K} \log p(s_k|t_k)$$
i.e., we assume that the phrases s_k specified by a are conditionally independent, and depend only on their aligned phrases t_k.
The "forward" phrase probabilities p(t|s) are not used as features, but only as a filter on the set of possible translations: for each source phrase s that matches some n-gram in the source sentence, only the 30 top-ranked translations t according to p(t|s) are retained.
One of the reviewers has pointed out correctly that taking only the top 30 translations will interact with the subject under study; however, this pruning technique has been used as a way of controlling the width of our beam search and rebalancing search parameters would have complicated this study and taken it away from our standard practice.
The phrase translation model probabilities are smoothed according to one of several techniques as described in (Foster et al., 2006) and identified in the discussion below.
2.2 Significance testing using two by two contingency tables
Each phrase pair can be thought of as an n, m-gram (s, t) where s is an n-gram from the source side of the corpus and t is an m-gram from the target side of the corpus.
We then define: C(s, t) as the number of parallel sentences that contain one or more occurrences of s on the source side and t on the target side; C(s) the number of parallel sentences that contain one or more occurrences of s on the source side; and C(t) the number of parallel sentences that contain one or more occurrences of t on the target side.
Together with N, the number of parallel sentences, we have enough information to draw up a two by two contingency table representing the unconditional relationship between s and t. This table is shown in Table 1.
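As an illustration of these definitions, the following minimal sketch counts C(s, t), C(s), and C(t) at the sentence-pair level on a toy corpus. The helper names (`contains`, `sentence_counts`) and the three-sentence corpus are assumptions made for this example and are not taken from the system described in the paper.

```python
from typing import List, Sequence, Tuple

SentencePair = Tuple[List[str], List[str]]


def contains(sentence: Sequence[str], phrase: Sequence[str]) -> bool:
    """Return True if `phrase` occurs as a contiguous n-gram in `sentence`."""
    n = len(phrase)
    return any(list(sentence[i:i + n]) == list(phrase)
               for i in range(len(sentence) - n + 1))


def sentence_counts(corpus: Sequence[SentencePair],
                    s: Sequence[str],
                    t: Sequence[str]) -> Tuple[int, int, int, int]:
    """Return (C(s, t), C(s), C(t), N); each sentence pair counts at most once."""
    c_st = c_s = c_t = 0
    for source, target in corpus:
        has_s = contains(source, s)
        has_t = contains(target, t)
        c_s += has_s
        c_t += has_t
        c_st += has_s and has_t
    return c_st, c_s, c_t, len(corpus)


# Hypothetical toy corpus (source side, target side), already tokenized.
toy_corpus = [
    ("la maison bleue".split(), "the blue house".split()),
    ("la maison la maison".split(), "the house the house".split()),
    ("le chat".split(), "the cat".split()),
]

counts = sentence_counts(toy_corpus, ["la", "maison"], ["the", "house"])
print(counts)  # (1, 2, 1, 3)
```

Note that the second sentence pair contains both phrases twice but contributes only one to each count, which is the per-sentence-pair convention the definitions above describe.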
A standard statistical technique used to assess the importance of an association represented by a contingency table involves calculating the probability that the observed table or one that is more extreme could occur by chance assuming a model of independence.
This is called a significance test.
Introductory statistics texts describe one such test called the Chi-squared test.
There are other tests that more accurately apply to our small tables with only two rows and columns.
Table 1: Two by two contingency table for s and t
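The cells of such a table follow directly from the four quantities C(s, t), C(s), C(t), and N defined above. A reconstruction from those definitions is sketched below; the layout (rows for the source phrase, columns for the target phrase, margins in the last row and column) is an assumption.

```latex
\begin{array}{c|cc|c}
        & t               & \neg t                        &          \\ \hline
 s      & C(s,t)          & C(s) - C(s,t)                 & C(s)     \\
 \neg s & C(t) - C(s,t)   & N - C(s) - C(t) + C(s,t)      & N - C(s) \\ \hline
        & C(t)            & N - C(t)                      & N
\end{array}
```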
In particular, Fisher's exact test calculates the probability of the observed table using the hypergeometric distribution.
The p-value associated with our observed table is then calculated by summing the probabilities of the observed table and of all tables that have a larger C(s, t).
This probability is interpreted as the probability of observing by chance an association that is at least as strong as the given one and hence its significance.
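For reference, the standard form of the hypergeometric probability for a table with joint count k, and the resulting one-sided p-value, can be written as follows. The symbol p_h and the finite upper limit of the sum are notational choices made here; the sum includes the observed table itself, in line with the description of a significance test as covering the observed table or one that is more extreme.

```latex
p_h(k) = \frac{\binom{C(s)}{k}\,\binom{N - C(s)}{C(t) - k}}{\binom{N}{C(t)}},
\qquad
\text{p-value} = \sum_{k = C(s,t)}^{\min(C(s),\,C(t))} p_h(k)
```

As a quick check, when C(s) = C(t) = C(s, t) = 1 the sum has a single term, p_h(1) = 1/N.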
Agresti (1996) provides an excellent introduction to this topic and the general ideas of significance testing in contingency tables.
Fisher's exact test of significance is considered a gold standard since it represents the precise probabilities under realistic assumptions.
Tests such as the Chi-squared test or the log-likelihood-ratio test (yet another approximate test of significance) depend on asymptotic assumptions that are often not valid for small counts.
Note that the count C(s, t) can be larger or smaller than c(s, t) discussed above.
In most cases, it will be larger, because it counts all co-occurrences of s with t rather than just those that respect the word alignment.
It can be smaller though because multiple co-occurrences can occur within a single aligned sentence pair and be counted multiple times in c(s, t).
On the other hand, C(s, t) will not count
all of the possible ways that an n, m-gram match can occur within a single sentence pair; it will count the match only once per sentence pair in which it occurs.
Moore (2004) discusses the use of significance testing of word associations using the log-likelihood-ratio test and Fisher's exact test.
He shows that Fisher's exact test is often a practical method if a number of techniques are followed:
1. approximating the logarithms of factorials using commonly available numerical approximations to the log gamma function,
2. using a well-known recurrence for the hypergeometric distribution,
3. noting that few terms usually need to be summed, and
4. observing that convergence is usually rapid.
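The four techniques listed above can be combined in a few lines. The sketch below is one possible arrangement and not the implementation used for the experiments: the function names, the relative tolerance `rel_tol`, and the early-stopping rule are assumptions. It computes the first term with the log gamma function, obtains later terms from the ratio of successive hypergeometric probabilities, and stops once further terms no longer change the sum.

```python
import math


def log_hypergeom(k: int, c_s: int, c_t: int, n: int) -> float:
    """Log-probability of the table with joint count k, via log-gamma."""
    def log_choose(a: int, b: int) -> float:
        return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)
    return (log_choose(c_s, k) + log_choose(n - c_s, c_t - k)
            - log_choose(n, c_t))


def neg_log_p_value(c_st: int, c_s: int, c_t: int, n: int,
                    rel_tol: float = 1e-12) -> float:
    """Negative natural log of the one-sided Fisher p-value P(K >= c_st)."""
    k_max = min(c_s, c_t)
    log_first = log_hypergeom(c_st, c_s, c_t, n)
    # Accumulate terms relative to the first one so that nothing underflows.
    term = 1.0
    total = 1.0
    for k in range(c_st, k_max):
        # Recurrence: p(k+1)/p(k) = (C(s)-k)(C(t)-k) / ((k+1)(N-C(s)-C(t)+k+1))
        term *= ((c_s - k) * (c_t - k)) / ((k + 1) * (n - c_s - c_t + k + 1))
        total += term
        if term < rel_tol * total:
            break  # the remaining terms are negligible
    return -(log_first + math.log(total))


# A 1-1-1 table in a hypothetical corpus of 1000 sentence pairs:
# the result is close to log(1000).
print(neg_log_p_value(1, 1, 1, 1000))
```

Working in log space for the leading term and with ratios for the rest keeps the computation stable even when the p-value itself is far too small to represent directly.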
2.3 Significance pruning
The idea behind significance pruning of phrasetables is that not all of the phrase pairs in a phrasetable are equally supported by the data and that many of the weakly supported pairs could be removed because:
1. the chance of them occurring again might be low, and
2. their occurrence in the given corpus may be the result of an artifact (a combination of effects where several estimates artificially compensate for one another).
This concept is usually referred to as overfit since the model fits aspects of the training data that do not lead to improved prediction.
Phrase pairs that cannot stand on their own by demonstrating a certain level of significance are suspect and removing them from the phrasetable may
be beneficial in terms of reducing the size of data structures.
This will be shown to be the case in rather general terms.
Note that this pruning may and quite often will remove all of the candidate translations for a source phrase.
This might seem to be a bad idea but it must be remembered that deleting longer phrases will allow combinations of shorter phrases to be used and these might have more and better translations from the corpus.
Here is part of the intuition about how phrasetable smoothing may interact with phrasetable pruning: both are discouraging longer but infrequent phrases from the corpus in favour of combinations of more frequent, shorter phrases.
Because the probabilities involved below will be so incredibly tiny, we will work instead with the negative of the natural logs of the probabilities.
Thus instead of selecting phrase pairs with a p-value less than exp(-20), we will select phrase pairs with a negative-log-p-value greater than 20.
This has the advantage of working with ordinary-sized numbers and the happy convention that bigger means more pruning.
An important special case of a table occurs when a phrase pair occurs exactly once in the corpus, and each of the component phrases occurs exactly once in its side of the parallel corpus.
These phrase pairs will be referred to as 1-1-1 phrase pairs and the corresponding tables will be called 1-1-1 contingency tables because C(s) = 1, C(t) = 1, and C(s, t) = 1.
Moore (2004) comments that the p-value for these tables under Fisher's exact test is 1/N. Since we are using thresholds of the negative logarithm of the p-value, the value α = log(N) is a useful threshold to consider.
In particular, α + ε (where ε is an appropriately small positive number) is the smallest threshold that results in none of the 1-1-1 phrase pairs being included.
Similarly, α - ε is the largest threshold that results in all of the 1-1-1 phrase pairs being included.
Because 1-1-1 phrase pairs can make up a large part of the phrasetable, this is an important observation for its own sake.
Since the contingency table with C(s, t) = 1 having the greatest significance (lowest p-value) is the 1-1-1 table, the threshold α + ε can be used to exclude all of the phrase pairs occurring exactly once (C(s, t) = 1).
The common strategy of deleting all of the 1-count phrase pairs is very similar in effect to the use of the α + ε threshold.
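The role of the two thresholds on either side of log(N) can be illustrated with a small filter over precomputed negative-log-p-values. This is a sketch under stated assumptions: the corpus size, the offset, the phrase pairs, and their scores are all hypothetical, and `prune` is a name introduced here. Entries are kept when their score is not below the threshold.

```python
import math
from typing import Dict, Tuple

PhrasePair = Tuple[str, str]


def prune(scores: Dict[PhrasePair, float],
          threshold: float) -> Dict[PhrasePair, float]:
    """Keep phrase pairs whose negative-log-p-value is at least `threshold`."""
    return {pair: v for pair, v in scores.items() if v >= threshold}


N = 1_000_000          # hypothetical number of parallel sentences
alpha = math.log(N)    # negative-log-p-value of every 1-1-1 phrase pair
epsilon = 0.01         # hypothetical small positive offset

# Hypothetical negative-log-p-values for three phrase pairs.
scores = {
    ("maison", "house"): 250.0,        # strongly supported pair
    ("chat noir", "black cat"): 21.5,  # moderately supported pair
    ("xyzzy", "plugh"): alpha,         # a 1-1-1 phrase pair
}

kept_below = prune(scores, alpha - epsilon)  # 1-1-1 pairs survive
kept_above = prune(scores, alpha + epsilon)  # 1-1-1 pairs are removed
print(len(kept_below), len(kept_above))      # prints: 3 2
```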
3 Experiments
The corpora used for most of these experiments are publicly available and have been used for a number of comparative studies (Workshop on Statistical Machine Translation, 2006).
Provided as part of the materials for the shared task are parallel corpora for French-English, Spanish-English, and German-English as well as language models for English, French, Spanish, and German.
These are all based on the Europarl resources (Europarl, 2003).
The only change made to these corpora was to convert them to lowercase and to Unicode UTF-8.
Phrasetables were produced by symmetrizing IBM2 conditional probabilities as described above.
The phrasetables were then used as a list of n, m-grams for which counts C(s, t), C(s), and C(t) were obtained.
Negative-log-p-values under Fisher's exact test were computed for each of the phrase pairs in the phrasetable and the entry was censored if the negative-log-p-value for the test was below the pruning threshold.
The entries that are kept are ones that are highly significant.
A number of combinations involving many different pruning thresholds were considered: no pruning, 10, α - ε, α + ε, 15, 20, 25, 50, 100, and 1000.
In addition, a number of different phrasetable smoothing algorithms were used: no smoothing, Good-Turing smoothing, Kneser-Ney 3 parameter smoothing and the loglinear mixture involving two features called Zens-Ney (Foster et al., 2006).
Table 2: Corpus Sizes and α Values (columns: number of parallel sentences, α)
Figure 1: French → English WMT06 results (panels: BLEU by Pruning Threshold; Phrasetable Size by Pruning Threshold; BLEU by Phrasetable Size)
To test the effects of significance pruning on larger corpora, a series of experiments was run on a much larger corpus based on that distributed for MT06 Chinese-English (NIST MT, 2006).
Since the objective was to assess how the method scaled, we used our preferred phrasetable smoothing technique of Zens-Ney and separated our corpus into two phrasetables, one based on the UN corpus and the other based on the best of the remaining parallel corpora available to us.
Different pruning thresholds were considered: no pruning, 14, 16, 18, 20, and 25.
In addition, another more aggressive method of pruning was attempted.
Moore points out, correctly, that phrase pairs that occur in only one sentence pair (C(s, t) = 1) are less reliable and might require more special treatment.
These are all pruned automatically at thresholds of 16 and above but not at a threshold of 14.
A special series of runs was done for threshold 14 with all of these singletons removed to see whether at these thresholds it was the significance level or the pruning of phrase pairs with C(s, t) = 1 that was more important.
This is identified as 14' in the results.
4 Results
The results of the experiments are described in Tables 2 through 6.
Table 2 presents the sizes of the various parallel corpora showing the number of parallel sentences, N, for each of the experiments, together with the α thresholds (α = log(N)).
Table 3 shows the sizes of the phrasetables that result from the various pruning thresholds described for the WMT06 data.
It is clear that this is extremely aggressive pruning at the given levels.
Table 4 shows the corresponding phrasetable sizes for the large corpus Chinese-English data.
The pruning is not as aggressive as for the WMT06 data but still quite sizeable.
Tables 5 and 6 show the main results for the WMT06 and the Chinese-English large corpus experiments.
To make these results more graphic, Figure 1 shows the French → English data from the WMT06 results in the form of three graphs.
Note that an artificial separation of 1 BLEU point has been introduced into these graphs to separate them.
Without this, they lie on top of each other and hide the essential point.
In compensation, the scale for the BLEU co-ordinate has been removed.
Table 3: WMT06: Distinct phrase pairs by pruning threshold
Table 4: Chinese-English: Distinct phrase pairs by pruning threshold
These results are summarized in the following subsections.
In Tables 5 and 6, the largest BLEU score for each set of runs has been marked in bold font.
In addition, to highlight that there are many near ties for largest BLEU, all BLEU scores that are within 0.1 of the best are also marked in bold.
When this is done it becomes clear that pruning at a level of 20 for the WMT06 runs would not reduce BLEU in most cases and in many cases would actually increase it.
A pruning threshold of 20 corresponds to discarding roughly 90% of the phrasetable.
For the Chinese-English large corpus runs, a level of 16 seems to be about the best with a small increase in BLEU and a 60% to 70% reduction in the size of the phrasetable.
Another view of this can be taken from Tables 5 and 6.
The fraction of the phrasetable retained is a more or less simple function of pruning threshold as shown in Tables 3 and 4.
By including the percentages in Tables 5 and 6, we can see that BLEU goes up as the fraction of the phrasetable retained approaches the range between 20% and 30%.
This seems to be a relatively stable observation across the experiments.
It is also easily explained by its strong relationship to pruning threshold.
Table 6 shows that this is not just a small corpus phenomenon.
There is both a sizeable benefit in phrasetable reduction and a modest improvement in BLEU even in this case.
4.4 Is this just the same as phrasetable smoothing?
One question that occurred early on was whether this improvement in BLEU is somehow related to the improvement in BLEU that occurs with phrasetable smoothing.
It appears that the answer is, in the main, yes, although there is definitely something else going on.
It is true that the benefit in terms of BLEU is lessened for better types of phrasetable smoothing but the benefit in terms of the reduction in bulk holds.
It is reassuring to see that no harm to BLEU is done by removing even 80% of the phrasetable.
Another question that came up is the role of phrase pairs that occur only once: C(s, t) = 1.
In particular as discussed above, the most significant of these are the 1-1-1 phrase pairs whose components also only occur once: C(s) = 1, and C(t) = 1.
These phrase pairs are amazingly frequent in the phrasetables and are pruned in all of the experiments except when the pruning threshold is equal to 14.
The Chinese-English large corpus experiments give us a good opportunity to show that significance level seems to be more of an issue than the case that C(s, t) = 1.
Note that we could have kept the phrase pairs whose marginal counts were greater than one but most of these are of lower significance and likely are pruned already by the threshold.
The given configuration was considered the most likely to yield a benefit and its poor performance led to the whole idea being put aside.
5 Conclusions and Continuing Work
To sum up, the main conclusions are five in number:
1. Phrasetables produced by the standard Diag-And method (Koehn et al., 2003) can be aggressively pruned using significance pruning without worsening BLEU.
2. If phrasetable smoothing is not done, the BLEU score will improve under aggressive significance pruning.
3. If phrasetable smoothing is done, the improvement is small or negligible but there is still no loss on aggressive pruning.
4. The preservation of BLEU score in the presence of large-scale pruning is a strong effect in small and moderate size phrasetables, but occurs also in much larger phrasetables.
5. In larger phrasetables based on larger corpora, the percentage of the table that can be discarded appears to decrease.
This is plausible since a similar effect (a decrease in the benefit of smoothing) has been noted with phrasetable smoothing (Foster et al., 2006).
Together these results suggest that, for these corpus sizes, the increase in the number of strongly supported phrase pairs is greater than the increase in the number of poorly supported pairs, which agrees with intuition.
Although there may be other approaches to pruning that achieve a similar effect, the use of Fisher's exact test is mathematically and conceptually one of the simplest since it asks a question separately for each phrase pair: "Considering this phrase pair in isolation of any other analysis on the corpus, could it have occurred plausibly by purely random processes inherent in the corpus construction?"
If the answer is "Yes", then it is hard to argue that the phrase pair is an association of general applicability from the evidence in this corpus alone.
Note that the removal of 1-count phrase pairs is subsumed by significance pruning with a threshold greater than α and many of the other simple approaches (from an implementation point of view) are more difficult to justify as simply as the above significance test.
Nonetheless, there remains work to do in determining if computationally simpler approaches do as well.
Moore's work suggests that log-likelihood-ratio would be a cheaper and accurate enough alternative, for example.
We will now return to the interaction of the selection in our beam search of the top 30 candidates based on forward conditional probabilities.
This will affect our results but most likely in the following manner:
1. For very small thresholds, the beam will become much wider and the search will take much longer.
In order to allow the experiments to complete in a reasonable time, other means will need to be employed to reduce the choices.
This reduction will also interact with the significance pruning but in a less understandable manner.
2. For large thresholds, there will not be 30 choices and so there will be no effect.
3. For intermediate thresholds, the extra pruning might reduce BLEU score but by a small amount because most of the best choices are included in the search.
Using thresholds that remove most of the phrase-table would no doubt qualify as large thresholds so the question is addressing the true shape of the curve for smaller thresholds and not at the expected operating levels.
Nonetheless, this is a subject for further study, especially as we consider alternatives to our "filter 30" approach for managing beam width.
There are a number of important ways that this work can and will be continued.
The code base for taking a list of n, m-grams and computing the required frequencies for significance evaluation can be applied to related problems.
For example, skip-n-grams (n-grams that allow for gaps of fixed or variable size) may be studied better using this approach leading to insight about methods that weakly approximate patterns.
The original goal of this work was to better understand the character of phrasetables, and it remains a useful diagnostic technique.
It will hopefully lead to more understanding of what it takes to make a good phrasetable especially for languages that require morphological analysis or segmentation to produce good tables using standard methods.
The negative-log-p-value promises to be a useful feature and we are currently evaluating its merits.
6 Acknowledgement
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.
References
Alan Agresti. 1996. An Introduction to Categorical Data Analysis. Wiley.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter estimation. Computational Linguistics, 19(2):263-312, June.
Philipp Koehn. 2003. Europarl: A Multilingual Corpus for Evaluation of Machine Translation. Unpublished draft. See http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/europarl.pdf
George Foster, Roland Kuhn, and Howard Johnson. 2006. Phrasetable Smoothing for Statistical Machine Translation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Sydney, Australia.
Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 1995, pages 181-184, Detroit, Michigan. IEEE.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Eduard Hovy, editor, Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 127-133, Edmonton, Alberta, Canada, May. NAACL.
Robert C. Moore. 2004. On Log-Likelihood-Ratios and the Significance of Rare Events. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain.
Franz Josef Och. 2003. Minimum error rate training for statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, July.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU: A method for automatic evaluation of Machine Translation. Technical Report RC22176, IBM, September.
Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP) 2002, Denver, Colorado, September.
Richard Zens and Hermann Ney. 2004. Improvements in phrase-based statistical machine translation. In Proceedings of Human Language Technology Conference / North American Chapter of the ACL, Boston, May.
Table 5: Results: BLEU by type of smoothing and pruning threshold
Table 6: Chinese Results: BLEU by pruning threshold
Zens-Ney Smoothing applied to all phrasetables
