We consider here the problem of Base Noun Phrase translation.
We propose a new method to perform the task.
For a given Base NP, we first search the web for its translation candidates.
We next determine the possible translation(s) from among the candidates using one of the two methods that we have developed.
In one method, we employ an ensemble of Naïve Bayesian Classifiers constructed with the EM Algorithm.
In the other method, we use TF-IDF vectors also constructed with the EM Algorithm.
Experimental results indicate that the coverage and accuracy of our method are significantly better than those of the baseline methods relying on existing technologies.
Introduction
We address here the problem of Base NP translation, in which, for a given Base Noun Phrase in a source language (e.g., 'information age' in English), we are to find its possible translation(s) in a target language (e.g., ' ' in Chinese).
We define a Base NP as a simple and non-recursive noun phrase.
In many cases, Base NPs represent holistic and non-divisible concepts, and thus accurate translation of them from one language to another is extremely important in applications like machine translation, cross language information retrieval, and foreign language writing assistance.
In this paper, we propose a new method for Base NP translation, which contains two steps: (1) translation candidate collection, and (2) translation selection.
In translation candidate collection, for a given Base NP in the source language, we look for its translation candidates in the target language.
To do so, we use a word-to-word translation dictionary and corpus data in the target language on the web.
Hang Li, Microsoft Research Asia, hangli@microsoft.com
In translation selection, we determine the possible translation(s) from among the candidates.
We use non-parallel corpus data in the two languages on the web and employ one of the two methods which we have developed.
In the first method, we view the problem as that of classification and employ an ensemble of Naïve Bayesian Classifiers constructed with the EM Algorithm.
We will use 'EM-NBC-Ensemble' to denote this method, hereafter.
In the second method, we view the problem as that of calculating similarities between context vectors and use TF-IDF vectors also constructed with the EM Algorithm.
We will use 'EM-TF-IDF' to denote this method.
Experimental results indicate that our method is very effective, and the coverage and top 3 accuracy of translation at the final stage are 91.4% and 79.8%, respectively.
The results are significantly better than those of the baseline methods relying on existing technologies.
The higher performance of our method can be attributed to the enormity of the web data used and the employment of the EM Algorithm.
Related Work
2.1 Translation with Non-parallel Corpora
A straightforward approach to word or phrase translation is to use parallel bilingual corpora (e.g., Brown et al., 1993).
Parallel corpora are, however, difficult to obtain in practice.
To deal with this difficulty, a number of methods have been proposed, which make use of relatively easily obtainable non-parallel corpora (e.g., Fung and Yee, 1998; Rapp, 1999; Diab and Finch, 2000).
Within these methods, it is usually assumed that a number of translation candidates for a word or phrase are given (or can be easily collected) and the problem is focused on translation selection.
All of the proposed methods manage to find out the translation(s) of a given word or phrase, on the basis of the linguistic phenomenon that the contexts of a translation tend to be similar to the contexts of the given word or phrase.
Fung and Yee (1998), for example, proposed to represent the contexts of a word or phrase with a real-valued vector (e.g., a TF-IDF vector), in which one element corresponds to one word in the contexts.
In translation selection, they select the translation candidates whose context vectors are the closest to that of the given word or phrase.
Since the context vector of the word or phrase to be translated corresponds to words in the source language, while the context vector of a translation candidate corresponds to words in the target language, and further the words in the source language and those in the target language have a many-to-many relationship (i.e., translation ambiguities), it is necessary to accurately transform the context vector in the source language to a context vector in the target language before distance calculation.
The vector-transformation problem was not, however, well-resolved previously.
Fung and Yee assumed that in a specific domain there is only one-to-one mapping relationship between words in the two languages.
The assumption is reasonable in a specific domain, but is too strict in the general domain, in which we presume to perform translation here.
A straightforward extension of Fung and Yee's assumption to the general domain is to restrict the many-to-many relationship to that of many-to-one mapping (or one-to-one mapping).
This approach, however, has a drawback of losing information in vector transformation, as will be described.
For other methods using non-parallel corpora, see also (Tanaka and Iwasaki, 1996; Kikui, 1999; Koehn and Knight, 2000; Sumita, 2000; Nakagawa, 2001; Gao et al., 2001).
2.2 Translation Using Web Data
The web is an extremely rich source of data for natural language processing, not only in terms of data size but also in terms of data type (e.g., multilingual data, link data).
Recently, a new trend has arisen in natural language processing that tries to bring new breakthroughs to the field by effectively using web data (e.g., Brill et al., 2001).
Nagata et al. (2001), for example, proposed to collect partial parallel corpus data on the web to create a translation dictionary.
They observed that there are many partial parallel corpora between English and Japanese on the web, and most typically English translations of Japanese terms (words or phrases) are parenthesized and inserted immediately after the Japanese terms in documents written in Japanese.
Base Noun Phrase Translation
Our method for Base NP translation comprises two steps: translation candidate collection and translation selection.
In translation candidate collection, we look for translation candidates of a given Base NP.
In translation selection, we find out possible translation(s) from the translation candidates.
In this paper, we confine ourselves to translation of noun-noun pairs from English to Chinese; our method, however, can be extended to translations of other types of Base NPs between other language pairs.
3.1 Translation Candidate Collection
We use heuristics for translation candidate collection.
Figure 1 illustrates the process of collecting Chinese translation candidates for an English Base NP 'information age' with the heuristics.
1. Input 'information age';
2. Consult English-Chinese word translation dictionary;
3. Compositionally create translation candidates in Chinese;
4. Search the candidates on web sites in Chinese and obtain the document frequencies of them (i.e., numbers of documents containing them);
5. Output candidates having non-zero document frequencies and the document frequencies.
Figure 1. Translation candidate collection
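The steps of Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_DICT` and `TOY_DOC_FREQ` are hypothetical stand-ins for the translation dictionary and for web search hit counts, and the sketch assumes the Chinese candidate keeps the word order of the English noun-noun pair.

```python
from itertools import product

# Hypothetical toy dictionary: English word -> possible Chinese translations.
TOY_DICT = {
    "information": ["信息", "情报", "资料"],
    "age": ["时代", "年龄"],
}

# Hypothetical document frequencies standing in for web search hit counts.
TOY_DOC_FREQ = {"信息时代": 10000, "情报时代": 12, "信息年龄": 3}


def collect_candidates(base_np, dictionary, doc_freq):
    """Compose candidates word by word; keep those with non-zero document frequency."""
    options = [dictionary.get(word, []) for word in base_np.split()]
    candidates = {}
    for combo in product(*options):  # every combination of word translations
        phrase = "".join(combo)      # assumption: same word order as in English
        df = doc_freq(phrase)
        if df > 0:
            candidates[phrase] = df
    return candidates


result = collect_candidates(
    "information age", TOY_DICT, lambda phrase: TOY_DOC_FREQ.get(phrase, 0)
)
```

A word with no dictionary entry yields an empty candidate set, which corresponds to the missing-dictionary-entry failures discussed in the error analysis.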
3.2 Translation Selection: EM-NBC-Ensemble
We view the translation selection problem as that of classification and employ EM-NBC-Ensemble to perform the task.
For ease of explanation, we first describe the algorithm using only EM-NBC and then extend it to EM-NBC-Ensemble.
Basic Algorithm
Let E denote a set of words in English, and C a set of words in Chinese.
Suppose that | E | = m and | C | = n .
Let e represent a random variable on E and c a random variable on C. Figure 2 describes the algorithm.
Figure 2. Algorithm of EM-NBC-Ensemble
Context Information
As input data, we use 'contexts' in English which contain the phrase to be translated.
We also use contexts in Chinese which contain the translation candidates.
Here, a context containing a phrase is defined as the surrounding words within a window of a predetermined size that covers the phrase.
We can easily obtain the data by searching for them on the web.
Actually, the contexts containing the candidates are obtained at the same time when we conduct translation candidate collection (Step 4 in Figure 1).
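As an illustration of the context definition above, the following sketch counts the words that fall within a symmetric window around each occurrence of a phrase. The function name `context_counts`, the whitespace tokenization, and the example sentence are assumptions made for this example only.

```python
from collections import Counter


def context_counts(tokens, phrase, window):
    """Count words within `window` tokens on either side of each occurrence of `phrase`."""
    n = len(phrase)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            left = tokens[max(0, i - window):i]
            right = tokens[i + n:i + n + window]
            counts.update(left + right)  # the phrase itself is excluded
    return counts


tokens = "the internet has ushered in the information age of global communication".split()
vec = context_counts(tokens, ["information", "age"], window=3)
```

The resulting counter plays the role of a frequency vector of context words; a larger `window` would also pick up 'internet' in this sentence.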
EM Algorithm
We define a relation between $E$ and $C$ as $R \subseteq E \times C$, which represents the links in a translation dictionary.
We further define $\Gamma_c = \{e \mid (e, c) \in R\}$.
We estimate the parameters of the distribution by using the Expectation and Maximization (EM) Algorithm (Dempster et al., 1977).
Next, we estimate the parameters by iteratively updating them, until they converge (cf., Figure 3).
Finally, we calculate $f_E(c)$ for all $c \in C$, which gives the transformed frequency vector in Chinese $D = (f_E(c_1), f_E(c_2), \ldots, f_E(c_n))$.
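The transformation can be illustrated with a standard EM procedure for a mixture model in which each English context word is generated by one of its dictionary translations. The sketch below is not a reproduction of Figure 3: the uniform initialization, the fixed number of iterations, and the choice to return expected counts as $f_E(c)$ are all assumptions made for this example.

```python
from collections import defaultdict


def em_transform(freq_en, links, iterations=100):
    """Distribute English frequencies f(e) over Chinese words linked by the dictionary.

    freq_en: dict mapping English word e -> f(e)
    links:   iterable of (e, c) pairs, i.e. the relation R
    Returns a dict mapping Chinese word c -> expected frequency.
    """
    links = set(links)
    cands = sorted({c for _, c in links})
    gamma = {c: sorted(e for e, c2 in links if c2 == c) for c in cands}
    # Assumed initialization: uniform P(c), and uniform P(e|c) over the linked words.
    p_c = {c: 1.0 / len(cands) for c in cands}
    p_e_given_c = {c: {e: 1.0 / len(gamma[c]) for e in gamma[c]} for c in cands}
    f_chinese = {c: 0.0 for c in cands}
    for _ in range(iterations):
        # E-step: split each f(e) according to the posterior P(c|e).
        expected = {c: defaultdict(float) for c in cands}
        for e, f in freq_en.items():
            joint = {c: p_c[c] * p_e_given_c[c].get(e, 0.0) for c in cands}
            z = sum(joint.values())
            if z == 0.0:
                continue  # e has no dictionary link
            for c, j in joint.items():
                if j > 0.0:
                    expected[c][e] += f * j / z
        # M-step: re-estimate P(c) and P(e|c) from the expected counts.
        f_chinese = {c: sum(expected[c].values()) for c in cands}
        total = sum(f_chinese.values())
        if total == 0.0:
            break
        p_c = {c: f_chinese[c] / total for c in cands}
        for c in cands:
            if f_chinese[c] > 0.0:
                p_e_given_c[c] = {e: v / f_chinese[c] for e, v in expected[c].items()}
    return f_chinese


toy_links = [("internet", "因特网"), ("internet", "网络"), ("network", "网络")]
d_vector = em_transform({"internet": 10, "network": 4}, toy_links)
```

In this toy run the frequency of 'internet' is shared between its two translations, while the total mass of the vector is preserved.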
Prior Probability Estimation
At Step 2, we approximately estimate the prior probability $P(\tilde{c})$ by using the document frequencies of the translation candidates.
The data are obtained when we conduct candidate collection (Step 4 in Figure 1).
At Step 2, we use an EM-based Naïve Bayesian Classifier (EM-NBC) to select the candidates $\tilde{c}$ whose posterior probabilities $P(\tilde{c} \mid D)$ are the largest (Equation (3)).
Equation (3) is based on Bayes' rule and the assumption that the data in $D$ are independently generated from $P(c \mid \tilde{c})$, $c \in C$.
In our implementation, we use an equivalent form, Equation (4), where $\alpha > 1$ is an additional parameter used to emphasize the prior information.
If we ignore the first term in Equation (4), then the use of one EM-NBC turns out to select the candidate whose frequency vector is the closest to the transformed vector $D$ in terms of KL divergence (cf. Cover and Thomas, 1991).
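A scoring rule consistent with this description is $\alpha \log P(\tilde{c}) + \sum_{c} f_E(c) \log P(c \mid \tilde{c})$, i.e. a weighted log prior plus the log likelihood of the transformed vector. The sketch below implements that form as an illustration; it is a reading of the text rather than the exact equation, and the add-one smoothing, the `vocab_size` argument, and the toy counts are assumptions.

```python
import math


def nbc_score(prior, cand_counts, d_vector, vocab_size, alpha=5.0, smoothing=1.0):
    """Return alpha * log P(c~) + sum_c f_E(c) * log P(c | c~).

    prior:       estimate of P(c~), e.g. from document frequencies
    cand_counts: counts of context words observed around the candidate c~
    d_vector:    transformed frequency vector D (Chinese word -> f_E(c))
    P(c | c~) is estimated with additive smoothing (an assumption of this sketch).
    """
    total = sum(cand_counts.values()) + smoothing * vocab_size
    score = alpha * math.log(prior)
    for c, f in d_vector.items():
        p = (cand_counts.get(c, 0.0) + smoothing) / total
        score += f * math.log(p)
    return score


D = {"网络": 6.0, "技术": 2.0}
score_a = nbc_score(0.5, {"网络": 30, "技术": 10}, D, vocab_size=3)
score_b = nbc_score(0.5, {"年龄": 40}, D, vocab_size=3)
```

With equal priors, the candidate whose context distribution resembles $D$ receives the higher score, which is the KL-divergence behaviour noted above. The default `alpha=5.0` mirrors the value reported in the experiments.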
To further improve performance, we use an ensemble (i.e., a linear combination) of EM-NBCs, in which the classifiers are constructed on the basis of the data in different contexts with different window sizes.
More specifically, we calculate a linear combination of the posterior probabilities obtained from $D_i$ ($i = 1, \ldots, s$), where $D_i$ denotes the data in different contexts.
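One simple way to realize such a linear combination is to normalize the scores of each classifier into posteriors and average them with equal weights. The equal weighting and the softmax normalization of log scores are assumptions of this sketch, as are the function names.

```python
import math


def posteriors(log_scores):
    """Normalize per-candidate log scores into probabilities that sum to one."""
    m = max(log_scores.values())
    exp = {c: math.exp(s - m) for c, s in log_scores.items()}  # shift for stability
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def ensemble(log_scores_per_window):
    """Average the posteriors of s classifiers, one per window size."""
    s = len(log_scores_per_window)
    combined = {}
    for log_scores in log_scores_per_window:
        for c, p in posteriors(log_scores).items():
            combined[c] = combined.get(c, 0.0) + p / s
    return combined


combined = ensemble([{"A": -1.0, "B": -2.0}, {"A": -3.0, "B": -2.5}])
best = max(combined, key=combined.get)
```

Here the two classifiers disagree, and the averaged posterior favours the candidate that the more confident classifier prefers.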
3.3 Translation Selection: EM-TF-IDF
We view the translation selection problem as that of calculating similarities between context vectors and use as context vectors TF-IDF vectors constructed with the EM Algorithm.
Figure 4 describes the algorithm, in which we use the same notations as those in EM-NBC-Ensemble.
The idf value of a Chinese word $c$ is calculated in advance as
$$\mathrm{idf}(c) = -\log\left(\mathrm{df}(c) / F\right) \qquad (6)$$
where $\mathrm{df}(c)$ denotes the document frequency of $c$ and $F$ the total document frequency.
Figure 4. Algorithm of EM-TF-IDF
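The idf weighting of Equation (6) and a similarity computation between context vectors can be sketched as follows. Cosine similarity is used here as one common choice of similarity measure; that choice, the function names, and the toy document frequencies are assumptions of this example.

```python
import math


def idf(df_c, total_df):
    """Equation (6): idf(c) = -log(df(c) / F)."""
    return -math.log(df_c / total_df)


def tfidf(freqs, df, total_df):
    """Weight each frequency by the idf of its word; words without a df are dropped."""
    return {c: f * idf(df[c], total_df) for c, f in freqs.items() if c in df}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


toy_df = {"网络": 100, "技术": 50, "的": 900}  # hypothetical document frequencies
F = 1000                                       # hypothetical total document frequency
vec_d = tfidf({"网络": 6, "技术": 2, "的": 8}, toy_df, F)      # transformed vector D
vec_1 = tfidf({"网络": 30, "技术": 10, "的": 40}, toy_df, F)   # candidate 1 contexts
vec_2 = tfidf({"的": 50, "技术": 1}, toy_df, F)                # candidate 2 contexts
```

The idf weights reduce the influence of a very frequent word such as '的', so candidate 1 is judged closer to $D$ than candidate 2.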
3.4 Advantage of Using EM Algorithm
The uses of EM-NBC-Ensemble and EM-TF-IDF can be viewed as extensions of existing methods for word or phrase translation using non-parallel corpora.
Particularly, the use of the EM Algorithm can help to accurately transform a frequency vector from one language to another.
Suppose that we are to determine if ' ' is a translation of 'information age' (actually it is).
The frequency vectors of context words for 'information age' and ' ' are given in A and D in Figure 5, respectively.
If for each English word we only retain the link connecting to the Chinese translation with the largest frequency (a link represented as a solid line) to establish a many-to-one mapping and transform vector A from English to Chinese, we obtain vector B. It turns out, however, that vector B is quite different from vector D, although they should be similar to each other.
We will refer to this method as 'Major Translation' hereafter.
With EM, vector A in Figure 5 is transformed into vector C, which is much closer to vector D, as expected.
Specifically, EM can split the frequency of a word in English and distribute it among its translations in Chinese in a theoretically sound way (cf. the distributed frequencies of 'internet').
Note that if we assume a many-to-one (or one-to-one) mapping relationship, then the use of EM turns out to be equivalent to that of Major Translation.
Figure 5. Example of frequency vector transformation
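For comparison, the Major Translation baseline described above can be sketched as a many-to-one mapping in which each English word passes its whole frequency to a single Chinese translation. The text does not say where the Chinese frequencies come from, so `freq_zh` and its values are hypothetical, as is the tie-breaking rule.

```python
def major_translation(freq_en, links, freq_zh):
    """Map each English word to its single most frequent Chinese translation."""
    transformed = {}
    for e, f in freq_en.items():
        translations = [c for e2, c in links if e2 == e]
        if not translations:
            continue  # no dictionary link for this word
        # Keep only the link to the most frequent translation (ties broken by string order).
        best = max(translations, key=lambda c: (freq_zh.get(c, 0), c))
        transformed[best] = transformed.get(best, 0) + f
    return transformed


mt_links = [("internet", "因特网"), ("internet", "网络"), ("network", "网络")]
mt_vector = major_translation(
    {"internet": 10, "network": 4}, mt_links, {"因特网": 500, "网络": 900}
)
```

In this toy case all of the frequency of 'internet' goes to '网络' and nothing reaches '因特网', which is the kind of information loss the EM-based transformation is meant to avoid.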
In order to further boost the performance of translation, we propose to also use the translation method proposed in Nagata et al. (2001).
Specifically, we combine our method with that of Nagata et al. by using a back-off strategy.
1. Input 'information asymmetry';
2. Search the English Base NP on web sites in Chinese and obtain documents containing it (i.e., using partial parallel corpora);
3. Find the most frequently occurring Chinese phrases immediately before the brackets containing the English Base NP, using a suffix tree;
4. Output the Chinese phrases and their document frequencies.
Figure 6. Nagata et al.'s method
Figure 6 illustrates the process of collecting Chinese translation candidates for an English Base NP 'information asymmetry' with Nagata et al.'s method.
In the combination of the two methods, we first use Nagata et al.'s method to perform translation; if we cannot find translations, we next use our method.
We will denote this strategy 'Back-off'.
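The Back-off strategy amounts to trying one translator and falling through to the other when it returns nothing. In the sketch below, `toy_partial_parallel` and `toy_compositional` are hypothetical stand-ins for the two methods, and the ASCII placeholder strings stand in for Chinese translations.

```python
def back_off(base_np, primary, fallback):
    """Use `primary` first; if it finds no translation, use `fallback`."""
    translations = primary(base_np)
    return translations if translations else fallback(base_np)


def toy_partial_parallel(base_np):
    """Hypothetical stand-in for the partial-parallel-corpus method."""
    return {"information asymmetry": ["ZH_PHRASE_1"]}.get(base_np, [])


def toy_compositional(base_np):
    """Hypothetical stand-in for candidate collection plus translation selection."""
    return {"information age": ["ZH_PHRASE_2"]}.get(base_np, [])


first = back_off("information asymmetry", toy_partial_parallel, toy_compositional)
second = back_off("information age", toy_partial_parallel, toy_compositional)
```

The ordering reflects the trade-off reported in the experiments: the first method is more accurate when it applies, and the second supplies coverage.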
Experimental Results
We conducted experiments on translation of the Base NPs from English to Chinese.
3000 Base NPs were extracted.
In the experiments, we used the HIT English-Chinese word translation dictionary (see footnote 2).
The dictionary contains about 76000 Chinese words, 60000 English words, and 118000 translation links.
As a web search engine, we used Google (http://www.google.com).
Five translation experts evaluated the translation results by judging whether or not they were acceptable.
The evaluations reported below are all based on their judgements.
EM-NBC-Ensemble and EM-TF-IDF.
Table 1. Best translation result for each method (rows: EM-NBC-Ensemble, MT-NBC-Ensemble, EM-KL-Ensemble, EM-TF-IDF, MT-TF-IDF)
Table 1 shows the results in terms of coverage and top n accuracy.
Here, coverage is defined as the percentage of phrases which have translations selected, while top n accuracy is defined as the percentage of phrases whose selected top n translations include correct translations.
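The two measures can be computed as in the sketch below. It assumes that both are taken over all test phrases (the text does not state the denominator of top n accuracy explicitly), and the phrase and translation labels are placeholders.

```python
def coverage_and_top_n(outputs, gold, n):
    """Return (coverage, top-n accuracy) over all phrases in `gold`.

    outputs: phrase -> ranked list of selected translations (possibly empty)
    gold:    phrase -> set of translations judged acceptable
    """
    total = len(gold)
    covered = sum(1 for p in gold if outputs.get(p))
    correct = sum(1 for p in gold if set(outputs.get(p, [])[:n]) & gold[p])
    return covered / total, correct / total


toy_outputs = {"a": ["x", "y"], "b": ["z"], "c": []}
toy_gold = {"a": {"y"}, "b": {"w"}, "c": {"v"}}
cov, top1 = coverage_and_top_n(toy_outputs, toy_gold, n=1)
_, top2 = coverage_and_top_n(toy_outputs, toy_gold, n=2)
```

In the toy data, two of three phrases receive a translation, none is correct at rank 1, and one becomes correct once the second-ranked output is also counted.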
For EM-NBC-Ensemble, we set the $\alpha$ in Equation (4) to 5 on the basis of our preliminary experimental results.
For EM-TF-IDF, we used the non-web data described in Section 4.4 to estimate idf values of words.
We used contexts with window sizes of ±1, ±3, ±5, ±7, ±9, ±11.
1 http://encarta.msn.com/Default.asp
2 The dictionary is created by the Harbin Institute of Technology.
Figure 7. Translation results
Figure 7 shows the results of EM-NBC-Ensemble and EM-TF-IDF, where, for EM-NBC-Ensemble, 'window size' denotes the largest window size within an ensemble.
Table 1 summarizes the best results for each of them.
'Prior' and 'MT-TF-IDF' are actually baseline methods relying on the existing technologies.
In Prior, we select the candidates whose prior probabilities are the largest or, equivalently, whose document frequencies obtained in translation candidate collection are the largest.
In MT-TF-IDF, we use TF-IDF vectors transformed with Major Translation.
Our experimental results indicate that both EM-NBC-Ensemble and EM-TF-IDF significantly outperform Prior and MT-TF-IDF when appropriate window sizes are chosen.
The p-values of the sign tests are 0.00056 and 0.00133 for EM-NBC-Ensemble, 0.00002 and 0.00901 for EM-TF-IDF, respectively.
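For reference, an exact sign test over paired outcomes can be computed from the binomial distribution. The sketch below is two-sided and assumes that ties are discarded beforehand; the text does not say which variant was used, and the counts in the example are made up rather than taken from the experiments.

```python
from math import comb


def sign_test_p(wins, losses):
    """Two-sided exact sign test p-value under the null hypothesis P(win) = 0.5."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of an outcome at least as extreme as k in one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = sign_test_p(wins=9, losses=1)
```

Here `wins` would count the phrases translated correctly by one method but not the other, and `losses` the reverse.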
We next removed each of the key components of EM-NBC-Ensemble and used the remaining components as a variant of it to perform translation selection.
The key components are (1) distance calculation by KL divergence, (2) EM, (3) prior probability, and (4) ensemble.
The variants, thus, respectively make use of (1) the baseline method 'Prior', (2) an ensemble of Naïve Bayesian Classifiers based on Major Translation (MT-NBC-Ensemble), (3) an ensemble of EM-based KL divergence calculations (EM-KL-Ensemble), and (4) a single EM-NBC.
Figure 7 and Table 1 show the results.
We see that EM-NBC-Ensemble outperforms all of the variants, indicating that all the components within EM-NBC-Ensemble play positive roles.
We removed each of the key components of EM-TF-IDF and used the remaining components as a variant of it to perform translation selection.
The key components are (1) idf value and (2) EM.
The variants, thus, respectively make use of (1) EM-based frequency vectors (EM-TF), (2) the baseline method MT-TF-IDF.
Figure 7 and Table 1 show the results.
We see that EM-TF-IDF outperforms both variants, indicating that all of the components within EM-TF-IDF are needed.
Comparing the results between MT-NBC-Ensemble and EM-NBC-Ensemble and the results between MT-TF-IDF and EM-TF-IDF, we see that the use of the EM Algorithm can indeed help to improve translation accuracies.
Table 2. Sample of translation outputs (Base NPs: calcium ion, adventure tale, lung cancer, aircraft carrier, adult literacy; column: Translation)
Table 2 shows translations of five Base NPs as output by EM-NBC-Ensemble, in which the translations marked with * were judged incorrect by human experts.
We analyzed the reasons for incorrect translations and found that they were due to (1) missing dictionary entries (19%), (2) non-compositional translations (13%), and (3) ranking errors (68%).
Table 3. Rows: Our Method, Nagata et al.'s
We next used Nagata et al.'s method to perform translation.
From Table 3, we can see that the accuracy of Nagata et al.'s method is higher than that of our method, but its coverage is lower.
The results indicate that our proposed Back-off strategy for translation is justifiable.
Table 4. Row: Back-off (Ensemble)
In the experiment, we tested the Back-off strategy; Table 4 shows the results.
The Back-off strategy helps to further improve the results whether EM-NBC-Ensemble or EM-TF-IDF is used.
To test the effectiveness of the use of web data, we conducted another experiment in which we performed translation by using non-web data.
The data comprised the Wall Street Journal corpus in English (1987-1992, 500MB) and the People's Daily corpus in Chinese (1982-1998, 700MB).
We followed the Back-off strategy as in Section 4.3 to translate the 1000 Base NPs.
Table 5. Translation results (column: Coverage; rows: Web (EM-NBC-Ensemble), Non-web (EM-NBC-Ensemble))
The results in Table 5 show that using web data yields better results than using non-web data, although the non-web corpora we used were considerably large in practice.
For Nagata et al's method, we found that it was almost impossible to find partial-parallel corpora in the non-web data.
Conclusions
This paper has proposed a new and effective method for Base NP translation by using web data and the EM Algorithm.
Experimental results show that it outperforms the baseline methods based on existing techniques, mainly due to the employment of EM.
Experimental results also show that using web data is more effective than using non-web corpus data.
Future work includes further applying the proposed method to the translation of other types of Base NPs and between other language pairs.
Acknowledgements
We thank Ming Zhou, Chang-Ning Huang, Jianfeng Gao, and Ashley Chang for many helpful discussions on this research project.
We also acknowledge Shenjie Li for help with program coding.
