This paper proposes a method for learning and extracting word sequence correspondences from non-aligned parallel corpora with Support Vector Machines, which generalize well, rarely overfit the training samples, and can learn dependencies between features by using a kernel function.
Our method uses features for the translation model that are based on a translation dictionary, the number of words, parts of speech, constituent words, and neighbor words.
Experiments on Japanese and English parallel corpora achieved a precision rate of 81.1 % and a recall rate of 69.0 % for the extracted word sequence correspondences.
1 Introduction
Translation dictionaries used in multilingual natural language processing, such as machine translation, have been built manually, but this work requires a great deal of labor and it is difficult to keep the dictionary entries consistent.
Therefore, research on automatically extracting translation pairs from parallel corpora has recently become active (Gale and Church, 1991; Kaji and Aizono, 1996; Tanaka and Iwasaki, 1996; Kitamura and Matsumoto, 1996; Fung, 1997; Melamed, 1997; Sato and Nakanishi, 1998).
This paper proposes a method for learning and extracting bilingual word sequence correspondences from non-aligned parallel corpora with Support Vector Machines (SVMs) (Vapnik, 1999).
SVMs are large margin classifiers (Smola et al., 2000) based on the strategy of maximizing the margin between the separating boundary and the vectors whose elements express the features of the training samples.
Therefore, SVMs generalize better than other learning models such as decision trees and rarely overfit the training samples.
In addition, by using kernel functions, they can learn non-linear separating boundaries and dependencies between the features.
For these reasons, SVMs have recently been applied to natural language processing tasks such as text categorization (Joachims, 1998; Taira and Haruno, 1999), chunk identification (Kudo and Matsumoto, 2000b), and dependency structure analysis (Kudo and Matsumoto, 2000a).
The method proposed in this paper does not require aligned parallel corpora, of which few exist at present.
Therefore, word sequence correspondences can be extracted without limiting the applicable domains.
2 Support Vector Machines
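SVMs are binary classifiers that linearly separate d-dimensional vectors into two classes.
Each vector represents a sample that has d features.
Whether a given sample $X = (x_1, x_2, \ldots, x_d)$ belongs to $X_1$ or $X_2$ is decided by equation (1):
$$f(X) = \operatorname{sign}(g(X)) = \begin{cases} +1 & (X \in X_1) \\ -1 & (X \in X_2) \end{cases}, \qquad g(X) = W \cdot X + b \tag{1}$$
where $g(X) = 0$ is the hyperplane that separates the two classes, and $W$ and $b$ are determined by optimization.
Let the supervised signals for the training samples $X_i$ $(i = 1, \ldots, n)$ be expressed as
$$y_i = \begin{cases} +1 & (X_i \in X_1) \\ -1 & (X_i \in X_2) \end{cases} \tag{2}$$
where $X_1$ is the set of positive samples and $X_2$ is the set of negative samples.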
If the training samples can be separated linearly, two or more pairs of $W$ and $b$ may satisfy equation (1).
Therefore, we impose the following constraints:
$$y_i (W \cdot X_i + b) - 1 \ge 0 \quad (i = 1, \ldots, n) \tag{3}$$
Figure 1: A separating hyperplane
Figure 1 shows a hyperplane that separates the samples.
In this figure, the solid line shows the separating hyperplane $W \cdot X + b = 0$ and the two dotted lines show the hyperplanes $W \cdot X + b = \pm 1$.
Constraints (3) mean that no vector may lie between the two dotted lines.
The vectors on the dotted lines are called support vectors, and the distance between the dotted lines is called the margin, which equals $2/\|W\|$.
The learning algorithm for SVMs finds the $W$ and $b$ that maximize the margin $2/\|W\|$, or equivalently minimize $\|W\|^2/2$, subject to constraints (3).
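As a small numerical illustration of constraints (3) and of the margin, the following Python sketch checks two candidate hyperplanes on toy data. The data points, the two hyperplanes, and the function names `margin` and `satisfies_constraints` are assumptions made for this illustration only; no optimization is performed.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def margin(w):
    """Width of the margin, 2 / ||W||."""
    return 2.0 / math.sqrt(dot(w, w))

def satisfies_constraints(w, b, samples, labels, tol=1e-9):
    """True if every sample satisfies y_i (W . X_i + b) - 1 >= 0."""
    return all(y * (dot(w, x) + b) - 1 >= -tol
               for x, y in zip(samples, labels))

# Hypothetical toy data: two positive and two negative points in the plane.
samples = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [+1, +1, -1, -1]

# Two hyperplanes that both separate the data and satisfy the constraints.
w_a, b_a = (0.5, 0.5), -1.0
w_b, b_b = (1.0, 1.0), -3.0

for name, w, b in (("a", w_a, b_a), ("b", w_b, b_b)):
    ok = satisfies_constraints(w, b, samples, labels)
    print(name, "feasible:", ok, "margin:", round(margin(w), 3))
```

Both hyperplanes are feasible, but the first has the larger margin, so it is the one that the margin maximization would prefer on this toy data.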
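By the method of Lagrange multipliers, the optimization problem is transformed into minimizing the Lagrangian $L$ with respect to $W$ and $b$:
$$L(W, b, \lambda) = \frac{1}{2}\|W\|^2 - \sum_{i=1}^{n} \lambda_i \left( y_i (W \cdot X_i + b) - 1 \right)$$
where $\lambda_i \ge 0$ are the Lagrange multipliers.
Setting the partial derivatives of $L$ with respect to $W$ and $b$ to zero gives
$$W = \sum_{i=1}^{n} \lambda_i y_i X_i \tag{5}$$
and $\sum_{i=1}^{n} \lambda_i y_i = 0$.
Consequently, the optimization problem is transformed into maximizing the objective function
$$D = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \, X_i \cdot X_j$$
subject to $\sum_{i=1}^{n} \lambda_i y_i = 0$ and $\lambda_i \ge 0$.
For the optimal parameters $\lambda^* = \arg\max_{\lambda} D$, each training sample $X_i$ with $\lambda_i^* > 0$ is a support vector.
$W$ can be obtained from equation (5), and $b$ can be obtained from
$$b = y_s - W \cdot X_s$$
where $X_s$ is an arbitrary support vector and $y_s$ is its supervised signal.
From equations (2) and (5), the optimal hyperplane can be expressed with the optimal parameters as
$$g(X) = \sum_{i=1}^{n} \lambda_i^* y_i \, X_i \cdot X + b^* \tag{8}$$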
The training samples can be allowed to enter the margin to some degree by changing equation (3) to:
$$y_i (W \cdot X_i + b) - 1 + \xi_i \ge 0$$
where $\xi_i \ge 0$ are called slack variables.
In this case, the maximal margin problem is extended to minimizing $\|W\|^2/2 + C \sum_{i=1}^{n} \xi_i$, where $C$ expresses the weight of errors.
As a result, the problem is to maximize the objective function $D$ subject to $\sum_{i=1}^{n} \lambda_i y_i = 0$ and $0 \le \lambda_i \le C$.
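Training samples that cannot be separated linearly might be separated linearly in a higher-dimensional space by mapping them with a nonlinear function:
$$\phi : R^d \to R^{d'}$$
A linear separation in $R^{d'}$ for $\phi(X)$ is the same as a nonlinear separation in $R^d$ for $X$.
Let $\phi$ satisfy
$$K(X, X') = \phi(X) \cdot \phi(X')$$
where $K(X, X')$ is called a kernel function.
As a result, the objective function is rewritten as
$$D = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \, K(X_i, X_j) \tag{11}$$
and the optimal hyperplane is rewritten as
$$g(X) = \sum_{i=1}^{n} \lambda_i^* y_i \, K(X_i, X) + b^* \tag{12}$$
Note that $\phi$ does not appear in equations (11) and (12).
Therefore, we need not compute $\phi$ in the higher-dimensional space.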
The well-known kernel functions are the polynomial kernel function (13) and the Gaussian kernel function (14).
A non-linear separation using one of these kernel functions corresponds to a separation that takes the dependencies between the features in $R^d$ into account.
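To illustrate how a kernel accounts for dependencies between features, the following Python sketch assumes the commonly used forms K(X, X') = (X · X' + 1)^p for the polynomial kernel and K(X, X') = exp(-||X - X'||^2 / (2σ^2)) for the Gaussian kernel; the exact parameterization behind equations (13) and (14) is an assumption here. The helper `explicit_map_degree2` is a hypothetical name for an explicit mapping whose inner product equals the polynomial kernel with p = 2.

```python
import math
from itertools import combinations

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, z, p=2):
    """Assumed form of the polynomial kernel: (x . z + 1)^p."""
    return (dot(x, z) + 1.0) ** p

def gaussian_kernel(x, z, sigma=1.0):
    """Assumed form of the Gaussian kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def explicit_map_degree2(x):
    """Explicit mapping phi with phi(x) . phi(z) == (x . z + 1)^2."""
    r2 = math.sqrt(2.0)
    mapped = [1.0]                                  # constant term
    mapped += [r2 * a for a in x]                   # single features
    mapped += [a * a for a in x]                    # squared features
    mapped += [r2 * x[i] * x[j]                     # products of feature pairs
               for i, j in combinations(range(len(x)), 2)]
    return mapped

x = (1.0, 2.0, 3.0)
z = (0.5, -1.0, 2.0)
print(polynomial_kernel(x, z))
print(dot(explicit_map_degree2(x), explicit_map_degree2(z)))
```

The two printed values agree up to rounding: the kernel implicitly works with the products of feature pairs without ever building the mapped vectors, which is what makes a kernel practical when the input has many dimensions.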
3 Extracting Word Sequence Correspondences with SVMs
The method proposed in this paper obtains word sequence correspondences (translation pairs) from parallel corpora that consist of Japanese and English sentences.
It consists of the following three steps:
1. Manually make training samples from the training corpora, with translation pairs as positive samples and non-translation pairs as negative samples, and learn a translation model from them with SVMs.
2. Make a set of candidate translation pairs, which are pairs of phrases obtained by parsing both the Japanese and the English sentences.
3. Extract translation pairs from the candidates by inputting them to the translation model made in step 1.
3.2 Features for the Translation Model
To apply SVMs to the extraction of translation pairs, the candidate translation pairs must be converted into feature vectors.
In our method, the feature vectors are composed of the following features:
1. Features which use an existing translation dictionary.
(a) Bilingual word pairs in the translation dictionary that are included in the candidate translation pair.
(b) Bilingual word pairs in the translation dictionary that co-occur in the context in which the candidate appears.
2. Features which use the number of words.
(a) The number of words in the Japanese phrase.
(b) The number of words in the English phrase.
3. Features which use the part of speech.
(a) The ratios of nouns, verbs, adjectives, and adverbs in the Japanese phrase.
(b) The ratios of nouns, verbs, adjectives, and adverbs in the English phrase.
4. Features which use constituent words.
(a) Constituent words of the Japanese phrase.
(b) Constituent words of the English phrase.
5. Features which use neighbor words.
(a) Neighbor words that appear just before or after the Japanese phrase.
(b) Neighbor words that appear just before or after the English phrase.
Two types of features that use an existing translation dictionary are employed because accuracy can be expected to improve when existing knowledge is used effectively in the features.
For features (1a), the words included in a candidate translation pair are looked up in the translation dictionary, and the bilingual word pairs found in the candidate become features.
They are based on the idea that a translation pair is likely to include many bilingual word pairs.
Each bilingual word pair in the dictionary is allocated one dimension of the feature vector.
If a bilingual word pair appears in the candidate translation pair, the value of the corresponding dimension is set to 1; otherwise it is set to 0.
For features (1b), all pairs of words that co-occur with a candidate translation pair are looked up in the translation dictionary, and the bilingual word pairs found in the dictionary become features.
They are based on the idea that, for a translation pair, the contexts formed by the words appearing in the neighborhood resemble each other even though they are expressed in two different languages (Kaji and Aizono, 1996).
The candidates are converted into feature vectors in the same way as for (1a).
Features (2a) and (2b) are based on the idea that the numbers of constituent words of the phrases in the two languages of a translation pair are correlated.
The number of constituent words in each language is used in the feature vector.
Features (3a) and (3b) are based on the idea that the ratios of content words (nouns, verbs, adjectives, and adverbs) in the phrases of the two languages of a translation pair are correlated.
The ratios of the numbers of nouns, verbs, adjectives, and adverbs to the number of words in the phrase of each language are used in the feature vector.
For features (4a) and (4b), each content word (noun, verb, adjective, or adverb) of each language is allocated one dimension of the feature vector.
If a word appears in the candidate translation pair, the value of the corresponding dimension is set to 1; otherwise it is set to 0.
For features (5a) and (5b), each content word (noun, verb, adjective, or adverb) of each language is likewise allocated one dimension of the feature vector.
If a word appears just before or after the candidate translation pair, the value of the corresponding dimension is set to 1; otherwise it is set to 0.
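The conversion described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the vector is stored sparsely as a dictionary (absent keys stand for 0), the restriction of features (4) and (5) to a fixed vocabulary of frequent words is omitted, and the function names, the toy dictionary, and the toy phrases are all hypothetical.

```python
CONTENT_POS = ("noun", "verb", "adjective", "adverb")

def pos_ratios(phrase):
    """Ratio of each content part of speech to the number of words."""
    n = len(phrase) or 1  # avoid division by zero for an empty phrase
    return {pos: sum(1 for _, p in phrase if p == pos) / n
            for pos in CONTENT_POS}

def dictionary_pairs(ja_words, en_words, dictionary):
    """Word pairs (ja, en) from the two lists that are in the dictionary."""
    return [(j, e) for j in ja_words for e in en_words
            if (j, e) in dictionary]

def extract_features(ja, en, dictionary, ja_context=(), en_context=(),
                     ja_neighbors=(), en_neighbors=()):
    """Sparse feature vector for one candidate translation pair.

    ja, en: phrases as lists of (word, part_of_speech) tuples.
    dictionary: set of (japanese_word, english_word) tuples.
    *_context: words co-occurring with the candidate (features 1b).
    *_neighbors: words just before or after each phrase (features 5a, 5b).
    """
    ja_words = [w for w, _ in ja]
    en_words = [w for w, _ in en]
    f = {}
    for pair in dictionary_pairs(ja_words, en_words, dictionary):
        f[("1a",) + pair] = 1.0
    for pair in dictionary_pairs(ja_context, en_context, dictionary):
        f[("1b",) + pair] = 1.0
    f[("2a",)] = float(len(ja))
    f[("2b",)] = float(len(en))
    for pos, ratio in pos_ratios(ja).items():
        f[("3a", pos)] = ratio
    for pos, ratio in pos_ratios(en).items():
        f[("3b", pos)] = ratio
    for w, p in ja:
        if p in CONTENT_POS:
            f[("4a", w)] = 1.0
    for w, p in en:
        if p in CONTENT_POS:
            f[("4b", w)] = 1.0
    for w in ja_neighbors:
        f[("5a", w)] = 1.0
    for w in en_neighbors:
        f[("5b", w)] = 1.0
    return f

# Hypothetical toy candidate, not taken from the corpora.
dictionary = {("特別", "special"), ("委員会", "committee")}
ja = [("特別", "noun"), ("委員会", "noun")]
en = [("special", "adjective"), ("committee", "noun")]
vec = extract_features(ja, en, dictionary,
                       ja_neighbors=["会長"], en_neighbors=["chairman"])
print(len(vec), "non-zero or explicit entries")
```

A sparse representation is a natural choice here because almost all of the dictionary and word dimensions are 0 for any single candidate.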
3.3 Learning the Translation Model
Training samples, consisting of translation pairs as positive samples and non-translation pairs as negative samples, are made manually from the training corpora and converted into feature vectors by the method described in section 3.2.
For the supervised signals $y_i$, each positive sample is assigned +1 and each negative sample is assigned -1.
The translation model is learned from them with the SVMs described in section 2.
As a result, the optimal parameters $\lambda^*$ for the SVMs are obtained.
3.4 Making the Candidates of the Translation Pairs
A set of candidate translation pairs is made from combinations of the phrases obtained by parsing both the Japanese and the English sentences.
Making the combinations does not require sentence alignment between the two languages.
Because the set of all combinations grows too large, the phrases used for the combinations are restricted to noun phrases and verb phrases whose number of constituent words is within an upper bound.
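The candidate generation described above can be sketched in Python as a filtered Cartesian product. The phrase labels `NP` and `VP`, the default bound of fewer than 8 words, the function name `candidate_pairs`, and the toy phrases are assumptions for this illustration; the parsers' actual output formats are not modeled.

```python
from itertools import product

def candidate_pairs(ja_phrases, en_phrases, max_words=8,
                    allowed=("NP", "VP")):
    """Pair every kept Japanese phrase with every kept English phrase.

    Each phrase is a (label, words) tuple. A phrase is kept if its label
    is in `allowed` and it has fewer than `max_words` words. No sentence
    alignment is used: the phrase lists are pooled over the whole corpus.
    """
    def keep(phrases):
        return [tuple(words) for label, words in phrases
                if label in allowed and len(words) < max_words]
    return list(product(keep(ja_phrases), keep(en_phrases)))

# Hypothetical parser output for a few phrases.
ja_phrases = [
    ("NP", ["公式", "の", "挨拶"]),
    ("PP", ["として"]),            # dropped: neither NP nor VP
    ("VP", ["w"] * 9),             # dropped: too many words
]
en_phrases = [
    ("NP", ["an", "official", "farewell"]),
    ("VP", ["officially", "retired"]),
]
pairs = candidate_pairs(ja_phrases, en_phrases)
print(len(pairs))
```

Because every kept phrase in one language is paired with every kept phrase in the other, the number of candidates is the product of the two phrase counts, which is why the restrictions on phrase type and length matter.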
3.5 Extracting the Translation Pairs
The candidate translation pairs are converted into feature vectors by the method described in section 3.2.
By inputting them to equation (8) with the optimal parameters $\lambda^*$ obtained in section 3.3, an output of +1 or -1 is obtained for each vector.
If the output is +1, the candidate corresponding to the input vector is a translation pair; otherwise it is not.
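The classification step can be sketched in Python as the sign of a weighted sum over the support vectors. The tiny model below (two support vectors in the plane with hand-picked multipliers and bias) is hypothetical and only serves to make the sketch runnable; the function names and the convention that a value of exactly 0 is classified as +1 are also assumptions.

```python
def dot(u, v):
    """Inner product, used here as the linear kernel."""
    return sum(a * b for a, b in zip(u, v))

def decision_value(x, support_vectors, labels, lambdas, b, kernel=dot):
    """g(x) = sum_i lambda_i * y_i * K(X_i, x) + b over the support vectors."""
    return sum(lam * y * kernel(sv, x)
               for sv, y, lam in zip(support_vectors, labels, lambdas)) + b

def classify(x, support_vectors, labels, lambdas, b, kernel=dot):
    """+1 (translation pair) if g(x) >= 0, otherwise -1."""
    g = decision_value(x, support_vectors, labels, lambdas, b, kernel)
    return 1 if g >= 0 else -1

# Hypothetical learned model: two support vectors in the plane.
support_vectors = [(2.0, 2.0), (0.0, 0.0)]
labels = [+1, -1]
lambdas = [0.25, 0.25]
b = -1.0

print(classify((3.0, 3.0), support_vectors, labels, lambdas, b))
print(classify((-1.0, 0.0), support_vectors, labels, lambdas, b))
```

Passing a different `kernel` function replaces the inner product with a kernel evaluation, so the same code covers the non-linear case.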
4 Experiments
To confirm the effectiveness of the method described in section 3, we conducted experiments using the English Business Letter Example Collection published by Nihon Keizai Shimbun Inc. as parallel corpora.
The collection consists of Japanese and English example sentences from business letters, in which translation pairs are marked up.
For both the training and the test corpora, 1,000 sentences were used.
The translation pairs already marked up in the corpora were corrected to the form described in section 3.4 and used as the positive samples.
Japanese sentences were parsed by KNP 1 and English sentences were parsed by Apple Pie Parser 2.
The same number of negative samples as positive samples were randomly chosen from the combinations of phrases that were made by parsing and whose number of constituent words was below 8.
As a result, 2,000 samples (1,000 positive and 1,000 negative) were prepared for both training and test.
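The balanced sampling of negative samples can be sketched in Python as follows. Whether candidates that coincide with positive samples were excluded before sampling is not stated in the text, so that step, the function name `sample_negatives`, the fixed random seed, and the toy data are assumptions.

```python
import random

def sample_negatives(candidates, positives, rng):
    """Randomly choose as many non-positive candidates as there are positives."""
    positive_set = set(positives)
    pool = [c for c in candidates if c not in positive_set]
    return rng.sample(pool, len(positives))

# Hypothetical toy data: all combinations of three phrases per language.
candidates = [(j, e) for j in ("j1", "j2", "j3") for e in ("e1", "e2", "e3")]
positives = [("j1", "e1"), ("j2", "e2"), ("j3", "e3")]
negatives = sample_negatives(candidates, positives, random.Random(0))
print(negatives)
```

Sampling without replacement keeps the negative samples distinct, and matching their number to the positives gives the classifier a balanced training set.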
The obtained samples were converted into feature vectors by the method described in section 3.2.
For features (1a) and (1b), the 94,511 bilingual word pairs included in EDICT 3 were prepared.
For features (4a), (4b), (5a), and (5b), the 1,009 Japanese words and 890 English words that appeared more than 3 times in the training corpora were used.
Therefore, the number of dimensions of the feature vectors was 94,511 × 2 + 1 × 2 + 4 × 2 + 1,009 + 890 + 1,009 + 890 = 192,830.
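The dimension count can be checked with a few lines of Python. The grouping of the terms by feature type follows section 3.2; the variable names are chosen for this illustration.

```python
dictionary_pairs = 94_511          # bilingual word pairs in the dictionary
ja_vocab, en_vocab = 1_009, 890    # frequent Japanese and English words

dimensions = {
    "1a, 1b (dictionary pairs)": dictionary_pairs * 2,
    "2a, 2b (number of words)": 1 * 2,
    "3a, 3b (part-of-speech ratios)": 4 * 2,
    "4a, 4b (constituent words)": ja_vocab + en_vocab,
    "5a, 5b (neighbor words)": ja_vocab + en_vocab,
}
total = sum(dimensions.values())
print(total)  # 192830
```

Almost all of the dimensions come from the dictionary features, which is consistent with the use of sparse feature vectors.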
SVMlight 4 was used as the learner and the classifier for SVMs.
For the kernel function, the second-degree polynomial kernel (p = 2 in equation (13)) was used, and the error weight C was set to 0.01.
The translation model was learned from the training samples, and the translation pairs were extracted from the test samples by the method described in section 3.
4 http://svmlight.joachims.org/
Figure 2: Transition in the precision rate and the recall rate when the number of training samples is increased
Table 1 shows the precision rate and the recall rate of the extracted translation pairs, and Table 2 shows examples of the extracted translation pairs.
Table 1: Precision and recall rate
Precision   Recall
81.1 %      69.0 %
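For reference, the two rates can be computed as in the following Python sketch, assuming the usual definitions: precision is the fraction of extracted pairs that are correct, and recall is the fraction of correct pairs in the test data that were extracted. The counts in the usage line are toy numbers, not the counts behind Table 1.

```python
def precision_recall(num_extracted, num_correct, num_gold):
    """Return (precision, recall) from raw counts.

    num_extracted: pairs the model output as translation pairs.
    num_correct:   extracted pairs that are true translation pairs.
    num_gold:      all true translation pairs in the test samples.
    """
    precision = num_correct / num_extracted if num_extracted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    return precision, recall

# Toy counts for illustration only.
precision, recall = precision_recall(num_extracted=10, num_correct=8,
                                     num_gold=16)
print(precision, recall)  # 0.8 0.5
```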
5 Discussion
Figure 2 shows the transition in the precision rate and the recall rate when the number of training samples is increased from 100 to 2,000 in steps of 100.
The recall rate rose with the number of training samples, while the precision rate leveled off from 1,300 samples onward.
This suggests that the recall rate can be improved without lowering the precision rate much by increasing the number of training samples.
Figure 3 shows the transition in the precision rate and the recall rate when the number of bilingual word pairs in the translation dictionary is increased from 0 to 90,000 in steps of 5,000.
The precision rate rose almost linearly with the number of pairs, while the recall rate leveled off from 30,000 pairs onward.
This suggests that the precision rate can be improved without lowering the recall rate much by increasing the number of bilingual word pairs in the translation dictionary.
Figure 3: Transition in the precision rate and the recall rate when the number of bilingual word pairs in the translation dictionary is increased
Table 3 shows the precision rate and the recall rate when each kind of feature described in section 3.2 was removed.
The values in parentheses in the precision and recall columns are the differences from the values obtained when all the features are used.
The fall in the precision rate when the features that use the translation dictionary (1a) (1b) were removed, and the fall in the recall rate when the features that use the number of words (2a) (2b) were removed, were especially large.
It is clear that features (1a) and (1b) constrain the translation model most strongly of all the features.
Therefore, when features (1a) and (1b) were removed, a good translation model could not be learned from the remaining features alone because their constraints are weak; wrong outputs increased and the precision rate fell.
Only features (2a) and (2b) are certain to appear in all samples, whereas some other features that appear in the training samples may not appear in the test samples.
So, in the test samples, features (2a) and (2b) become relatively more important for covering the samples.
Therefore, when features (2a) and (2b) were removed, the recall rate fell because of the low coverage of the samples.
6 Related Works
Unlike our method, some studies assume sentence alignment of the parallel corpora (Gale and Church, 1991; Kitamura and Matsumoto, 1996; Melamed, 1997).
(Gale and Church, 1991) used the $\phi^2$ statistic as the correspondence level of word pairs and showed that it was more effective than mutual information.
Table 2: Examples of translation pairs extracted by our method (English side)
chairman of a special program committee
officially retired as
would like to say an official farewell
my thirty years of experience
sharpen up on my golf
Table 3: Precision rate and recall rate when each kind of feature is removed
(Kitamura and Matsumoto, 1996) used the Dice coefficient (Kay and Röscheisen, 1993), weighted by the logarithm of the frequency of the word pair, as the correspondence level of word pairs.
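As an illustration of a frequency-weighted Dice coefficient of the kind described above, the following Python sketch multiplies the Dice coefficient by the logarithm of the pair frequency. The base-2 logarithm, the function name `weighted_dice`, and the counts in the usage lines are assumptions for this illustration, not details confirmed by the text.

```python
import math

def weighted_dice(f_xy, f_x, f_y):
    """log2(f_xy) * 2 * f_xy / (f_x + f_y); 0.0 if the pair never co-occurs.

    f_xy: number of aligned sentence pairs in which both words occur.
    f_x, f_y: number of occurrences of each word on its own side.
    """
    if f_xy <= 0:
        return 0.0
    return math.log2(f_xy) * 2.0 * f_xy / (f_x + f_y)

frequent_pair = weighted_dice(4, 4, 4)   # always co-occur, four times
single_pair = weighted_dice(1, 1, 1)     # always co-occur, but only once
print(frequent_pair, single_pair)
```

Under this weighting a pair that co-occurs only once receives a score of zero, which discounts correspondences that may be due to chance.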
(Melamed, 1997) proposed the Competitive Linking Algorithm for linking word pairs and a method that calculates the optimized correspondence level of word pairs by hill climbing.
These methods can achieve high accuracy because they assume sentence alignment of the parallel corpora, but their applicable domains are narrow because few sentence-aligned parallel corpora exist at present.
In contrast, because our method does not require sentence alignment, it can be applied to a wider range of domains.
Like our method, other studies do not assume sentence alignment of the parallel corpora (Kaji and Aizono, 1996; Tanaka and Iwasaki, 1996; Fung, 1997).
They are based on the idea that, for a translation pair, the contexts formed by the words appearing in the neighborhood resemble each other even though they are expressed in two different languages.
(Kaji and Aizono, 1996) proposed a correspondence level calculated from the size of the intersection between co-occurrence sets, using the words included in an existing translation dictionary.
(Tanaka and Iwasaki, 1996) proposed a method for obtaining bilingual word pairs by optimizing the matrix of translation probabilities so that the distance between the matrices of word co-occurrence probabilities in the two languages becomes small.
(Fung, 1997) calculated vectors whose elements are the weighted mutual information between a word in the corpora and a word included in an existing translation dictionary, and used their inner products as the correspondence level of word pairs.
These methods and ours share the idea that the contexts formed by the neighboring words of a translation pair resemble each other, because features (1b) are based on the same idea.
However, since our method treats the extraction of translation pairs as a statistical machine learning problem, its performance can be expected to improve by adding new features to the translation model.
In addition, once the translation model has been learned from the training samples with our method, it need not be learned again for new samples, although positive and negative samples are needed as training data.
In contrast, the methods introduced above must learn a new model for each new corpus.
(Sato and Nakanishi, 1998) proposed a method for learning a probabilistic translation model with Maximum Entropy (ME) modeling, which, like SVMs, is a statistical machine learning approach; co-occurrence information and morphological information were used as features, and 58.25 % accuracy was achieved with 4,119 features.
ME modeling is similar to SVMs in that features are used for learning a model, but feature selection for ME modeling is more difficult because ME modeling overfits the training samples more easily than SVMs.
In addition, ME modeling cannot learn dependencies between features, whereas SVMs can learn them automatically by using a kernel function.
Therefore, SVMs can learn a more complex and effective model than ME modeling.
7 Conclusion
In this paper, we proposed a method for learning and extracting bilingual word sequence correspondences from non-aligned parallel corpora with SVMs.
Our method used features for the translation model that are based on a translation dictionary, the number of words, parts of speech, constituent words, and neighbor words.
Experiments on Japanese and English parallel corpora achieved a precision rate of 81.1 % and a recall rate of 69.0 % for the extracted translation pairs.
This demonstrates that our method can reduce the cost of making translation dictionaries.
Acknowledgments
We would like to thank Nihon Keizai Shimbun Inc. for granting us permission to use the English Business Letter Example Collection for research.
