Unknown words are a well-known hindrance to natural language applications.
In particular, they drastically impact machine translation quality.
A simple workaround that commercial translation systems usually offer their users is the possibility of adding unknown words and their translations to a dedicated lexicon.
Recently, Stroppa and Yvon (2005) have shown how analogical learning alone deals nicely with morphology in different languages.
In this study, we show that analogical learning also offers an elegant and effective solution to the problem of identifying potential translations of unknown words.
1 Introduction
Analogical reasoning has received some attention in cognitive science and artificial intelligence (Gentner et al., 2001).
It has long been a faculty assessed in the so-called SAT Reasoning tests used in the application process to colleges and universities in the United States.
Turney (2006) has shown that it is possible to compute relational similarities in a corpus in order to solve 56% of the analogy questions typically asked in SAT exams.
The interested reader can find in (Lepage, 2003) a particularly dense treatment of analogy, including a fascinating chapter on the history of the notion of analogy.
The concept of proportional analogy, denoted [A : B = C : D], is a relation between four entities which reads: "A is to B as C is to D".
Among proportional analogies, we distinguish formal analogies, that is, ones that arise at the graphical level, such as [fournit : fleurit = fournie : fleurie] in French or [believer : unbelievable = doer : undoable] in English.
Formal analogies are often good indices for deeper analogies (Stroppa and Yvon, 2005).
Lepage and Denoual (2005) presented the system ALEPH, an intriguing example-based system entirely built on top of an automatic formal analogy solver.
This system has achieved state-of-the-art performance on the IWSLT task (Eck and Hori, 2005), despite its striking purity.
As a matter of fact, ALEPH requires no distances between examples, nor any threshold.1 It does not even rely on a tokenization device.
One reason for its success probably lies in the specificity of the BTEC corpus: short and simple sentences of a narrow domain.
It is doubtful that ALEPH would still behave adequately on broader tasks, such as translating news articles.
Stroppa and Yvon (2005) propose a very helpful algebraic description of a formal analogy and describe the theoretical foundations of analogical learning which we will recap shortly.
They show both its elegance and efficiency on two morphological analysis tasks for three different languages.
Recently, Moreau et al. (2007) showed that formal analogies of a simple kind (those involving suffixation and/or prefixation) offer an effective way to extend queries for improved information retrieval.
In this study, we show that analogical learning can be used as an effective method for translating unknown words or phrases.
We found that our approach has the potential to propose a valid translation for 80% of ordinary unknown words, that is, words that are not proper names, compound words, or numerical expressions.
Specific solutions have been proposed for those token types (Chen et al., 1998; Al-Onaizan and Knight, 2002; Koehn and Knight, 2003).
1Some heuristics are applied for speeding up the system.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 877-886, Prague, June 2007.
©2007 Association for Computational Linguistics
The paper is organized as follows.
We first recall in Section 2 the principle of analogical learning and describe how it can be applied to the task of enriching a bilingual lexicon.
In Section 3, we present the corpora we used in our experiments.
We evaluate our approach over two translation tasks in Section 4.
We discuss related work in Section 5 and give perspectives of our work in Section 6.
2 Analogical Learning
2.1 Principle
Our approach to bilingual lexical enrichment is an instance of analogical learning described in (Stroppa and Yvon, 2005).
A learning set $L = \{L_1, \ldots, L_N\}$ gathers $N$ observations.
A set of features computed on an incomplete observation X defines an input space.
The inference task consists in predicting the missing features which belong to an output space.
We denote by I(X) (resp. O(X)) the projection of X into the input (resp. output) space.
The inference procedure involves three steps:
1. Building $\mathcal{E}_I(X) = \{(A, B, C) \in L^3 \mid [I(A) : I(B) = I(C) : I(X)]\}$, the set of input stems2 of X, that is, the set of triplets (A, B, C) which form with X an analogical equation.
2. Building $\mathcal{E}_O(X) = \{Y \mid [O(A) : O(B) = O(C) : Y], \forall (A, B, C) \in \mathcal{E}_I(X)\}$, the set of solutions to the analogical equations obtained by projecting the stems of $\mathcal{E}_I(X)$ into the output space.
3. Selecting O(X) among the elements of $\mathcal{E}_O(X)$.
This inference procedure shares similarities with the k-nearest-neighbor (k-NN) approach.
In particular, since no model of the training material is being learned, the training corpus needs to be stored in order to be queried.
In contrast to k-NN, however, the search for closest neighbors does not require any distance, but instead relies on relational similarities.
This purity has a cost: while in k-NN inference, neighbors can be found in time linear in the training size, in analogical learning, this operation requires a computation time cubic in N, the number of observations.
2In Turney's work (Turney, 2006), a stem designates the first two words of a proportional analogy.
In many applications of interest, including the one we tackle here, this is simply impractical and heuristics must be applied.
The first and second steps of the inference procedure rely on the existence of an analogical solver, which we sketch in the next section.
One important thing to note at this stage is that an analogical equation may have several solutions, some being legitimate word-forms in a given language, others not.
Thus, it is important to select the generated solutions wisely, hence Step 3.
In practice, the inference procedure involves the computation of many analogical equations, and a statistic as simple as the frequency of a solution often suffices to separate good from spurious solutions.
2.2 Analogical Solver
Lepage (1998) proposed an algorithm for computing the solutions of a formal analogical equation [A : B = C : ?].
We implemented a variant of this algorithm which requires computing two edit-distance tables, one between A and B and one between A and C. Since we are looking for subsequences of B and C not present in A, the insertion cost is zero.
Once this is done, the algorithm synchronizes the alignments defined by the paths of minimum cost in each table.
Intuitively, the synchronization of two alignments (one between A and B, and one between A and C) consists in composing in the correct order subsequences of the strings B and C that are not in A. We refer the reader to (Lepage, 1998) for the intricacies of this process which is illustrated in Figure 1 for the analogical equation [even : usual = unevenly : ?].
In this example, there are 681 different paths that align even and usual (with a cost of 4), and 1 path which aligns even with unevenly (with a cost of 0).
This results in 681 synchronizations which generate 15 different solutions, among which only unusually is a legitimate word-form.
Figure 1: The top table reports the edit-distance tables computed between even and usual (left part), and even and unevenly (right part).
The bottom part of the figure shows 2 of the 681 synchronizations computed while solving the equation [even : usual = unevenly : ?].
The first one corresponds to the path marked in bold italics and leads to a spurious solution; the second leads to a legitimate solution and corresponds to the path shown as squares.
and the second one corresponds to the maximum time needed to synchronize them.
|X| denotes the length, counted in characters, of the string X, whilst ins(B, C) stands for the number of characters of B and C not belonging to A. Given the typical length of the strings we consider in this study, our solver is quite efficient.3
Stroppa and Yvon (2005) described a generalization of this algorithm which can solve a formal analogical equation by composing two finite-state transducers.
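The edit-distance tables on which the solver relies can be illustrated with a short sketch. The Python code below is only an illustration under stated assumptions, not the implementation used in this study: the text says that insertion is free, and we additionally assume unit deletion and substitution costs. The names `edit_table` and `alignment_cost` are hypothetical. Under these assumptions, the sketch recovers the two alignment costs quoted above for the equation [even : usual = unevenly : ?].

```python
def edit_table(a: str, b: str) -> list[list[int]]:
    """Edit-distance table between a and b where inserting characters of b is free.

    Assumption (not stated in the text): deleting or substituting a character
    of a costs 1, so table[i][j] counts the characters of a[:i] that could not
    be matched, in order, against b[:j].
    """
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        table[i][0] = i  # with nothing to match against, a[:i] is fully deleted
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            match_or_substitute = table[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            delete = table[i - 1][j] + 1
            insert = table[i][j - 1]  # insertion cost is zero
            table[i][j] = min(match_or_substitute, delete, insert)
    return table


def alignment_cost(a: str, b: str) -> int:
    """Cost of the cheapest alignment, read from the last cell of the table."""
    return edit_table(a, b)[-1][-1]


print(alignment_cost("even", "usual"))     # 4
print(alignment_cost("even", "unevenly"))  # 0
```

The synchronization of the minimum-cost paths of the two tables, which is the core of the solver, is deliberately left out of this sketch.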
2.3 Application to Lexical Enrichment
Analogical inference can be applied to the task of extending an existing bilingual lexicon (or transfer table) with new entries.
In this study, we focus on a particular enrichment task: the one of translating valid words or phrases that were not encountered at training time.
A simple example of how our approach translates unknown words is illustrated in Figure 2 for the (unknown) French word futilité.
source (French) stems: [activités : activité = futilités : futilité]; [hostilités : hostilité = futilités : futilité]
projection by lexicon look-up: activités ↔ actions; hostilités ↔ hostilities; futilités ↔ trivialities, gimmicks; hostilité ↔ hostility; activité ↔ action
target (English) resolution
selection of target candidates: (triviality, 2), (gimmick, 1), ...
Figure 2: Illustration of the analogical inference procedure applied to the translation of the unknown French word futilité.
In this example, translations are inferred by commuting plural and singular words.
The inference process lazily captures the fact that English plural nouns ending in -ies usually correspond to singular nouns ending in -y.
Given an unknown source word-form S, Step 1 of the inference process consists in identifying source stems which have S as a solution:4
$\mathcal{E}_I(S) = \{\langle i, j, k \rangle \in [1, N]^3 \mid [S_i : S_j = S_k : S]\}$.
During Step 2a, each source stem belonging to $\mathcal{E}_I(S)$ is projected form by form into (potentially several) stems in the output space, thanks to an operator proj that will be defined shortly:
$(U, V, W) \in (\mathrm{proj}_L(S_i) \times \mathrm{proj}_L(S_j) \times \mathrm{proj}_L(S_k))$.
3Several thousands of equations solved within one second.
4All strings in a stem must be different, otherwise, it can be shown that all source words would be considered.
During Step 2b, each solution to those output stems is collected in $\mathcal{E}_O(S)$ along with its associated frequency:
$\mathcal{E}_O(S) = \bigcup_{\langle i, j, k \rangle \in \mathcal{E}_I(S)} \mathcal{E}_{\langle i, j, k \rangle}(S)$,
where $\mathcal{E}_{\langle i, j, k \rangle}(S)$ gathers the solutions to the output stems obtained from the source stem $\langle i, j, k \rangle$.
Step 3 selects from $\mathcal{E}_O(S)$ one or several solutions.
We use frequency as the criterion to sort the generated solutions.
The projection mechanism we resort to in this study is simply a lexicon look-up:
$\mathrm{proj}_L(S) = \{T \mid (S, T) \in L\}$.
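The three steps above can be sketched in a few lines of Python. This is a minimal illustration under strong assumptions, not the system evaluated in this paper: `solve_suffix` is a hypothetical toy solver that only handles suffix commutation (a stand-in for a full formal analogy solver), the search over stems is the naive cubic one, and the lexicon is a made-up example loosely inspired by Figure 2 (the entries for utilités and utilité are invented for the illustration).

```python
from collections import Counter
from itertools import permutations, product


def solve_suffix(a: str, b: str, c: str) -> set[str]:
    """Toy stand-in for a formal analogy solver: suffix commutation only.

    Solves [a : b = c : ?] when a = p + s1, b = p + s2 and c ends with s1.
    """
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    s1, s2 = a[n:], b[n:]
    if c.endswith(s1):
        return {c[: len(c) - len(s1)] + s2}
    return set()


def translate(unknown: str, lexicon: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Rank candidate translations of `unknown` by how often they are generated."""
    counts = Counter()
    # Step 1: source stems (Si, Sj, Sk), all different, with [Si : Sj = Sk : unknown].
    for si, sj, sk in permutations(lexicon, 3):
        if unknown not in solve_suffix(si, sj, sk):
            continue
        # Step 2a: project the stem form by form through the lexicon.
        for u, v, w in product(lexicon[si], lexicon[sj], lexicon[sk]):
            # Step 2b: collect every solution of the target equation.
            counts.update(solve_suffix(u, v, w))
    # Step 3: sort the solutions by decreasing frequency.
    return counts.most_common()


# Hypothetical toy lexicon (source form -> set of translations).
lexicon = {
    "activités": {"actions"},
    "activité": {"action"},
    "hostilités": {"hostilities"},
    "hostilité": {"hostility"},
    "utilités": {"utilities"},
    "utilité": {"utility"},
    "futilités": {"trivialities", "gimmicks"},
}

ranked = translate("futilité", lexicon)
print(ranked[0])  # ('triviality', 2)
```

On this toy lexicon the sketch also generates the spurious form trivialitie once, which illustrates why the generated solutions need to be filtered and ranked.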
There are several situations where this inference procedure will introduce noise.
First, both source and target analogical equations can lead to spurious solutions.
For instance, [show : showing = eating : ?] will erroneously produce eatinging.
Second, an error in the original lexicon may also introduce erroneous target word-forms.
For instance, when translating the German word proklamierung by making use of the analogy [formalisiert : formalisierung = proklamiert : proklamierung], the English equation [formalised : formalized = sets : ?] will be considered if it happens that proklamiert ↔ sets belongs to L, in which case zets will be erroneously produced.
We control noise in several ways.
The source word-forms we generate are filtered by imposing that they belong to the input space.
We also use a (large) target vocabulary to eliminate spurious target word-forms (see Section 3).
More importantly, since we consider many analogical equations when translating a word-form, spurious analogical solutions tend to appear less frequently than ones arising from paradigmatic commutations.
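The two filters just described (vocabulary membership and generation frequency) can be sketched as follows. The function name `select_candidates`, the `n_best` parameter and the example data are hypothetical and only serve the illustration.

```python
from collections import Counter


def select_candidates(generated: list[str], vocabulary: set[str],
                      n_best: int = 10) -> list[tuple[str, int]]:
    """Keep generated forms attested in `vocabulary`, ranked by generation frequency."""
    counts = Counter(form for form in generated if form in vocabulary)
    return counts.most_common(n_best)


# Made-up list of forms, one entry per analogical equation that produced it.
generated = ["triviality", "trivialitie", "gimmick", "triviality", "eatinging"]
vocabulary = {"triviality", "gimmick", "eating"}

print(select_candidates(generated, vocabulary))
# [('triviality', 2), ('gimmick', 1)]
```

Counting before or after filtering gives the same ranking here, since the filter only removes entries; filtering first simply keeps the counter small.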
2.4 Practical Considerations
Searching for $\mathcal{E}_I(S)$ is an operation which requires solving a number of (source) analogical equations cubic in the size of the input space.
In many settings of interest, including ours, this is simply not practical.
We therefore resort to two strategies to reduce computation time.
The first one consists in using the analogical equations in a generative mode.
Instead of searching through the set of stems $(S_i, S_j, S_k)$ that have the unknown source word-form S as a solution, we search, for all pairs $(S_i, S_j)$, for the solutions of $[S_j : S_i = S : ?]$ that are valid word-forms.
This leaves us with a quadratic computation time which is still intractable in our case.
Therefore, we apply a second strategy which consists in computing the analogical equations $[S_j : S_i = S : ?]$ only for words $S_i$ and $S_j$ close enough to S. More precisely, we enforce that $S_i \in v_\delta(S)$ and that $S_j$ belongs to a neighborhood of $S_i$, for a neighborhood function $v_\gamma(A)$ of the form:
$v_\gamma(A) = \{B \mid f(B, A) \le \gamma\}$,
where f is a distance; we used the edit-distance in this study (Levenshtein, 1966).
Note that the second strategy we apply is only a heuristic.
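The pruning heuristic can be sketched as follows. The code assumes one possible reading of the strategy, in which the first form of a pair is taken in a neighborhood of the unknown word and the second in a neighborhood of the first, with a single shared threshold `gamma`; both choices, as well as the function names and the toy vocabulary, are assumptions made for the illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance with unit insertion, deletion and substitution costs."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution or match
        previous = current
    return previous[-1]


def neighborhood(word: str, vocabulary: list[str], gamma: int) -> list[str]:
    """Forms of `vocabulary` within edit distance `gamma` of `word`."""
    return [other for other in vocabulary if levenshtein(other, word) <= gamma]


def candidate_pairs(unknown: str, vocabulary: list[str], gamma: int) -> list[tuple[str, str]]:
    """Pairs (Si, Sj) with Si near the unknown form and Sj near Si, all forms distinct."""
    pairs = []
    for si in neighborhood(unknown, vocabulary, gamma):
        for sj in neighborhood(si, vocabulary, gamma):
            if len({si, sj, unknown}) == 3:
                pairs.append((si, sj))
    return pairs


vocabulary = ["futilités", "utilité", "utilités", "parlement"]
print(candidate_pairs("futilité", vocabulary, gamma=1))
# [('futilités', 'utilités'), ('utilité', 'utilités')]
```

Only the pairs returned by `candidate_pairs` would then be handed to the analogical solver, which is how the number of equations to solve is kept manageable. A threshold that is too tight can discard useful pairs, which is the price of the heuristic.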
3 Resources
In this work, we are concerned with one concrete problem a machine translation system must face: the one of translating unknown words.
We are further focusing on the shared task of the workshop on Statistical Machine Translation, which took place last year (Koehn and Monz, 2006) and consisted in translating Spanish, German, and French texts from and to English.
For some reasons, we restricted ourselves to translating only into English.
The training material available comes from the Europarl corpus.
The test material was divided into two parts.5 The first one (hereafter called test-in) is composed of 2 000 sentences from European parliament debates.
The second part (called test-out) gathers 1 064 sentences6 collected from editorials of the Project Syndicate website.7 The main statistics pertinent to our study are summarized in Table 1.
5The participants were not aware of this.
6We removed 30 sentences which had encoding problems.
7http://www.project-syndicate.com
Table 1: Number of different (source) test words not seen at training time, and out-of-vocabulary rate expressed as a percentage (oov%).
words), 7 words are acronyms, and 4 are tokenization problems.
The 238 other words (54%) are ordinary words.
We considered different lexicons for testing our approach.
These lexicons were derived from the training material of the shared task by training two transfer tables (source-to-target and the reverse) with Giza++ (Och and Ney, 2000), using default settings, and intersecting them to remove some noise.
In order to investigate how sensitive our approach is to the amount of training material available, we varied the size of our lexicon LT by considering different portions of the training corpus (t = 5 000, 10 000, 100 000, 200 000, and 500 000 pairs of sentences).
The lexicon trained on the full training material (688 000 pairs of sentences), called Lref hereafter, is used for validation purposes.
We kept (at most) the 20 best associations of each source word in these lexicons.
In practice, because we intersect two models, the average number of translations kept for each source word is lower (see Table 2).
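The construction of a lexicon from two transfer tables can be sketched as follows. The data layout (dictionaries keyed by word pairs), the use of the source-to-target score for ranking, and the name `build_lexicon` are assumptions made for this illustration; the scores in the example are made up.

```python
from collections import defaultdict


def build_lexicon(src2tgt: dict[tuple[str, str], float],
                  tgt2src: dict[tuple[str, str], float],
                  max_translations: int = 20) -> dict[str, list[str]]:
    """Intersect two transfer tables and keep the best entries of each source word.

    src2tgt maps (source, target) pairs to a score and tgt2src maps
    (target, source) pairs to a score. Ranking the surviving entries by
    their source-to-target score is an assumption of this sketch.
    """
    kept = defaultdict(list)
    for (source, target), score in src2tgt.items():
        if (target, source) in tgt2src:  # association found in both directions
            kept[source].append((score, target))
    return {
        source: [target for _, target in sorted(entries, reverse=True)[:max_translations]]
        for source, entries in kept.items()
    }


src2tgt = {("activité", "activity"): 0.7, ("activité", "action"): 0.2,
           ("activité", "the"): 0.1, ("futilités", "gimmicks"): 0.5}
tgt2src = {("activity", "activité"): 0.8, ("action", "activité"): 0.3,
           ("gimmicks", "futilités"): 0.6}

print(build_lexicon(src2tgt, tgt2src))
# {'activité': ['activity', 'action'], 'futilités': ['gimmicks']}
```

The intersection is what makes the average number of translations per source word lower than the cap: the noisy association with "the" survives in one direction only and is dropped.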
Last, from various target texts (English here) we had at our disposal, we collected a vocabulary set V gathering 466 439 words, which we used to filter out spurious word-forms generated by our approach.
4 Experiments
4.1 Translating Unknown Words
For the three translation directions (from Spanish, German, and French into English), we applied the analogical reasoning to translate the (non-numerical) source words of the test material, absent from LT.
Examples of translations produced by analogical inference are reported in Figure 3, sorted by decreasing number of times they have been generated.
Figure 3: Candidate translations inferred from L200000 and their frequency.
The candidates reported are those that have been intersected with V. Translations in bold are clearly erroneous.
We devised two baselines against which we compared our approach (hereafter analog).
The first one, base1, simply proposes as translations the target words in the lexicon LT which are the most similar (in the sense of the edit-distance) to the unknown source word.
Naturally, this approach is only appropriate for pairs of languages that share many cognates (e.g., docteur ↔ doctor).
The second baseline, base2, is more sensible and more closely corresponds to our approach.
We first collect a set of source words that are close-enough (according to the edit-distance) to the unknown word.
Those source words are then projected into the output space by simple bilingual lexicon look-up.
So for instance, the French word demanda will be translated into the English word request if the French word demande is in LT and request is one of its sanctioned translations.
Each of these baselines is tested in two variants.
The first one (id), which allows a direct comparison, proposes as many translations as analog does.
The second one (10) proposes the first 10 translations of each unknown word.
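The two baselines can be sketched as follows. The text does not specify how "close enough" is decided for base2, so this sketch simply ranks source words by edit distance and returns the first `n` translations encountered; this choice, the function names and the toy lexicon are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance with unit costs."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def base1(unknown: str, lexicon: dict[str, list[str]], n: int) -> list[str]:
    """Target words of the lexicon that are closest to the unknown source word."""
    targets = sorted({t for translations in lexicon.values() for t in translations})
    return sorted(targets, key=lambda t: levenshtein(unknown, t))[:n]


def base2(unknown: str, lexicon: dict[str, list[str]], n: int) -> list[str]:
    """Translations of the source words that are closest to the unknown source word."""
    proposals = []
    for source in sorted(lexicon, key=lambda s: levenshtein(unknown, s)):
        for target in lexicon[source]:
            if target not in proposals:
                proposals.append(target)
    return proposals[:n]


# Hypothetical toy lexicon.
lexicon = {"demande": ["request"], "médecin": ["doctor"]}

print(base1("docteur", lexicon, n=1))  # ['doctor']
print(base2("demanda", lexicon, n=1))  # ['request']
```

Setting `n` to 10 corresponds to the second variant, while the first variant would set `n`, word by word, to the number of candidates returned by analog.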
Evaluating the quality of translations requires inspecting lists of words each time we want to test a variant of our approach.
This cumbersome process not only requires understanding the source language, but also happens to be a delicate task in practice.
[Table 2: test-out; analog, base1id, base2id]
Table 2: Performance of the different approaches on the French-to-English direction as a function of the number T of pairs of sentences used for training LT.
A pair [n, t] in lines labeled by unk stands for the number of words to translate, and the average number of their translations in Lref.
We therefore decided to resort to an automatic evaluation procedure which relies on Lref, a bilingual lexicon whose entries are considered correct.
We translated all the words of Lref absent from LT.
We evaluated the different approaches by computing response and precision rates.
The response rate is measured as the percentage of words for which we do have at least one translation produced (correct or not).
The precision is computed in our case as the percentage of words for which at least one translation is sanctioned by Lref.
Note that this way of measuring response and precision is clearly biased toward translation systems that can hypothesize several candidate translations for each word, as statistical systems usually do.
This choice was however guided by the lack of precision we anticipated in the reference, a point we discuss in Section 4.1.3.
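The two rates can be computed as in the sketch below. One point is an assumption: the text does not spell out the denominator of the precision, and we take it to be the number of words that received at least one candidate. The function name and the example data are made up for the illustration.

```python
def response_and_precision(candidates: dict[str, list[str]],
                           reference: dict[str, set[str]]) -> tuple[float, float]:
    """Response and precision rates, in percent, over the words of `reference`.

    Assumption: precision is measured over the words that received at least
    one candidate translation, not over all the words to translate.
    """
    answered = [word for word in reference if candidates.get(word)]
    correct = [word for word in answered if set(candidates[word]) & reference[word]]
    response = 100.0 * len(answered) / len(reference) if reference else 0.0
    precision = 100.0 * len(correct) / len(answered) if answered else 0.0
    return response, precision


# Made-up reference lexicon and system output.
reference = {
    "contournant": {"skirting", "bypassing"},
    "concitoyen": {"fellow-citizens"},
    "futilité": {"triviality"},
    "bush": {"bush"},
}
candidates = {
    "contournant": ["bypassing", "circumventing"],
    "concitoyen": ["fellow"],
    "futilité": ["triviality"],
}

response, precision = response_and_precision(candidates, reference)
print(response)  # 75.0
```

In this example two of the three answered words have a sanctioned candidate, so the precision is two thirds, even though fellow is arguably an acceptable rendering of concitoyen; this is the kind of reference gap that motivates a manual evaluation.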
The figures for the French-to-English direction are reported in Table 2.
We observe that the ratio of unknown words that get a translation by analog is clearly impacted by the size of the lexicon LT we use for computing analogies: the larger the better.
This was expected since the larger a lexicon is, the higher the number of source analogies that can be made and, consequently, the higher the number of analogies that can be projected onto the output space.
The precision of analog is rather stable across variants and ranges between 50% and 60%.
The second observation we make is that the baselines perform worse than analog in all but the L500000 cases.
Since our baselines propose translations to each source word, their response rate is maximum.
Their precision, however, is an issue.
As expected, base1 is the worse of the two baselines.
If we arbitrarily fix the response rate of base2 to that of analog, the former approach shows a far lower precision (e.g., 34.4 against 59.4 for L200000).
This not only indicates that analogical learning handles unknown words better than base2, but also that a combination of both approaches could potentially yield further improvements.
A last observation concerns the fact that analog performs equally well on the out-domain material.
This is very important from a practical point of view and contrasts with some related work we discuss in Section 5.
At first glance, the fact that base2 outperforms analog on the largest training size is disappointing.
After investigations, we came to the conclusion that this is mainly due to two facts.
First, the number of unknown words on which both systems were tested is rather low in this particular case (e.g., 34 for the in-domain corpus).
Second, we noticed a deficiency of the reference lexicon Lref for many of those words.
After all, this is not surprising since the words unseen in the 500 000 pairs of training sentences, but encountered in the full training corpus (688 000 pairs) are likely to be observed only a few times, therefore weakening the associations automatically acquired for these entries.
We estimate that a third of the reference translations were wrong in this setting, which clearly raises some doubts about our automatic evaluation procedure in this case.
The performance of analog across the three language pairs is reported in Table 3.
We observe a drop of performance of roughly 10% (both in precision and response) for the German-to-English translation direction.
This is likely due to the heuristic procedure we apply during the search for stems, which is not especially well suited for handling compound words that are frequent in German.
We observe that for Spanish- and German-to-English translation directions, the precision rate tends to decrease for larger values of t. One explanation for that is that we consider all analogies equally likely in this work, while we clearly noted that some are spurious ones.
With larger training material, spurious analogies become more likely.
Table 3: Performance across language pairs measured on test-in.
The number t of pairs of sentences used for training LT is reported in thousands.
We measured the impact the translations produced by analog have on a state-of-the-art phrase-based translation engine, which is described in (Patry et al., 2006).
For that purpose, we extended a phrase-table with the first translation proposed by analog or base2 for each unknown word of the test material.
Results in terms of word-error-rate (wer) and bleu score (Papineni et al., 2002) are reported in Table 4 for those sentences that contain at least one unknown word.
Small but consistent improvements are observed for both metrics with analog.
This was expected, since the original system simply leaves the unknown words untranslated.
What is more surprising is that the base2 version slightly underperforms the baseline.
The reason is that some unknown words that should appear unmodified in a translation, often get an erroneous translation by base2.
Forcing base2 to propose a translation for the same words for which analog found one, slightly improves the figures (base2id).
[Table 4: wer, bleu; sentences]
Table 4: Translation quality produced by our phrase-based SMT engine (base) with and without the first translation produced by analog, base2, or base2id for each unknown word.
As we already mentioned, the lexicon used as a reference in our automatic evaluation procedure is not perfect, especially for low frequency words.
We further noted that several words receive valid translations that are not sanctioned by Lref.
This is for instance the case of the examples in Figure 4, where circumventing and fellow are arguably legitimate translations of the French words contournant and concitoyen, respectively.
Note that in the second example, the reference translation is in the plural form while the French word is not.
Therefore, we conducted a manual evaluation of the translations produced from L100000 by analog and base2 on the 127 French words of the corpus test-in8 unknown to Lref.
Those are the non-numerical unknown words the participating systems in the shared task had to face in the in-domain part of the test material.
8We did not notice important differences between test-in and test-out.
contournant (49 candidates)
Lref: skirting, bypassing, by-pass, overcoming
concitoyen (24 candidates)
Lref: fellow-citizens
Figure 4: 10 best ranked candidate translations produced by analog from L200000 for two unknown words and their sanctioned translations in Lref.
Words in bold are present in both the candidate and the reference lists.
75 (60%) of those words received at least one valid translation by analog while only 63 (50%) did by base2.
Among those words that received (at least) one valid translation, 61 (81%) were ranked first by analog against only 22 (35%) by base2.
We further observed that among the 52 words that did not receive a valid translation by analog, 38 (73%) did not receive a translation at all.
Those untranslated words are mainly proper names (bush), foreign words (munere), and compound words (rhenanie-du-nord-westphalie), for which our approach is not especially well suited.
We conclude from this informal evaluation that 80% of ordinary unknown words received a valid translation in our French-to-English experiment, and that roughly the same percentage of them had a valid translation ranked first by analog.
4.2 Translating Unknown Phrases
Our approach is not limited to translating unknown words, but might also serve to enrich existing entries in a lexicon.
For instance, low-frequency words, often poorly handled by current statistical methods, could receive useful translations.
This is illustrated in Figure 5 where we report the best candidates produced by analog for the French word invitees, which appears 7 times in the first 200 000 pairs of the training corpus.
Figure 5: 10 best candidates produced by analog for the low-frequency French word invitees and its translations in L200000.
Interestingly, analog produced the candidate guest which corresponds to a legitimate meaning of the French word that was absent in the training data.
Because it can treat separators as any other character, analog is not restricted to translating words.
As a proof of concept, we applied analogical reasoning to translate those source sequences of at most 5 words in the test material that contain an unknown word.
Since there are many more sequences than there are words, the input space in this experiment is far larger, and we had to resort to a much more aggressive pruning technique to find the stems of the sequences to be translated.
Figure 6: Examples of translations produced by analog where the input (resp. output) space is defined by the set of source (resp. target) word sequences.
Words in bold are unknown.
We applied the automatic evaluation procedure described in Section 4.1.2 for the French-to-English translation direction, with a reference lexicon being this time the phrase table acquired on the full training material.9 The response rate in this experiment is particularly low since only a tenth of the sequences received (at least) a translation by analog.
9This model contains 1.5 million pairs of phrases.
Those are short sequences that contain at most three words, which clearly indicates the limitation of our pruning strategy.
Among those sequences that received at least one translation, the precision rate is 55%, which is consistent with the rate we measured while translating words.
Examples of translations are reported in Figure 6.
We observe that single words are no longer constrained to be translated by a single word.
This allows capturing 1:n relations such as depasseront ↔ will exceed, where the future tense of the French word is adequately rendered by the modal will in English.
5 Related Work
We are not the first to consider the translation of unknown words or phrases.
Several authors have for instance proposed approaches for translating proper names and named entities (Chen et al., 1998; Al-Onaizan and Knight, 2002).
Our approach is complementary to those ones.
Recently and more closely related to the approach we described, Callison-Burch et al. (2006) proposed to replace an unknown phrase in a source sentence by a paraphrase.
Paraphrases in their work are acquired thanks to a word alignment computed over a large external set of bitexts.
One important difference between their work and ours is that our approach does not require additional material.10 Indeed, they used a rather idealistic set of large, homogeneous bitexts (European parliament debates) to acquire paraphrases from.
Therefore we feel our approach is more suited for translating "low density" languages and languages with a rich morphology.
Several authors considered as well the translation of new words by relying on distributional collocational properties computed from a huge non-parallel corpus (Rapp, 1999; Fung and Yee, 1998; Takaaki
Even if non-parallel corpora are admittedly easier to acquire than bitexts, this line of work is still heavily dependent on huge external resources.
Most of the analogies made at the word level in our study are capturing morphological information.
10We do use a target vocabulary list to filter out spurious analogies, but we believe we could do without.
The frequency with which we generate a string could serve to decide upon its legitimacy.
The use of morphological analysis in (statistical) machine translation has been the focus of several studies, (Nießen, 2002) among the first.
Depending on the pairs of languages considered, gains have been reported when the training material is of modest size (Lee, 2004; Popovic and Ney, 2004; Goldwater and McClosky, 2005).
Our approach does not require any morphological knowledge of the source, the target, or both languages.
Admittedly, several unsupervised morphological induction methodologies have been proposed, e.g., the recent approach in Freitag (2005).
In any case, as we have shown, analog is not restricted to treating words, which we believe to be to our advantage.
6 Discussion and Future Work
In this paper, we have investigated the appropriateness of analogical learning to handle unknown words in machine translation.
In contrast to several lines of work, our approach does not rely on massive additional resources but capitalizes instead on information that is inherent to the language.
We measured that roughly 80% of ordinary unknown French words can receive a valid translation into English with our approach.
This work is currently being developed in several directions.
First, we are investigating why our approach remains silent for some words or phrases.
This will allow us to better characterize the limitations of analog and will hopefully lead us to design a better strategy for identifying the stems of a given word or phrase.
Second, we are investigating how a systematic enrichment of a phrase-transfer table will impact a phrase-based statistical machine translation engine.
Last, we want to investigate the training of a model that can learn regularities from the analogies we are making.
This would relieve us from requiring the training material while translating, and would allow us to compare our approach with other methods proposed for unsupervised morphology acquisition.
Acknowledgement We are grateful to the anonymous reviewers for their useful suggestions and to Pierre Poulin for his fruitful comments.
This study has been partially funded by NSERC.
