We present a method for learning to find English to Chinese transliterations on the Web.
In our approach, proper nouns are expanded into new queries aimed at maximizing the probability of retrieving transliterations from existing search engines.
The method involves learning the sublexical relationships between names and their transliterations.
At run-time, a given name is automatically extended into queries with relevant morphemes, and transliterations in the returned search snippets are extracted and ranked.
We present a new system, TermMine, that applies the method to find transliterations of a given name.
Evaluation on a list of 500 proper names shows that the method achieves high precision and recall, and outperforms commercial machine translation systems.
1 Introduction
Increasingly, short passages or web pages are being translated by desktop machine translation software or are submitted to machine translation services on the Web every day.
These texts usually contain some proportion of proper names (e.g., place and people names in "The cities of Mesopotamia prospered under Parthian and Sassanian rule."), which may not be handled properly by a machine translation system.
Online machine translation services such as Google Translate¹ or Yahoo! Babelfish² typically use a bilingual dictionary that is either manually compiled or learned from a parallel corpus.
1 Google Translate: translate.google.com/translate_t
2 Yahoo! Babelfish: babelfish.yahoo.com
Jason S. Chang
Department of Computer Science, National Tsing Hua University, 101, Kuangfu Road, Hsinchu, Taiwan, jschang@cs.nthu.edu.tw
However, such dictionaries often have insufficient coverage of proper names and technical terms, leading to poor translation performance due to the out-of-vocabulary (OOV) problem.
Handling name transliteration is also important for cross-language information retrieval (CLIR) and terminology translation (Quah 2006).
There are also services on the Web specifically targeting transliteration aimed at improving CLIR, including
Chien, and Lee 2004).
The OOV problems of machine translation (MT) or CLIR can be handled more effectively by learning to find transliterations on the Web.
Consider the sentence in Example (1), containing three proper names.
Google Translate produces the sentence in Example (2) and leaves "Parthian" and "Sassanian" untranslated.
A good response might be a translation like Example (3) with appropriate transliterations (underlined).
(1) The cities of Mesopotamia prospered under Parthian and Sassanian rule.
These transliterations can be more effectively retrieved from mixed-code Web pages by extending each of the proper names into a query.
Intuitively, by requiring one of the likely transliteration morphemes (e.g., "巴" (Ba) or "帕" (Pa) for names beginning with the prefix "par-"), we can bias the search engine towards retrieving the correct transliterations (e.g., those romanized as Badiya and Patiya) in snippets of many top-ranked documents.
3 美索不達米亞 (Meisuobudamiya) is the transliteration of "Mesopotamia."
5 薩珊 (Sashan) is the transliteration of "Sassanian."
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 996-1004, Prague, June 2007.
©2007 Association for Computational Linguistics
Figure 1. An example of a TermMine search for transliterations of the name "Parthian."
This approach to terminology translation by searching is a strategy increasingly adopted by human translators.
Quah (2006) described how a modern-day translator would search for the translation of a difficult Chinese technical term by expanding the query with the word "film" (a back transliteration of a component of the term in question).
This kind of query expansion (QE) indeed increases the chance of finding the correct translation "anisotropic conductive film" in top-ranked snippets.
However, the manual process of expanding the query, sending the search request, and extracting the transliteration is tedious and time-consuming.
Furthermore, unless the query expansion is done properly, snippets containing answers might not be ranked high enough for this strategy to be the most effective.
We present a new system, TermMine, that automatically learns to extend a given name into a query expected to retrieve and extract transliterations of the proper name.
An example of machine transliteration of "Parthian" is shown in Figure 1.
TermMine has determined the best 10 query expansions (e.g., "Parthian 巴," "Parthian 帕").
TermMine learns these effective expansions automatically during training by analyzing a collection of place names and their transliterations, and deriving cross-language relationships between prefix and postfix morphemes.
For instance, TermMine learns that a name that begins with the prefix "par-" is likely to have a transliteration beginning with "巴" or "帕."
We describe the learning process in Section 3.
This prototype demonstrates a novel method for learning to find transliterations of proper nouns on the Web based on query expansion aimed at maximizing the probability of retrieving transliterations from existing search engines.
Since the method involves learning the morphological relationships between names and their transliterations, we refer to this IR-based approach as the morphological query expansion approach to machine transliteration.
This novel approach is general in scope and can also be applied to back transliteration and to translation with slight modifications, even though we focus on transliteration in this paper.
The remainder of the paper is organized as follows.
First, we give a formal statement for the problem (Section 2).
Then, we present a solution to the problem by proposing new transliteration probability functions, describing the procedure for estimating parameters for these functions (Section 3) and the run-time procedure for searching and extracting transliterations via a search engine (Section 4).
As part of our evaluation, we carry out two sets of experiments, with or without query expansion, and compare the results.
We also evaluate the results against two commercial machine translation online services (Section 5).
2 Problem Statement
Using online machine translation services for name transliteration does not work very well.
Searching in the vicinity of the name in mixed-code Web pages is a good strategy.
However, query expansion is needed for this strategy to be effective.
Therefore, to find transliterations of a name, a promising approach is to automatically expand the given name into a query with the additional requirement of some morpheme expected to be part of relevant transliterations that might appear on the Web.
Table 1. Sample name-transliteration pairs. Names shown: Aabenraa, Aardenburg, Aalesund, Abacaxis. Column heading: Transliteration.
Now, we formally state the problem we are dealing with:
We are given a proper name N.
Our goal is to search for and extract the transliteration T of N from Web pages via a general-purpose search engine SE.
To that end, we expand N into a set of queries q_1, q_2, ..., q_m, such that the top n document snippets returned by SE for the queries are likely to contain some transliterations T of the given name N.
In the next section, we propose using a probability function to model the relationships between names and transliterations and describe how the parameters in this function can be estimated.
3 Learning Relationships for QE
We attempt to derive cross-language morphological relationships between names and transliterations and use them to expand a name into an effective query for searching and extracting transliterations.
For the purpose of expanding the given name N into effective queries to search for and extract transliterations T, we define a probabilistic function for mapping a prefix syllable from the source language to the target language.
The prefix transliteration function P(TP | NP) is the probability that T has the prefix TP given that the name N has the prefix NP:
$$P(T_P \mid N_P) = \frac{\mathrm{Count}(T_P, N_P)}{\mathrm{Count}(N_P)}$$
where Count(TP, NP) is the number of times TP and NP co-occur in the pairs of the training set (see Table 1), and Count(NP) is the number of times NP occurs in the training set.
Each prefix or postfix is intended to correspond to a syllable in the respective language, so that the paired affixes correspond to each other (see Tables 2 and 3).
Due to the differences in the sound inventories, the Roman prefix corresponding to a syllabic prefix in Chinese may vary: it may be a consonant, a vowel, or a consonant followed by a vowel (but not a vowel followed by a consonant).
Such a Roman prefix is therefore likely to have one to four letters.
In contrast, the prefix syllable for a name written in Chinese is readily identifiable.
Table 2. Sample cross-language morphological relationships between prefixes. Column headings: Transliteration Prefix (TP), NP Count, TP Count.
Table 3. Sample cross-language morphological relationships between postfixes. Column headings: Transliteration Postfix (TS), NS Count, Co-occ. Count.
We also observe that a preferred prefix character (romanized as Ai) is often used for a Roman prefix (e.g., "a-" or "ir-"), while occasionally other homophonic characters (also romanized as Ai) are used.
This skewed distribution creates problems for reliable estimation of the transliteration functions.
To cope with this data sparseness problem, we use homophone classes and a function CL that maps homophonic characters to the same class number.
For instance, the preferred character and its rarer homophone are both assigned the same class identifier (see Table 4 for more samples).
Therefore, CL returns the same value, 275, for both characters.
Table 4. Some examples of classes of homophonic characters. The class ID of each class is assigned arbitrarily.
With homophonic classes of transliteration morphemes, we define class-based transliteration probability as follows
With class-based transliteration probabilities, we are able to cope with the difficulty of estimating parameters for rare events that are underrepresented in the training set.
Table 5 shows that a rarely used character romanized as Ai belongs to a homophonic class that co-occurs with "a-" 46 times, even though the pair of that character and "a-" occurs only once.
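The effect of pooling counts over a homophone class can be illustrated with a minimal Python sketch. It assumes that the class-based probability of a transliteration prefix is the pooled co-occurrence count of its class divided by the count of the name prefix, in line with the description of P(C | NP) in the caption of Table 5. The characters, the groupings, and the counts below are invented for illustration; only the class ID 275 comes from the text, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical homophone table: the characters and groupings are illustrative
# assumptions; only the class ID 275 is taken from the text.
HOMOPHONE_CLASS = {"艾": 275, "埃": 275, "巴": 12, "帕": 31}


def cl(char):
    """CL: map a transliteration character to its homophone class ID."""
    return HOMOPHONE_CLASS[char]


def plain_prob(t_prefix, n_prefix, pair_counts, name_counts):
    """P(TP | NP) = Count(TP, NP) / Count(NP)."""
    return pair_counts[(t_prefix, n_prefix)] / name_counts[n_prefix]


def class_based_prob(t_prefix, n_prefix, pair_counts, name_counts):
    """Assumed class-based estimate: pool the counts of every TP in the class."""
    target = cl(t_prefix)
    pooled = sum(
        count
        for (tp, np_), count in pair_counts.items()
        if np_ == n_prefix and cl(tp) == target
    )
    return pooled / name_counts[n_prefix]


# Toy counts, invented for illustration (not the paper's data).
pair_counts = Counter({("艾", "a-"): 45, ("埃", "a-"): 1, ("巴", "a-"): 4})
name_counts = Counter({"a-": 50})

rare_plain = plain_prob("埃", "a-", pair_counts, name_counts)
rare_class = class_based_prob("埃", "a-", pair_counts, name_counts)
```

With these toy counts, the rare character receives a very small plain estimate but shares the much larger pooled estimate of its class, which is the behavior the class-based probability is meant to provide.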
After cross-language relationships for prefixes and postfixes are automatically trained, the prefix relationships are stored as prioritized query expansion rules.
In addition to that, we also need a transliteration probability function to rank candidate transliterations at run-time (Section 4).
To cope with data sparseness, we consider names (or transliterations) with the same prefix (or postfix) as a class.
With that in mind, we use both prefix and postfix to formulate an interpolation-based estimator for name transliteration probability:
where λ1 + λ2 = 1, and NP, NS, TP, and TS are the prefixes and postfixes of the given name N and the transliteration T, respectively.
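As a sketch, a linear interpolation consistent with these definitions can be written as follows. The two weights are written $\lambda_1$ and $\lambda_2$ here, and $T_S$, $N_S$ denote the postfixes; whether each component is the plain or the class-based probability is an assumption left open in this sketch.

$$P(T \mid N) = \lambda_1 \, P(T_P \mid N_P) + \lambda_2 \, P(T_S \mid N_S), \qquad \lambda_1 + \lambda_2 = 1$$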
For instance, the probability of "美索不達米亞" (Meisuobudamiya) as a transliteration of "Mesopotamia" is estimated as follows
(1) For each entry in the bilingual name list, pair up prefixes and postfixes in names and transliterations.
(2) Calculate counts of these affixes and their co-occurrences.
(3) Estimate the prefix and postfix transliteration functions.
(4) Estimate class-based prefix and postfix transliteration functions.
Figure 2. Outline of the process used to train the TermMine system.
The system follows the procedure shown in Figure 2 to estimate these probabilities.
In Step (1), the system generates all possible prefix and postfix pairs for each name-transliteration pair.
For instance, for the name "Aabenraa" and its transliteration, the system will generate eight pairs, where X and Y stand for the first and last characters of the Chinese transliteration:
(a-, X-), (aa-, X-), (aab-, X-), (aabe-, X-), (-a, -Y), (-aa, -Y), (-raa, -Y), and (-nraa, -Y).
Finally, the transliteration probabilities are estimated based on the counts of prefixes, postfixes, and their co-occurrences.
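The pairing and counting steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the function names and the constant `MAX_AFFIX_LETTERS` are hypothetical, the toy training pairs are invented, and only the plain (not class-based) estimate is shown.

```python
from collections import Counter

MAX_AFFIX_LETTERS = 4  # assumption: mirrors the one-to-four-letter affixes in the text


def affix_pairs(name, translit, max_len=MAX_AFFIX_LETTERS):
    """Pair 1..max_len leading/trailing letters with the first/last character."""
    name = name.lower()
    k = min(max_len, len(name))
    first, last = translit[0], translit[-1]
    prefixes = [(name[:i] + "-", first + "-") for i in range(1, k + 1)]
    postfixes = [("-" + name[-i:], "-" + last) for i in range(1, k + 1)]
    return prefixes, postfixes


def count_affixes(pairs):
    """Tally Count(T affix, N affix) and Count(N affix) over the training pairs."""
    pair_counts, name_counts = Counter(), Counter()
    for name, translit in pairs:
        prefixes, postfixes = affix_pairs(name, translit)
        # The dash position keeps prefixes ("par-") and postfixes ("-ton") distinct.
        for n_affix, t_affix in prefixes + postfixes:
            pair_counts[(t_affix, n_affix)] += 1
            name_counts[n_affix] += 1
    return pair_counts, name_counts


def transliteration_prob(t_affix, n_affix, pair_counts, name_counts):
    """Relative-frequency estimate Count(T affix, N affix) / Count(N affix)."""
    if name_counts[n_affix] == 0:
        return 0.0
    return pair_counts[(t_affix, n_affix)] / name_counts[n_affix]


# Toy training list; the pairs are illustrative, not taken from the paper's data.
toy_pairs = [("Paris", "巴黎"), ("Parma", "帕爾馬"), ("Parthia", "帕提亞")]
pair_counts, name_counts = count_affixes(toy_pairs)
p_ba = transliteration_prob("巴-", "par-", pair_counts, name_counts)
p_pa = transliteration_prob("帕-", "par-", pair_counts, name_counts)
```

Calling `affix_pairs("Aabenraa", "XY")` yields the four prefix pairs and four postfix pairs listed in the example above, with "X" and "Y" standing in for the first and last transliteration characters.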
The derived probabilities embody a number of relationships:
4 Transliteration Search and Extraction
At run-time, the system follows the procedure in Figure 3 to process the given name.
In Step (1), the system looks up the prefix relationship table to find the n best relationships (n = MaxExpQueries) for query expansion, preferring relationships with higher probabilities.
For instance, to search for transliterations of "Acton," the system looks at all possible prefixes and postfixes of "Acton," including a-, ac-, act-, acto-, -n, -on, -ton, and -cton, and determines the best query expansions, each consisting of "Acton" followed by a likely transliteration morpheme. These effective expansions are automatically derived during the training stage described in Section 3 by analyzing a large collection of name-transliteration pairs.
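The lookup in Step (1) can be sketched in Python as follows. The rule table layout (a dictionary from a name affix to a list of morpheme and probability pairs), the function name `expand_queries`, and the probabilities are assumptions made for illustration; the text specifies only that the best MaxExpQueries relationships are used.

```python
def expand_queries(name, rules, max_exp_queries=10, max_len=4):
    """Form up to max_exp_queries queries of the form '<name> <morpheme>'."""
    lowered = name.lower()
    k = min(max_len, len(lowered))
    affixes = [lowered[:i] + "-" for i in range(1, k + 1)]
    affixes += ["-" + lowered[-i:] for i in range(1, k + 1)]

    # Keep the highest probability seen for each target morpheme.
    best = {}
    for affix in affixes:
        for morpheme, prob in rules.get(affix, []):
            best[morpheme] = max(prob, best.get(morpheme, 0.0))

    # Higher probability first; ties broken by morpheme for determinism.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [f"{name} {morpheme}" for morpheme, _ in ranked[:max_exp_queries]]


# Hypothetical rule table with invented probabilities.
toy_rules = {
    "par-": [("巴", 0.5), ("帕", 0.3)],
    "pa-": [("帕", 0.4)],
    "-an": [("安", 0.2)],
}
queries = expand_queries("Parthian", toy_rules, max_exp_queries=2)
```

Each resulting query string would then be sent to the search engine in Step (2).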
In Step (2), the system sends off each of these queries to a search engine to retrieve up to MaxDocRetrieved document snippets.
In Step (3), the system discards snippets that contain too small a proportion of target-language text.
See Example (4) for a snippet that has a high proportion of English text and is therefore less likely to contain a transliteration.
In Step (4), the system considers the substrings in the remaining snippets.
(1) Look up the table for the top MaxExpQueries prefix and postfix relationships relevant to the given name and use the target morphemes in the relationships to form expanded queries.
(2) Search for Web pages with the queries and filter out snippets containing less than a MinTargetRate proportion of target-language text.
(3) Evaluate candidates based on class-based transliteration probability (Equation 5).
(4) Output the top candidate for evaluation.
Figure 3. Outline of the steps used to search, extract, and rank transliterations.
Table 5. Sample data for class-based morphological transliteration probability of prefixes, where # of NP denotes the number of occurrences of the name prefix NP; # of C, NP denotes the number of times any TP belonging to the class C co-occurs with NP; # of TP, NP denotes the number of times the transliteration prefix TP co-occurs with NP; P(C | NP) denotes the probability of any TP belonging to C co-occurring with NP; and P(TP | NP) denotes the probability of TP co-occurring with NP.
Table 6. Sample data for class-based morphological transliteration probability of postfixes. Notations are similar to those for Table 5.
http://www.hkmassive.com/forum/viewthread.php?tid=2368&fpage=1 Watch the slide show! ...
(5) New Home Alert - Sing Tao [...] New Homes Please select, Acton [...], Ajax [...], Alliston [...], Ancaster [...], Arthur [...], Aurora [...], Ayr [...], Barrie [...], Beamsville, Belleville ...
Acton Systems is a world leading manufacturer supplying stuctured cabling systems suited to the Australian and New Zealand marketplace. [...] Custom made leads are now available ...
The occurrence counts and average distance from instances of the given name are tallied for each of these candidates.
Candidates with a low occurrence count and long average distance are excluded from further consideration.
Finally, all candidates are evaluated and ranked using Equation (7) given in Section 3.
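The filtering and tallying steps can be sketched in Python as follows. This is a simplified illustration under several assumptions: target-language text is approximated by the CJK Unified Ideographs range, candidates are whole runs of Chinese characters rather than all substrings, distance is measured in characters to the nearest occurrence of the name, and a candidate is kept only if it satisfies both thresholds. The function names are hypothetical and the sample transliteration is illustrative; the final probability-based ranking is not shown.

```python
import re
from collections import defaultdict

CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def target_rate(snippet):
    """Fraction of non-space characters that are Chinese characters."""
    chars = [c for c in snippet if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if CJK_CHAR.match(c)) / len(chars)


def filter_snippets(snippets, min_target_rate):
    """Drop snippets with too small a proportion of target-language text."""
    return [s for s in snippets if target_rate(s) >= min_target_rate]


def candidate_stats(name, snippets):
    """Return {candidate: (occurrence count, average distance to the name)}."""
    totals = defaultdict(lambda: [0, 0])
    for snippet in snippets:
        positions = [
            m.start()
            for m in re.finditer(re.escape(name), snippet, flags=re.IGNORECASE)
        ]
        if not positions:
            continue
        for run in CJK_RUN.finditer(snippet):
            distance = min(abs(run.start() - p) for p in positions)
            totals[run.group()][0] += 1
            totals[run.group()][1] += distance
    return {cand: (n, total / n) for cand, (n, total) in totals.items()}


def select_candidates(stats, min_occ_count, max_avg_distance):
    """Keep candidates that occur often enough and close enough to the name."""
    return [
        cand
        for cand, (n, dist) in stats.items()
        if n >= min_occ_count and dist <= max_avg_distance
    ]


# Toy snippets; the Chinese form is an illustrative assumption.
snippets = [
    "Acton 阿克頓 is a town",
    "Acton Systems is a world leading manufacturer",
]
kept = filter_snippets(snippets, min_target_rate=0.1)
stats = candidate_stats("Acton", kept)
candidates = select_candidates(stats, min_occ_count=1, max_avg_distance=10)
```

The thresholds correspond to the MinTargetRate, MinOccCount, and MaxAvgDistance parameters; the values used here are arbitrary.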
5 Evaluation
In the experiment carried out to assess the feasibility of the proposed method, a data set of 23,615 names and transliterations was used.
This set of place name data is available from NICT, Taiwan for training and testing.
There are 967 distinct Chinese characters present in the data, and more details of the training data are available in Table 7.
The English part consists of Romanized versions of names originating from many languages, including Western and Asian languages.
Most of the time, the names come with a Chinese counterpart based solely on transliteration.
But occasionally, the Chinese counterpart is part translation and part transliteration.
For instance, the city of "Southampton" has a Chinese counterpart consisting of "南" (translation of "south") followed by a transliteration of "ampton."
Type of Data Used in Experiment
Name-transliteration pairs
Training data
Test data
Distinct transliteration morphemes
Distinct transliteration morphemes (80% coverage)
Names with part translation and part transliteration (estimated)
Cross-language prefix relationships
Cross-language postfix relationships
We used the set of parameters shown in Table 8 to train and run the TermMine system.
A set of 500 randomly selected pairs was set aside for testing.
We paired up the prefixes and postfixes in the remaining 23,116 pairs by taking one to four leading or trailing letters of each Romanized place name and the first and last Chinese transliteration characters to estimate P(TP | NP) and P(TS | NS).
Parameter | Description
MaxPrefixLetters | Max number of letters in a prefix
MaxPostfixLetters | Max number of letters in a postfix
MaxExpQueries | Max number of expanded queries
MaxDocRetrieved | Max number of documents retrieved
MinTargetRate | Min rate of target text in a snippet
MinOccCount | Min number of co-occurrences of query and transliteration candidate in snippets
MaxAvgDistance | Max distance between N and T
WeightPrefixProb | Weight of prefix probability
WeightPostfixProb | Weight of postfix probability (λ2)
We carried out two kinds of evaluation on System TermMine, with and without query expansion.
With QE option off, the name itself was sent off as a query to the search engine, while with QE option turned on, up to 10 expanded queries were sent for each name.
We also evaluated the system against Google Translate and Yahoo! Babelfish.
We discarded the results when the names were returned untranslated.
After that, we checked the correctness of all remaining results by hand.
Table 9 shows a sample of the results produced by the three systems.
In Table 10, we show how the performance of TermMine differs with and without the query expansion option.
Without QE, the system returns transliterations (applicability) less than 50% of the time.
Nevertheless, there are enough snippets for extracting and ranking of transliterations.
The precision rate of the top-ranking transliterations is 88%.
With QE turned on, the applicability rate increases significantly to 60%.
The precision rate also improves slightly, to 89%.
The performance evaluation of the three systems is shown in Table 11.
For the test set of 500 place names, Google Translate returned 146 transliterations and Yahoo! Babelfish returned only 44, while TermMine returned 300.
Of the returned transliterations, Google Translate and Yahoo! Babelfish achieved a precision rate of around 50%, while TermMine achieved a precision rate almost as high as 90%.
The results show that TermMine outperforms both commercial MT systems by a wide margin in the area of machine transliteration of proper names.
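The applicability figures follow directly from the counts reported above, as the following Python sketch shows. The function names are hypothetical. The text does not define the F-measure, so the harmonic mean of applicability and precision used here is an assumption, and no per-system F-measure values are claimed.

```python
TOTAL_NAMES = 500  # size of the test set reported in the text

# Number of transliterations returned by each system, as reported in the text.
returned = {"TermMine": 300, "Google Translate": 146, "Yahoo! Babelfish": 44}


def applicability(num_returned, total):
    """Share of test names for which a system returned a transliteration."""
    return num_returned / total


def precision(num_correct, num_returned):
    """Share of returned transliterations that are correct."""
    return num_correct / num_returned if num_returned else 0.0


def f_measure(app, prec):
    """Assumed definition: harmonic mean of applicability and precision."""
    return 2 * app * prec / (app + prec) if (app + prec) else 0.0


rates = {system: applicability(n, TOTAL_NAMES) for system, n in returned.items()}
```

Under these counts, the applicability of TermMine is 300/500 = 60%, in agreement with the figure reported for the system with query expansion turned on.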
Table 9. Sample output by the three systems evaluated (TermMine, Google Translate, and Yahoo! Babelfish). Sample names: Arlington, Palmerston, Cootamundra, Australasia, Inverness, Lomonosov, Oskaloosa.
Table 10. Performance evaluation of TermMine. Rows: TermMine QE-, TermMine QE+. Reported measures: # of cases performed, # of correct answers, applicability, precision, F-measure.
6 Comparison with Previous Work
Machine transliteration has been an area of active research.
Most machine transliteration methods attempt to model the transliteration process of mapping between graphemes and phonemes.
Knight and Graehl (1998) proposed a multilayer model and a generate-and-test approach to perform back transliteration from Japanese to English based on the model.
In our work, we address the issue of producing transliterations by way of search.
Onaizan and Knight (2002), and Oh et al. (2005).
Recently, some machine transliteration studies have begun to consider the problem of extracting names and their transliterations from parallel corpora (Qu and Grefenstette 2004; Lin, Wu, and Chang 2004; Lee and Chang 2003; Li and Grefenstette 2005).
Cao and Li (2002) described a new method for base noun phrase translation by using Web data.
Kwok et al. (2001) described a system called CHINET for cross-language name search.
Nagata et al. (2001) described how to exploit proximity and redundancy to extract translation for a given term.
Lu, Chien, and Lee (2002) describe a method for name translation based on mining of anchor texts.
More recently, Zhang, Huang, and Vogel (2005) proposed to use occurring words to expand queries for searching and extracting transliterations.
Oh and Isahara (2006) use phonetic similarity to recognize transliteration pairs on the Web.
In contrast to previous work, we propose a simple method for extracting transliterations based on a statistical model trained automatically on a bilingual name list via unsupervised learning.
We also carried out experiments on training the proposed model and applying it to extract transliterations, using the Web as a corpus.
7 Conclusion and Future Work
Morphological query expansion represents an innovative way to capture cross-language relations in name transliteration.
The method is independent of the content of the bilingual lexicon, making it easy to adapt to other proper names such as person, product, or organization names.
This approach is useful in a number of machine translation subtasks, including name transliteration, back transliteration, named entity translation, and terminology translation.
Many opportunities exist for future research and improvement of the proposed approach.
First, the method explored here can be extended as an alternative way to support such MT subtasks as back transliteration (Knight and Graehl 1998) and noun phrase translation (Koehn and Knight 2003).
Finally, for more challenging MT tasks, such as handling sentences, translation quality can probably also be improved by combining this IR-based approach with statistical machine translation.
For example, a pre-processing unit may replace the proper names in a sentence with transliterations (e.g., producing mixed-code text in which "Mesopotamia," "Parthian," and "Sassanian" in "The cities of Mesopotamia prospered under Parthian and Sassanian rule." are replaced by their Chinese transliterations) before sending it off to MT for final translation.
