We
present
a
method
for
learning
to
find
English
to
Chinese
transliterations
on
the
Web
.
In
our
approach
,
proper
nouns
are
expanded
into
new
queries
aimed
at
maximizing
the
probability
of
retrieving
transliterations
from
existing
search
engines
.
The
method
involves
learning
the
sublexical
relationships
between
names
and
their
transliterations
.
At
run-time
,
a
given
name
is
automatically
extended
into
queries
with
relevant
morphemes
,
and
transliterations
in
the
returned
search
snippets
are
extracted
and
ranked
.
We
present
a
new
system
,
TermMine
,
that
applies
the
method
to
find
transliterations
of
a
given
name
.
Evaluation
on
a
list
of
500
proper
names
shows
that
the
method
achieves
high
precision
and
recall
,
and
outperforms
commercial
machine
translation
systems
.
1
Introduction
Increasingly
,
short
passages
or
web
pages
are
being
translated
by
desktop
machine
translation
software
or
are
submitted
to
machine
translation
services
on
the
Web
every
day
.
These
texts
usually
contain
some
proportion
of
proper
names
(
e.g.
,
place
and
people
names
in
"
The
cities
of
Mesopotamia
prospered
under
Parthian
and
Sassanian
rule
.
"
)
,
which
may
not
be
handled
properly
by
a
machine
translation
system
.
Online
machine
translation
services
such
as
Google
Translate1
or
Yahoo
!
Babelfish2
typically
use
a
bilingual
dictionary
that
is
either
manually
compiled
or
learned
from
a
parallel corpus.
1 Google Translate: translate.google.com/translate_t
2 Yahoo! Babelfish: babelfish.yahoo.com
Jason S. Chang, Department of Computer Science, National Tsing Hua University, 101, Kuangfu Road, Hsinchu, Taiwan, jschang@cs.nthu.edu.tw
However
,
such
dictionaries
often
have
insufficient
coverage
of
proper
names
and
technical
terms
,
leading
to
poor
translation
performance
due to the out-of-vocabulary (OOV) problem.
Handling
name
transliteration
is
also
important
for
cross
language
information
retrieval
(
CLIR
)
and
terminology
translation
(
Quah
2006
)
.
There
are
also
services
on
the
Web
specifically
targeting
transliteration
aimed
at
improving
CLIR
,
including
Chien
,
and
Lee
2004
)
.
The
OOV
problems
of
machine
translation
(
MT
)
or
CLIR
can
be
handled
more
effectively
by
learning
to
find
transliteration
on
the
Web
.
Consider
the
sentence
in
Example
(
1
)
,
containing
three
proper
names
.
Google
Translate
produces
the
sentence
in
Example
(
2
)
and
leaves
"
Parthian
"
and
"
Sassanian
"
not
translated
.
A
good
response
might
be
a
translation
like
Example
(
3
)
with
appropriate
transliterations
(
underlined
)
.
(
1
)
The
cities
of
Mesopotamia
prospered
under
Parthian
and
Sassanian
rule
.
These
transliterations
can
be
more
effectively
retrieved
from
mixed-code
Web
pages
by
extending
each
of
the
proper
names
into
a
query
.
Intuitively, by requiring one of the likely transliteration morphemes (e.g., the characters romanized as "Ba" or "Pa" for names beginning with the prefix "par-"), we can bias the search engine towards retrieving the correct transliterations (e.g., those romanized as "Badiya" and "Patiya") in snippets of many top-ranked documents.
3 "Meisuobudamiya" is the romanized form of the transliteration of "Mesopotamia."
5 "Sashan" is the romanized form of the transliteration of "Sassanian."
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 996-1004, Prague, June 2007. © 2007 Association for Computational Linguistics
Figure 1. An example of TermMine search for transliterations of the name "Parthian".
This
approach
to
terminology
translation
by
searching
is
a
strategy
increasingly
adopted
by
human
translators
.
Quah (2006) described how a modern-day translator would search for the translation of a difficult Chinese technical term by expanding the query with the word "film" (a back transliteration of a component of the term in question).
This
kind
of
query
expansion
(
QE
)
indeed
increases
the
chance
of
finding
the
correct
translation
"
anisotropic
conductive
film
"
in
top-ranked
snippets
.
However, the manual process of expanding the query, sending the search request, and extracting the transliteration is tedious and time-consuming.
Furthermore
,
unless
the
query
expansion
is
done
properly
,
snippets
containing
answers
might
not
be
ranked
high
enough
for
this
strategy
to
be
the
most
effective
.
We
present
a
new
system
,
TermMine
,
that
automatically
learns
to
extend
a
given
name
into
a
query
expected
to
retrieve
and
extract
transliterations
of
the
proper
name
.
An
example
of
machine
transliteration
of
"
Parthian
"
is
shown
in
Figure
1
.
TermMine has determined the best 10 query expansions (e.g., "Parthian" followed by the character romanized as "Ba" or "Pa"). TermMine learns these effective expansions automatically during training by analyzing a collection of place names and their transliterations, and deriving cross-language relationships of prefix and postfix morphemes. For instance, TermMine learns that a name that begins with the prefix "par-" is likely to have a transliteration beginning with the character romanized as "Ba" or "Pa".
We describe the learning process in Section 3.
This
prototype
demonstrates
a
novel
method
for
learning
to
find
transliterations
of
proper
nouns
on
the
Web
based
on
query
expansion
aimed
at
maximizing
the
probability
of
retrieving
transliterations
from
existing
search
engines
.
Since
the
method
involves
learning
the
morphological
relationships
between
names
and
their
transliterations
,
we
refer
to
this
IR-based
approach
as the morphological
query
expansion
approach
to
machine
transliteration
.
This
novel
approach
is
general
in
scope
and
can
also
be
applied
to
back
transliteration
and
to
translation
with
slight
modifications
,
even
though
we
focus
on
transliteration
in
this
paper
.
The
remainder
of
the
paper
is
organized
as
follows
.
First
,
we
give
a
formal
statement
for
the
problem
(
Section
2
)
.
Then
,
we
present
a
solution
to
the
problem
by
proposing
new
transliteration
probability
functions
,
describing
the
procedure
for
estimating
parameters
for
these
functions
(
Section
3
)
and
the
run-time
procedure
for
searching
and
extracting
transliteration
via
a
search
engine
(
Section
4
)
.
As
part
of
our
evaluation
,
we
carry
out
two
sets
of
experiments
,
with
or
without
query
expansion
,
and
compare
the
results
.
We
also
evaluate
the
results
against
two
commercial
machine
translation
online
services
(
Section
5
)
.
2
Problem
Statement
Using
online
machine
translation
services
for
name
transliteration
does
not
work
very
well
.
Searching
in
the
vicinity
of
the
name
in
mixed-code
Web
pages
is
a
good
strategy
.
However
,
query
expansion
is
needed
for
this
strategy
to
be
effective
.
Therefore
,
to
find
transliterations
of
a
name
,
a
promising
approach
is
to
automatically
expand
the
given
name
into
a
query
with
the
additional
requirement
of
some
morpheme
expected
to
be
part
of
relevant
transliterations
that
might
appear
on
the
Web
.
Table 1. Sample name-transliteration pairs from the training set. Columns: Name, Transliteration. Names shown: Aabenraa, Aardenburg, Aalesund, Abacaxis.
Now
,
we
formally
state
the
problem
we
are
dealing
with
:
We are given a proper name N. Our goal is to search for and extract the transliteration T of N from Web pages via a general-purpose search engine SE. To that end, we expand N into a set of queries $q_1, q_2, \ldots, q_m$, such that the top n document snippets returned by SE for the queries are likely to contain some transliterations T of the given name N.
In
the
next
section
,
we
propose
using
a
probability
function
to
model
the
relationships
between
names
and
transliterations
and
describe
how
the
parameters
in
this
function
can
be
estimated
.
3
Learning
Relationships
for
QE
We
attempt
to
derive
cross-language
morphological
relationships
between
names
and
transliterations
and
use
them
to
expand
a
name
into
an
effective
query
for
searching
and
extracting
transliterations
.
For
the
purpose
of
expanding
the
given
name
,
N
,
into
effective
queries
to
search
and
extract
transliterations
T
,
we
define
a
probabilistic
function
for
mapping
prefix
syllable
from
the
source
to
the
target
languages
.
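The prefix transliteration function $P(T_P \mid N_P)$ is the probability that T has the prefix $T_P$, given that the name N has the prefix $N_P$:

$$P(T_P \mid N_P) = \frac{\mathrm{Count}(T_P, N_P)}{\mathrm{Count}(N_P)}$$

where $\mathrm{Count}(T_P, N_P)$ is the number of times $T_P$ and $N_P$ co-occur in the pairs of the training set (see Table 1), and $\mathrm{Count}(N_P)$ is the number of times $N_P$ occurs in the training set. The postfix transliteration function $P(T_S \mid N_S)$ is defined analogously.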
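The relative-frequency estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `estimate_prefix_probs`, the toy observations, and the Chinese characters used as target prefixes are all assumptions introduced here.

```python
from collections import Counter

def estimate_prefix_probs(prefix_pairs):
    """Relative-frequency estimate: P(TP | NP) = Count(TP, NP) / Count(NP)."""
    pairs = list(prefix_pairs)
    pair_counts = Counter(pairs)                            # Count(TP, NP)
    name_prefix_counts = Counter(np_ for np_, _ in pairs)   # Count(NP)
    return {
        (np_, tp): count / name_prefix_counts[np_]
        for (np_, tp), count in pair_counts.items()
    }

# Hypothetical toy observations (not the paper's data): the name prefix
# "par" seen three times with one target character and once with another.
toy_pairs = [("par", "帕")] * 3 + [("par", "巴")]
probs = estimate_prefix_probs(toy_pairs)
```

With these toy counts, the more frequent target prefix receives the larger share of the probability mass for "par", which is what lets it be preferred as a query expansion term.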
The prefixes and postfixes are each intended to correspond to a syllable in the two languages involved, so that the two prefixes correspond to each other (see Tables 2 and 3).
Due
to
the
differences
in
the
sound
inventory
,
the
Roman
prefix
corresponding
to
a
syllabic
prefix
in
Chinese
may
vary
,
ranging
from
a
consonant
,
a
vowel
,
or
a
consonant
followed
by
a
vowel
(
but
not
a
vowel
followed
by
a
consonant
)
.
So, such a Roman prefix is likely to have one to four letters. In contrast,
the
prefix
syllable
for
a
name
written
in
Chinese
is
readily
identifiable
.
Table 2. Sample cross-language morphological relationships between prefixes. Columns include: Transliteration Prefix (TP), NP Count, TP Count.
Table 3. Sample cross-language morphological relationships between postfixes. Columns include: Transliteration Postfix (TS), NS Count, Co-occ. Count.
We also observe that a preferred character (e.g., one romanized as "Ai") is often used for a Roman prefix (e.g., "a-" or "ir-"), while occasionally other homophonic characters (also romanized as "Ai") are used. The skewed distribution creates problems for reliable estimation of transliteration functions. To cope with this data sparseness problem, we use homophone classes and a function CL that maps homophonic characters to the same class number. For instance, the two characters romanized as "Ai" above are homophonic, and both are assigned the same class identifier (see Table 4 for more samples). Therefore, CL maps both characters to the same value, 275.
Table
4
.
Some
examples
of
classes
of
homophonic
characters
.
The
class
ID
of
each
class
is
assigned
arbitrarily.
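With homophonic classes of transliteration morphemes, we define the class-based transliteration probability as follows:

$$P(C \mid N_P) = \frac{\mathrm{Count}(C, N_P)}{\mathrm{Count}(N_P)}, \quad C = CL(T_P)$$

where $\mathrm{Count}(C, N_P)$ is the number of times any transliteration prefix belonging to class $C$ co-occurs with $N_P$. With class-based transliteration probabilities, we are able to cope with the difficulty of estimating parameters for rare events that are underrepresented in the training set. Table 5 shows that one of the characters romanized as "Ai" belongs to a homophonic class that co-occurs with "a-" 46 times, even though that character itself co-occurs with "a-" only once.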
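The pooling of counts over homophone classes can be illustrated with a short Python sketch. The mapping `HOMOPHONE_CLASS`, the stand-in characters, and the split of counts (45 and 1, chosen so that the pooled class count is 46) are assumptions made for illustration; only the pooled count of 46 and the single rare co-occurrence come from the text.

```python
from collections import Counter

# Stand-in homophone classes (characters and class IDs are illustrative only).
HOMOPHONE_CLASS = {"艾": 275, "埃": 275, "巴": 12}

def cl(char):
    """CL: map a transliteration character to its homophone class ID."""
    return HOMOPHONE_CLASS[char]

def class_prob(pair_counts, name_prefix_counts):
    """P(C | NP) = Count(C, NP) / Count(NP), pooling Count(TP, NP) by class."""
    pooled = Counter()
    for (np_, tp), count in pair_counts.items():
        pooled[(np_, cl(tp))] += count
    probs = {key: c / name_prefix_counts[key[0]] for key, c in pooled.items()}
    return probs, pooled

# Hypothetical counts: a frequent character and a rare homophone of it.
toy_pair_counts = {("a", "艾"): 45, ("a", "埃"): 1, ("a", "巴"): 4}
toy_np_counts = {"a": 50}
class_probs, pooled_counts = class_prob(toy_pair_counts, toy_np_counts)
```

The rare character inherits the evidence of its whole class, so a candidate that uses it is not penalized merely because that particular character is seldom seen in training.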
After
cross-language
relationships
for
prefixes
and
postfixes
are
automatically
trained
,
the
prefix
relationships
are
stored
as
prioritized
query
expansion
rules
.
In
addition
to
that
,
we
also
need
a
transliteration
probability
function
to
rank
candidate
transliterations
at
run-time
(
Section
4
)
.
To
cope
with
data
sparseness
,
we
consider
names
(
or
transliterations
)
with
the
same
prefix
(
or
postfix
)
as
a
class
.
With that in mind, we use both prefix and postfix to formulate an interpolation-based estimator for the name transliteration probability:

$$P(T \mid N) = \lambda_1 P(T_P \mid N_P) + \lambda_2 P(T_S \mid N_S)$$

where $\lambda_1 + \lambda_2 = 1$ and $N_P$, $N_S$, $T_P$, and $T_S$ are the prefix and postfix of the given name N and transliteration T. For instance, the probability of "Meisuobudamiya" as a transliteration of "Mesopotamia" is estimated by interpolating the transliteration probabilities of its prefix pair and its postfix pair.
(1) For each entry in the bilingual name list, pair up prefixes and postfixes in names and transliterations.
(2) Calculate counts of these affixes and their co-occurrences.
(3) Estimate the prefix and postfix transliteration functions.
(4) Estimate class-based prefix and postfix transliteration functions.
Figure 2. Outline of the process used to train the TermMine system.
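The interpolation-based estimator introduced before Figure 2 can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name, the dictionary layout of the probability tables, the default weight of 0.5, and the toy probabilities are not from the paper, and unseen affix pairs are simply given probability zero here.

```python
def interpolated_prob(name_prefix, name_postfix, tr_prefix, tr_postfix,
                      prefix_probs, postfix_probs, w_prefix=0.5):
    """Weighted sum of prefix and postfix probabilities; the weights sum to 1."""
    w_postfix = 1.0 - w_prefix
    p_pre = prefix_probs.get((name_prefix, tr_prefix), 0.0)     # P(TP | NP)
    p_post = postfix_probs.get((name_postfix, tr_postfix), 0.0)  # P(TS | NS)
    return w_prefix * p_pre + w_postfix * p_post

# Hypothetical probability tables for one name prefix and one name postfix.
toy_prefix_probs = {("me", "美"): 0.5}
toy_postfix_probs = {("a", "亞"): 0.25}
score = interpolated_prob("me", "a", "美", "亞",
                          toy_prefix_probs, toy_postfix_probs)
```

Using both ends of the name means a candidate must agree with the name at its first and last syllable to score well, which is a cheap filter against unrelated strings in a snippet.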
The
system
follows
the
procedure
shown
in
Figure
2
to
estimate
these
probabilities
.
In
Step
(
1
)
,
the
system
generates
all
possible
prefix
pairs
for
each
name-transliteration
pair
.
For instance, for the pair consisting of "Aabenraa" and its Chinese transliteration, the system generates eight pairs: the prefixes a-, aa-, aab-, and aabe-, each paired with the first character of the transliteration, and the postfixes -a, -aa, -raa, and -nraa, each paired with the last character of the transliteration.
Finally
,
the
transliteration
probabilities
are
estimated
based
on
the
counts
of
prefixes
,
postfixes
,
and
their
co-occurrences
.
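Step (1) of the training procedure, generating affix pairs from one name-transliteration pair, can be sketched as below. The function name `affix_pairs` is hypothetical, and the string "甲乙丙" is a placeholder standing in for a Chinese transliteration, not the actual transliteration of "Aabenraa".

```python
def affix_pairs(name, translit, max_letters=4):
    """Pair 1..max_letters leading/trailing letters of the name with the
    first/last character of the transliteration."""
    name = name.lower()
    limit = min(max_letters, len(name))
    prefixes = [(name[:k], translit[0]) for k in range(1, limit + 1)]
    postfixes = [(name[-k:], translit[-1]) for k in range(1, limit + 1)]
    return prefixes, postfixes

# "甲乙丙" is a placeholder string used only to show the pairing.
prefix_pairs, postfix_pairs = affix_pairs("Aabenraa", "甲乙丙")
```

For "Aabenraa" this yields four prefix pairs and four postfix pairs, eight in total, matching the example in the text. Counting these pairs over the whole name list gives the co-occurrence counts used to estimate the transliteration functions.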
The
derived
probabilities
embody
a
number
of
relationships
:
4
Transliteration
Search
and
Extraction
At
run-time
,
the
system
follows
the
procedure
in
Figure
3
to
process
the
given
name
.
In Step (1), the system looks up the prefix relationship table to find the n best relationships (n = MaxExpQueries) for query expansion, preferring relationships with higher probability.
For instance, to search for transliterations of "Acton," the system looks at all possible prefixes and postfixes of "Acton," including a-, ac-, act-, acto-, -n, -on, -ton, and -cton, and determines the best query expansions, each formed by appending a likely Chinese transliteration morpheme to "Acton."
These
effective
expansions
are
automatically
derived
during
the
training
stage
described
in
Section
3
by
analyzing
a
large
collection
of
name-transliteration
pairs
.
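The query expansion step can be sketched as a table lookup followed by ranking. This is a simplified sketch that assumes only prefix relationships are used as expansion rules (as stated in Section 3); the function name, the table layout, and the toy morphemes and probabilities are hypothetical.

```python
def expand_queries(name, prefix_table, max_exp_queries=10, max_prefix_letters=4):
    """Form expanded queries "<name> <morpheme>" from the best prefix rules.

    prefix_table maps a lowercase name prefix to {target morpheme: probability}.
    """
    lowered = name.lower()
    best = {}
    for k in range(1, min(max_prefix_letters, len(lowered)) + 1):
        for morpheme, prob in prefix_table.get(lowered[:k], {}).items():
            # Keep the highest probability seen for each target morpheme.
            best[morpheme] = max(prob, best.get(morpheme, 0.0))
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [f"{name} {morpheme}" for morpheme, _ in ranked[:max_exp_queries]]

# Hypothetical prefix rules for names starting with "a-" and "ac-".
toy_table = {"a": {"艾": 0.4, "阿": 0.3}, "ac": {"阿": 0.5}}
queries = expand_queries("Acton", toy_table, max_exp_queries=2)
```

Each returned string is meant to be sent to the search engine as a separate query, so that snippets containing both the name and a likely first character of its transliteration are ranked near the top.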
In
Step
(
2
)
,
the
system
sends
off
each
of
these
queries
to
a
search
engine
to
retrieve
up
to
MaxDocRetrieved
document
snippets
.
In Step (3), the system discards snippets that contain too small a proportion of target-language text. See Example (4) for a snippet that has a high proportion of English text and is therefore less likely to contain a transliteration.
In
Step
(
4
)
,
the
system
considers
the
substrings
in
the
remaining
snippets
.
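The snippet filter can be approximated by measuring the share of Chinese characters in each snippet. In this sketch the threshold value of 0.3, the function names, and the use of the CJK Unified Ideographs range as a proxy for "target-language text" are assumptions; the paper names the parameter MinTargetRate but its value is not given in the text.

```python
def target_rate(snippet):
    """Fraction of non-space characters that are CJK Unified Ideographs."""
    chars = [c for c in snippet if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)

def filter_snippets(snippets, min_target_rate=0.3):
    """Keep only snippets with enough target-language text."""
    return [s for s in snippets if target_rate(s) >= min_target_rate]

# Hypothetical snippets: one English-only, one mixed-code.
toy_snippets = [
    "Acton Systems is a world leading manufacturer",
    "Acton 艾克頓 新屋 資訊",
]
kept = filter_snippets(toy_snippets)
```

An English-only snippet has a rate of zero and is dropped, while a mixed-code snippet passes and remains available for candidate extraction.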
(1) Look up the table for the top MaxExpQueries prefix and postfix relationships relevant to the given name and use the target morphemes in the relationships to form expanded queries.
(2) Search for Web pages with the queries and filter out snippets containing less than MinTargetRate portion of target-language text.
(3) Evaluate candidates based on class-based transliteration probability (Equation 5).
(4) Output the top candidate for evaluation.
Figure 3. Outline of the steps used to search, extract, and rank transliterations.
Table 5. Sample data for class-based morphological transliteration probability of prefixes, where # of NP denotes the number of occurrences of the name prefix NP; # of C, NP denotes the number of all TP belonging to the class C that co-occur with NP; # of TP, NP denotes the number of times the transliteration prefix TP co-occurs with NP; P(C|NP) denotes the probability that a TP belonging to C co-occurs with NP; and P(TP|NP) denotes the probability that TP co-occurs with NP.
Table
6
.
Sample
data
for
class-based
morphological
transliteration
probability
of
postfixes
.
Notations
are
similar
to
those
for
Table
5
.
Class
ID
http://www.hkmassive.com/forum/viewthread.php?tid=2368&fpage=1 Watch the slide show! ...
(5) New Home Alert - Sing Tao New Homes. Please select, Acton [...], Ajax [...], Alliston [...], Ancaster [...], Arthur [...], Aurora [...], Ayr [...], Barrie [...], Beamsville, Belleville ...
Acton Systems is a world leading manufacturer supplying stuctured cabling systems suited to the Australian and New Zealand marketplace. [...] Custom made leads are now available ...
The
occurrence
counts
and
average
distance
from
instances
of
the
given
name
are
tallied
for
each
of
these
candidates
.
Candidates
with
a
low
occurrence
count
and
long
average
distance
are
excluded
from
further
consideration
.
Finally
,
all
candidates
are
evaluated
and
ranked
using
Equation
(
7
)
given
in
Section
3
.
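The tallying and pruning of candidates can be sketched as follows. This is a simplification under several assumptions: candidates are taken to be maximal runs of Chinese characters rather than all substrings, distance is measured in characters from the first occurrence of the name in a snippet, both thresholds are applied together, and the threshold values and toy snippets are hypothetical.

```python
import re
from collections import defaultdict

CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def tally_candidates(name, snippets, min_occ_count=2, max_avg_distance=20):
    """Return {candidate: (occurrence count, average distance)} after pruning."""
    distances = defaultdict(list)
    for snippet in snippets:
        pos = snippet.find(name)
        if pos < 0:
            continue  # the name does not occur in this snippet
        for match in CJK_RUN.finditer(snippet):
            if match.start() >= pos:
                gap = match.start() - (pos + len(name))  # candidate after name
            else:
                gap = pos - match.end()                  # candidate before name
            distances[match.group()].append(max(gap, 0))
    kept = {}
    for cand, ds in distances.items():
        avg = sum(ds) / len(ds)
        if len(ds) >= min_occ_count and avg <= max_avg_distance:
            kept[cand] = (len(ds), avg)
    return kept

toy_snippets = [
    "Acton 艾克頓 is a town",
    "新屋 艾克頓 Acton",
    "Acton Systems cabling 很遠的文字",
]
candidates = tally_candidates("Acton", toy_snippets)
```

Only the string that recurs close to the name survives; the surviving candidates would then be ranked by the transliteration probability.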
5
Evaluation
In the experiment carried out to assess the feasibility of the proposed method, a data set of 23,615 names and transliterations was used.
This
set
of
place
name
data
is
available
from
NICT
,
Taiwan
for
training
and
testing
.
There are 967 distinct Chinese characters in the data, and more details of the training data are available in Table 7. The English part consists of Romanized versions of names originating from many languages, including Western and Asian languages.
Most
of
the
time
,
the
names
come
with
a
Chinese
counterpart
based
solely
on
transliteration
.
But
occasionally
,
the
Chinese
counterpart
is
part
translation
and
part
transliteration
.
For instance, the city of "Southampton" has a Chinese counterpart consisting of a translation of "south" followed by a transliteration of "ampton".
Table 7. Type of data used in the experiment: name-transliteration pairs; training data; test data; distinct transliteration morphemes; distinct transliteration morphemes (80% coverage); names with part translation and part transliteration (estimated); cross-language prefix relationships; cross-language postfix relationships.
We
used
the
set
of
parameters
shown
in
Table
8
to
train
and
run
System
TermMine
.
A set of 500 randomly selected pairs was set aside for testing.
We
paired
up
the
prefixes
and
postfixes
in
the
remaining
23,116
pairs
,
by
taking
one
to
four
leading
or
trailing
letters
of
each Romanized place name
and
the
first
and
last
Chinese
transliteration
character
to
estimate
P
(
Tp
|
Np
)
and
P
(
Ts
|
Ns
)
.
Table 8. Parameters used to train and run TermMine.
Parameter: Description
MaxPrefixLetters: Max number of letters in a prefix
MaxPostfixLetters: Max number of letters in a postfix
MaxExpQueries: Max number of expanded queries
MaxDocRetrieved: Max number of documents retrieved
MinTargetRate: Min rate of target text in a snippet
MinOccCount: Min number of co-occurrences of query and transliteration candidate in snippets
MaxAvgDistance: Max distance between N and T
WeightPrefixProb: Weight of prefix probability
WeightPostfixProb: Weight of postfix probability ($\lambda_2$)
We
carried
out
two
kinds
of
evaluation
on
System
TermMine
,
with
and
without
query
expansion
.
With
QE
option
off
,
the
name
itself
was
sent
off
as
a
query
to
the
search
engine
,
while
with
QE
option
turned
on
,
up
to
10
expanded
queries
were
sent
for
each
name
.
We
also
evaluated
the
system
against
Google
Translate
and
Yahoo
!
Babelfish
.
We discarded the results when the names were returned untranslated.
After
that
,
we
checked
the
correctness
of
all
remaining
results
by
hand
.
Table
9
shows
a
sample
of
the
results
produced
by
the
three
systems
.
Table 10 shows the performance of TermMine with and without the query expansion option.
Without
QE
,
the
system
returns
transliterations
(
applicability
)
less
than
50
%
of
the
time
.
Nevertheless
,
there
are
enough
snippets
for
extracting
and
ranking
of
transliterations
.
The
precision
rate
of
the
top-ranking
transliterations
is
88
%
.
With
QE
turned
on
,
the
applicability
rate
increases
significantly
to
60
%
.
The precision rate also improved slightly, to 89%.
The
performance
evaluation
of
three
systems
is
shown
in
Table
11
.
For
the
test
set
of
500
place
names
,
Google
Translate
returned
146
transliterations
and
Yahoo
!
Babelfish
returned
only
44
,
while
TermMine
returned
300
.
Of
the
returned
transliterations
,
Google
Translate
and
Yahoo
!
Babelfish
achieved
a
precision
rate
around
50
%
,
while
TermMine
achieved
a
precision
rate
almost
as
high
as
90
%
.
The
results
show
that
System
TermMine
outperforms
both
commercial
MT
systems
by
a
wide
margin
,
in
the
area
of
machine
transliteration
of
proper
names
.
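The applicability rates implied by the counts above can be checked with a few lines of arithmetic. The returned counts and the test set size are taken from the text; the `f_measure` helper is an assumption, since the definition of the F-measure column in Table 10 is not given in the text, and it is shown here as the harmonic mean of applicability and precision.

```python
TEST_SET_SIZE = 500

# Number of names for which each system returned a transliteration.
returned = {"TermMine": 300, "Google Translate": 146, "Yahoo! Babelfish": 44}

# Applicability: share of test names with a returned transliteration.
applicability = {name: n / TEST_SET_SIZE for name, n in returned.items()}

def f_measure(applicability_rate, precision_rate):
    """Harmonic mean of applicability and precision (an assumed definition)."""
    total = applicability_rate + precision_rate
    if total == 0:
        return 0.0
    return 2 * applicability_rate * precision_rate / total

# TermMine with QE: applicability 0.60, precision 0.89 (from the text).
termmine_f = f_measure(applicability["TermMine"], 0.89)
```

Under this assumed definition, the gap between TermMine and the two MT services comes mostly from applicability: the services return a transliteration for well under a third of the test names.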
Table 9. Sample output by the three systems evaluated (TermMine, Google Translate, Yahoo! Babelfish). Sample names: Palmerston, Cootamundra, Australasia, Inverness, Lomonosov, Oskaloosa, Arlington.
Table 10. Performance evaluation of TermMine, with rows for TermMine QE- and TermMine QE+ and columns for # of cases performed, # of correct answers, applicability, precision, and F-measure.
6
Comparison
with
Previous
Work
Machine
transliteration
has
been
an
area
of
active
research
.
Most machine transliteration methods attempt to model the transliteration process of mapping between graphemes and phonemes.
Knight
and
Graehl
(
1998
)
proposed
a
multilayer
model
and
a
generate-and-test
approach
to
perform
back
transliteration
from
Japanese
to
English
based
on
the
model
.
See also Al-Onaizan and Knight (2002) and Oh et al. (2005). In our work, we address the issue of producing transliterations by way of search.
Recently, some machine transliteration studies have begun to consider the problem of extracting names and their transliterations from parallel corpora (Qu and Grefenstette 2004; Lin, Wu and Chang 2004; Lee and Chang 2003; Li and Grefenstette 2005).
Cao
and
Li
(
2002
)
described
a
new
method
for
base
noun
phrase
translation
by
using
Web
data
.
Kwok
,
et
al.
(
2001
)
described
a
system
called
CHINET
for
cross
language
name
search
.
Nagata
et
al.
(
2001
)
described
how
to
exploit
proximity
and
redundancy
to
extract
translation
for
a
given
term
.
Lu
,
Chien
,
and
Lee
(
2002
)
describe
a
method
for
name
translation
based
on
mining
of
anchor
texts
.
More
recently
,
Zhang
,
Huang
,
and
Vogel
(
2005
)
proposed
to
use
occurring
words
to
expand
queries
for
searching
and
extracting
transliterations
.
Oh
and
Isahara
(
2006
)
use
phonetic-similarity
to
recognize
transliteration
pairs
on
the
Web
.
In
contrast
to
previous
work
,
we
propose
a
simple
method
for
extracting
transliterations
based
on
a
statistical
model
trained
automatically
on
a
bilingual
name
list
via
unsupervised
learning
.
We
also
carried
out
experiments
and
evaluation
of
training
and
applying
the
proposed
model
to
extract
transliterations
by using the Web as a corpus.
7
Conclusion
and
Future
Work
Morphological
query
expansion
represents
an
innovative
way
to
capture
cross-language
relations
in
name
transliteration
.
The method is independent of the bilingual lexicon content, making it easy to adapt to other proper names such as person, product, or organization names.
This
approach
is
useful
in
a
number
of
machine
translation
subtasks
,
including
name
transliteration
,
back
transliteration
,
named
entity
translation
,
and
terminology
translation
.
Many
opportunities
exist
for
future
research
and
improvement
of
the
proposed
approach
.
First
,
the
method
explored
here
can
be
extended
as
an
alternative
way
to
support
such
MT
subtasks
as
back
transliteration
(
Knight
and
Graehl
1998
)
and
noun
phrase
translation
(
Koehn
and
Knight
2003
)
.
Finally
,
for
more
challenging
MT
tasks
,
such
as
handling
sentences
,
the
improvement
of
translation
quality
probably
will
also
be
achieved
by
combining
this
IR-based
approach
and
statistical
machine
translation
.
For example, a pre-processing unit may replace the proper names in a sentence with transliterations (e.g., producing mixed-code text in which "Mesopotamia," "Parthian," and "Sassanian" in "The cities of Mesopotamia prospered under Parthian and Sassanian rule." are replaced by their Chinese transliterations) before sending it off to MT for final translation.
