We consider here the problem of Base Noun Phrase translation, and we propose a new method to perform the task. For a given Base NP, we first search for its translation candidates on the web. We next determine the possible translation(s) from among the candidates using one of two methods that we have developed. In one method, we employ an ensemble of Naïve Bayesian Classifiers constructed with the EM Algorithm. In the other method, we use TF-IDF vectors, also constructed with the EM Algorithm. Experimental results indicate that the coverage and accuracy of our method are significantly better than those of baseline methods relying on existing technologies.
Introduction
We address here the problem of Base NP translation, in which, for a given Base Noun Phrase in a source language (e.g., 'information age' in English), we are to find its possible translation(s) in a target language (e.g., '…' in Chinese). We define a Base NP as a simple and non-recursive noun phrase. In many cases, Base NPs represent holistic and non-divisible concepts, and thus translating them accurately from one language to another is extremely important in applications like machine translation, cross-language information retrieval, and foreign language writing assistance.
In this paper, we propose a new method for Base NP translation, which consists of two steps: (1) translation candidate collection, and (2) translation selection. In translation candidate collection, for a given Base NP in the source language, we look for its translation candidates in the target language. To do so, we use a word-to-word translation dictionary and corpus data in the target language on the web.

(Hang Li, Microsoft Research Asia, hangli@microsoft.com)
In translation selection, we determine the possible translation(s) from among the candidates. We use non-parallel corpus data in the two languages on the web and employ one of two methods that we have developed. In the first method, we view the problem as one of classification and employ an ensemble of Naïve Bayesian Classifiers constructed with the EM Algorithm. We will use 'EM-NBC-Ensemble' to denote this method hereafter. In the second method, we view the problem as one of calculating similarities between context vectors and use TF-IDF vectors, also constructed with the EM Algorithm. We will use 'EM-TF-IDF' to denote this method.
Experimental results indicate that our method is very effective: the coverage and top-3 accuracy of translation at the final stage are 91.4% and 79.8%, respectively. The results are significantly better than those of baseline methods relying on existing technologies. The higher performance of our method can be attributed to the enormity of the web data used and the employment of the EM Algorithm.
2.1 Translation with Non-parallel Corpora
A straightforward approach to word or phrase translation is to perform the task by using parallel bilingual corpora (e.g., Brown et al., 1993). Parallel corpora are, however, difficult to obtain in practice. To deal with this difficulty, a number of methods have been proposed which make use of more easily obtainable non-parallel corpora (e.g., Fung and Yee, 1998; Rapp, 1999; Diab and Finch, 2000). Within these methods, it is usually assumed that a number of translation candidates for a word or phrase are given (or can be easily collected), and the problem is focused on translation selection.
All of the proposed methods manage to find the translation(s) of a given word or phrase on the basis of the linguistic phenomenon that the contexts of a translation tend to be similar to the contexts of the given word or phrase. Fung and Yee (1998), for example, proposed to represent the contexts of a word or phrase with a real-valued vector (e.g., a TF-IDF vector), in which one element corresponds to one word in the contexts. In translation selection, they select the translation candidates whose context vectors are the closest to that of the given word or phrase.
The context vector of the word or phrase to be translated corresponds to words in the source language, while the context vector of a translation candidate corresponds to words in the target language; furthermore, the words in the source language and those in the target language have a many-to-many relationship (i.e., translation ambiguities). It is therefore necessary to accurately transform the context vector in the source language into a context vector in the target language before distance calculation. This vector-transformation problem, however, was not well resolved previously.
Fung and Yee assumed that in a specific domain there is only a one-to-one mapping relationship between words in the two languages. The assumption is reasonable in a specific domain, but is too strict in the general domain, in which we presume to perform translation here. A straightforward extension of Fung and Yee's assumption to the general domain is to restrict the many-to-many relationship to a many-to-one (or one-to-one) mapping. This approach, however, has the drawback of losing information in vector transformation, as will be described.
For other methods using non-parallel corpora, see also (Tanaka and Iwasaki, 1996; Kikui, 1999; Koehn and Knight, 2000; Sumita, 2000; Nakagawa, 2001; Gao et al., 2001).
2.2 Translation Using Web Data
The web is an extremely rich source of data for natural language processing, not only in terms of data size but also in terms of data type (e.g., multilingual data, link data). Recently, a new trend has arisen in natural language processing that seeks breakthroughs for the field through the effective use of web data (e.g., Brill et al., 2001).
Nagata et al. (2001), for example, proposed to collect partial parallel corpus data on the web to create a translation dictionary. They observed that there are many partial parallel corpora between English and Japanese on the web; most typically, English translations of Japanese terms (words or phrases) are parenthesized and inserted immediately after the Japanese terms in documents written in Japanese.
Base Noun Phrase Translation
Our method for Base NP translation comprises two steps: translation candidate collection and translation selection. In translation candidate collection, we look for translation candidates of a given Base NP. In translation selection, we find possible translation(s) from among the translation candidates. In this paper, we confine ourselves to the translation of noun-noun pairs from English to Chinese; our method, however, can be extended to translations of other types of Base NPs between other language pairs.
3.1 Translation Candidate Collection
We use heuristics for translation candidate collection. Figure 1 illustrates the process of collecting Chinese translation candidates for the English Base NP 'information age' with the heuristics.
1. Consult an English-Chinese word translation dictionary for each English word (e.g., information -> { … }, age -> { … });
2. Compositionally create translation candidates in Chinese;
3. Search the candidates on web sites in Chinese and obtain their document frequencies (i.e., the numbers of documents containing them);
4. Output the candidates having non-zero document frequencies, together with their document frequencies.

Figure 1. Translation candidate collection
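The collection procedure in Figure 1 can be sketched as follows. Everything here is an illustrative assumption, not the authors' implementation: the dictionary, the romanized stand-in words, and the `doc_freq` callable that stands in for a web search returning document counts.

```python
from itertools import product

def collect_candidates(words, dictionary, doc_freq):
    """Compositionally build translation candidates for a noun-noun pair
    and keep those with non-zero document frequency on the web.
    `dictionary` maps a source word to its list of target-word translations;
    `doc_freq` stands in for a web search returning document counts."""
    per_word = [dictionary[w] for w in words]               # step 1: dictionary lookup
    candidates = ["".join(c) for c in product(*per_word)]   # step 2: compose candidates
    freqs = {c: doc_freq(c) for c in candidates}            # step 3: search the web
    return {c: f for c, f in freqs.items() if f > 0}        # step 4: keep non-zero DF

# toy example with a hypothetical dictionary and frequency table
dictionary = {"information": ["xinxi", "qingbao"], "age": ["shidai", "nianling"]}
web_counts = {"xinxishidai": 120, "qingbaoshidai": 3}
result = collect_candidates(["information", "age"],
                            dictionary, lambda c: web_counts.get(c, 0))
```

Only the two composed candidates that actually occur in `web_counts` survive, together with their document frequencies.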
3.2 Translation Selection

EM-NBC-Ensemble
We view the translation selection problem as one of classification and employ EM-NBC-Ensemble to perform the task. For ease of explanation, we first describe the algorithm using only a single EM-NBC and next extend it to the use of EM-NBC-Ensemble.

Basic Algorithm
Let E denote a set of words in English, and C a set of words in Chinese. Suppose that |E| = m and |C| = n. Let e represent a random variable on E, and c a random variable on C. Figure 2 describes the algorithm.
Figure 2. Algorithm of EM-NBC-Ensemble (among its steps: estimating the prior probabilities with Maximum Likelihood Estimation)
Context Information

As input data, we use 'contexts' in English which contain the phrase to be translated. We also use contexts in Chinese which contain the translation candidates. Here, a context containing a phrase is defined as the surrounding words within a window of a predetermined size covering the phrase. We can easily obtain the data by searching for them on the web. Actually, the contexts containing the candidates are obtained at the same time as translation candidate collection (Step 4 in Figure 1).
EM Algorithm

We define a relation between E and C as R ⊆ E × C, which represents the links in a translation dictionary. We further define Γc = {e | (e, c) ∈ R}. We estimate the parameters of the distribution by using the Expectation and Maximization (EM) Algorithm (Dempster et al., 1977); that is, we estimate the parameters by iteratively updating them until they converge (cf. Figure 3). Finally, we calculate f_E(c) for all c ∈ C and create a frequency vector in Chinese, D = (f_E(c_1), f_E(c_2), ..., f_E(c_n)).
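The frequency-vector transformation above can be sketched with a small EM loop that distributes each English word's frequency over its dictionary translations. This is only a sketch of the idea, not the authors' exact formulation; the words, links, and frequencies are invented for illustration.

```python
def em_transform(freq_en, links, iters=50):
    """Distribute English word frequencies over their Chinese translations
    with a simple EM loop, producing the transformed vector f_E(c).
    `freq_en` maps English word -> frequency; `links` maps English word ->
    its list of dictionary translations (a sketch, with invented data)."""
    chinese = {c for cs in links.values() for c in cs}
    p = {c: 1.0 / len(chinese) for c in chinese}       # uniform initialization
    f = dict.fromkeys(chinese, 0.0)
    for _ in range(iters):
        f = dict.fromkeys(chinese, 0.0)
        for e, fe in freq_en.items():                  # E-step: split f(e) by P(c)
            z = sum(p[c] for c in links[e])
            for c in links[e]:
                f[c] += fe * p[c] / z
        total = sum(f.values())
        p = {c: f[c] / total for c in chinese}         # M-step: renormalize
    return f

# 'qingbao' is linked from both English words, so it accumulates mass
freq_en = {"information": 10, "intelligence": 6}
links = {"information": ["xinxi", "qingbao"], "intelligence": ["qingbao", "zhili"]}
fE = em_transform(freq_en, links)
```

Each E-step redistributes the full English frequency mass, so the total of `fE` stays equal to the total English frequency, while shared translations receive proportionally more of it.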
Prior Probability Estimation

At Step 2, we approximately estimate the prior probability P(c) by using the document frequencies of the translation candidates. The data are obtained when we conduct candidate collection (Step 4 in Figure 1).
At Step 2, we use an EM-based Naïve Bayesian Classifier (EM-NBC) to select the candidates c whose posterior probabilities are the largest. The selection rule is based on Bayes' rule and the assumption that the data in D are independently generated from P(e | c), c ∈ C.
In our implementation, we use an equivalent formula (Equation (4)), in which α > 1 is an additional parameter used to emphasize the prior information.
If we ignore the first term in Equation (4), then the use of one EM-NBC turns out to select the candidate whose frequency vector is the closest to the transformed vector D in terms of KL divergence (cf. Cover and Thomas, 1991).
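The Naïve Bayesian selection rule with an emphasized prior can be sketched as follows. The probability tables, candidate names, and the floor value for unseen words are all illustrative assumptions of this sketch, not the authors' parameters.

```python
import math

def nbc_score(candidate_prior, p_e_given_c, context_words, alpha=5):
    """Score one candidate c by alpha*log P(c) + sum over context words e
    of log P(e|c): a Naive-Bayes log-posterior with the prior emphasized
    by alpha (alpha=5 mirrors the setting reported in the experiments).
    All probability tables here are invented for illustration."""
    score = alpha * math.log(candidate_prior)
    for e in context_words:
        score += math.log(p_e_given_c.get(e, 1e-6))  # small floor for unseen words
    return score

# hypothetical candidates with priors derived from document frequencies
priors = {"cand_a": 0.7, "cand_b": 0.3}
p_e = {"cand_a": {"computer": 0.01, "network": 0.02},
       "cand_b": {"computer": 0.2, "network": 0.3}}
context = ["computer", "network", "computer"]
best = max(priors, key=lambda c: nbc_score(priors[c], p_e[c], context))
```

Even though `cand_a` has the larger prior, the context likelihoods dominate here, so the classifier picks `cand_b`.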
To further improve performance, we use an ensemble (i.e., a linear combination) of classifiers constructed on the basis of the data in different contexts with different window sizes. More specifically, we calculate the final posterior as a linear combination over contexts, where D_i (i = 1, ..., s) denotes the data in the i-th context.
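The linear combination over per-window classifiers can be sketched as below; equal weights and the toy posteriors are assumptions of this sketch, since the text does not specify the weighting.

```python
def ensemble_posterior(posteriors_per_window):
    """Combine Naive-Bayes posteriors from classifiers built on contexts of
    different window sizes by a simple linear combination (equal weights
    are an assumption here). `posteriors_per_window` is a list of dicts
    mapping candidate -> P(c | D_i)."""
    s = len(posteriors_per_window)
    combined = {}
    for post in posteriors_per_window:
        for c, p in post.items():
            combined[c] = combined.get(c, 0.0) + p / s
    return combined

# two hypothetical windows that disagree on the top candidate
w1 = {"cand_a": 0.6, "cand_b": 0.4}
w2 = {"cand_a": 0.2, "cand_b": 0.8}
combined = ensemble_posterior([w1, w2])
```

Averaging lets a strong preference in one window outvote a weak preference in another, here favoring `cand_b`.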
3.3 EM-TF-IDF

We view the translation selection problem as one of calculating similarities between context vectors and use as context vectors TF-IDF vectors constructed with the EM Algorithm.
Figure 4 describes the algorithm, in which we use the same notations as those in EM-NBC-Ensemble.
The idf value of a Chinese word c is calculated in advance as

idf(c) = -log(df(c)/F)    (6)

where df(c) denotes the document frequency of c and F the total document frequency.
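TF-IDF construction with Equation (6) and similarity-based selection can be sketched as follows. The word names, counts, and the use of cosine similarity as the vector distance are illustrative assumptions of this sketch.

```python
import math

def tfidf(tf, df, F):
    """Build a TF-IDF vector from term frequencies `tf`, document
    frequencies `df`, and total document frequency F, using
    idf(c) = -log(df(c)/F) as in Equation (6)."""
    return {w: tf[w] * -math.log(df[w] / F) for w in tf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u.get(w, 0.0) * v.get(w, 0.0) for w in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# hypothetical context vectors: pick the candidate closest to the source vector
df, F = {"net": 50, "era": 20, "year": 400}, 1000
src = tfidf({"net": 3, "era": 2}, df, F)
cand1 = tfidf({"net": 2, "era": 1}, df, F)   # shares context words with src
cand2 = tfidf({"year": 5}, df, F)            # shares none
```

The candidate whose context overlaps the source phrase's context scores higher, which is the selection criterion the text describes.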
Figure 4. Algorithm of EM-TF-IDF (its steps apply the EM algorithm and create a TF-IDF vector)
3.4 Advantage of Using EM Algorithm

The use of EM-NBC-Ensemble and EM-TF-IDF can be viewed as an extension of existing methods for word or phrase translation using non-parallel corpora. In particular, the use of the EM Algorithm can help to accurately transform a frequency vector from one language to another.
Suppose that we are to determine whether '…' is a translation of 'information age' (actually it is). The frequency vectors of context words for 'information age' and '…' are given as vectors A and D in Figure 5, respectively.
If, for each English word, we only retain the link connecting it to the Chinese translation with the largest frequency (a link represented as a solid line) so as to establish a many-to-one mapping, and transform vector A from English to Chinese accordingly, we obtain vector B. It turns out, however, that vector B is quite different from vector D, although they should be similar to each other. We will refer to this method as 'Major Translation' hereafter.
With EM, vector A in Figure 5 is transformed into vector C, which is much closer to vector D, as expected. Specifically, EM can split the frequency of a word in English and distribute it among the word's translations in Chinese in a theoretically sound way (cf. the distributed frequencies of 'internet').
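For contrast with the EM transformation, the Major Translation transformation described above can be sketched as follows; the words, links, and frequencies are invented for illustration. Note how the entire frequency of 'internet' lands on a single translation, which is exactly the information loss the text describes.

```python
def major_translation(freq_en, links, target_freq):
    """Transform an English frequency vector by keeping, for each English
    word, only the link to its most frequent Chinese translation
    ('Major Translation'). `target_freq` supplies the Chinese word
    frequencies used to pick that single link (illustrative data only)."""
    out = {}
    for e, fe in freq_en.items():
        best = max(links[e], key=lambda c: target_freq.get(c, 0))
        out[best] = out.get(best, 0) + fe   # the whole frequency goes to one link
    return out

# 'internet' has two plausible translations, but only the more frequent
# one survives the many-to-one mapping
freq_en = {"internet": 10}
links = {"internet": ["hulianwang", "wangluo"]}
b = major_translation(freq_en, links, {"hulianwang": 9, "wangluo": 7})
```

Under EM, the 10 occurrences would instead be split between both translations in proportion to their estimated probabilities.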
Figure 5. Example of frequency vector transformation

Note that if we assume a many-to-one (or one-to-one) mapping relationship, then the use of EM turns out to be equivalent to that of Major Translation.
In order to further boost translation performance, we propose to also use the translation method proposed in Nagata et al. (2001). Specifically, we combine our method with that of Nagata et al. by using a back-off strategy.
1. Input 'information asymmetry';
2. Search the English Base NP on web sites in Chinese and obtain documents containing patterns like '… (information asymmetry) …' (i.e., partial parallel corpora);
3. Find the most frequently occurring Chinese phrases immediately before the brackets containing the English Base NP, using a suffix tree;
4. Output the Chinese phrases and their document frequencies.

Figure 6. Nagata et al.'s method
Figure 6 illustrates the process of collecting Chinese translation candidates for the English Base NP 'information asymmetry' with Nagata et al.'s method.
In the combination of the two methods, we first use Nagata et al.'s method to perform translation; if it cannot find translations, we next use our method. We will denote this strategy 'Back-off'.
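The Back-off combination is simple enough to state directly in code. Both method arguments below are illustrative stand-in callables, not the actual systems.

```python
def backoff_translate(base_np, nagata_method, our_method):
    """Back-off combination: try Nagata et al.'s high-precision method
    first, and fall back to our method when it yields no translation.
    The two callables are stand-ins for the real systems."""
    translations = nagata_method(base_np)
    return translations if translations else our_method(base_np)

# hypothetical outputs of the two systems
nagata_out = {"information asymmetry": ["trans_x"]}
our_out = {"information age": ["trans_y"], "information asymmetry": ["trans_z"]}
r1 = backoff_translate("information asymmetry",
                       lambda p: nagata_out.get(p, []),
                       lambda p: our_out.get(p, []))
r2 = backoff_translate("information age",
                       lambda p: nagata_out.get(p, []),
                       lambda p: our_out.get(p, []))
```

The first phrase is covered by the partial-parallel method and keeps its answer; the second is not, so the fall-back method supplies the translation.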
Experimental Results

We conducted experiments on the translation of Base NPs from English to Chinese; 3000 Base NPs were extracted for the experiments.
In the experiments, we used the HIT English-Chinese word translation dictionary.² The dictionary contains about 76,000 Chinese words, 60,000 English words, and 118,000 translation links.
As a web search engine, we used Google (http://www.google.com).
Five translation experts evaluated the translation results by judging whether or not they were acceptable. The evaluations reported below are all based on their judgments.
We performed translation selection with both EM-NBC-Ensemble and EM-TF-IDF.
Table 1. Best translation result for each method (rows: EM-NBC-Ensemble, MT-NBC-Ensemble, EM-KL-Ensemble, EM-TF-IDF, MT-TF-IDF; the numeric entries are not reproduced here)
Table 1 shows the results in terms of coverage and top-n accuracy. Here, coverage is defined as the percentage of phrases which have translations selected, while top-n accuracy is defined as the percentage of phrases whose selected top n translations include a correct translation.
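The two evaluation measures just defined can be computed as below. The data are invented, and taking all phrases as the denominator for top-n accuracy is an assumption of this sketch (the text does not state the denominator explicitly).

```python
def coverage_and_topn(selected, gold, n=3):
    """Compute coverage (share of phrases with any selected translation)
    and top-n accuracy (share of phrases whose top-n selections include a
    correct translation). `selected` maps phrase -> ranked translation
    list; `gold` maps phrase -> set of acceptable translations."""
    total = len(gold)
    covered = [p for p in gold if selected.get(p)]
    hits = [p for p in covered if set(selected[p][:n]) & gold[p]]
    return len(covered) / total, len(hits) / total

# three hypothetical phrases: one correct, one wrong, one with no output
selected = {"p1": ["a", "b"], "p2": ["x"], "p3": []}
gold = {"p1": {"b"}, "p2": {"y"}, "p3": {"z"}}
cov, acc = coverage_and_topn(selected, gold)
```

Here two of three phrases receive some translation (coverage 2/3), but only one of them has a correct translation in its top 3 (accuracy 1/3).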
For EM-NBC-Ensemble, we set the α in Equation (4) to 5 on the basis of our preliminary experimental results.
For EM-TF-IDF, we used the non-web data described in Section 4.4 to estimate the idf values of words.
We used contexts with window sizes of ±1, ±3, ±5, ±7, ±9, and ±11.
¹ http://encarta.msn.com/Default.asp
² The dictionary is created by the Harbin Institute of Technology.
Figure 7. Translation results

Figure 7 shows the results of EM-NBC-Ensemble and EM-TF-IDF, in which for EM-NBC-Ensemble 'window size' denotes the largest window size within an ensemble. Table 1 summarizes the best results for each of them.
'Prior' and 'MT-TF-IDF' are baseline methods relying on existing technologies. In Prior, we select the candidates whose prior probabilities are the largest or, equivalently, whose document frequencies obtained in translation candidate collection are the largest. In MT-TF-IDF, we use TF-IDF vectors transformed with Major Translation.
Our experimental results indicate that both EM-NBC-Ensemble and EM-TF-IDF significantly outperform Prior and MT-TF-IDF when appropriate window sizes are chosen. The p-values of the sign tests are 0.00056 and 0.00133 for EM-NBC-Ensemble, and 0.00002 and 0.00901 for EM-TF-IDF, respectively.
We next removed each of the key components of EM-NBC-Ensemble and used the remaining components as a variant of it to perform translation selection. The key components are: (1) distance calculation by KL divergence, (2) EM, (3) prior probability, and (4) ensemble.
The variants thus respectively make use of (1) the baseline method 'Prior', (2) an ensemble of Naïve Bayesian Classifiers based on Major Translation (MT-NBC-Ensemble), and (3) an ensemble of EM-based KL divergence calculations (EM-KL-Ensemble). Figure 7 and Table 1 show the results.
We see that EM-NBC-Ensemble outperforms all of the variants, indicating that all of the components within EM-NBC-Ensemble play positive roles.
We also removed each of the key components of EM-TF-IDF and used the remaining components as a variant of it to perform translation selection. The key components are (1) the idf values and (2) EM. The variants thus respectively make use of (1) EM-based frequency vectors (EM-TF) and (2) the baseline method MT-TF-IDF.
Figure 7 and Table 1 show the results. We see that EM-TF-IDF outperforms both variants, indicating that all of the components within EM-TF-IDF are needed.
Comparing the results between MT-NBC-Ensemble and EM-NBC-Ensemble, and between MT-TF-IDF and EM-TF-IDF, we see that the use of the EM Algorithm can indeed help to improve translation accuracy.
Table 2. Sample of translation outputs (Base NPs: calcium ion, adventure tale, lung cancer, aircraft carrier, adult literacy; the Chinese translations are not reproduced here)

Table 2 shows the translations of five Base NPs as output by EM-NBC-Ensemble, in which the translations marked with * were judged incorrect by human experts.
We analyzed the reasons for the incorrect translations and found that they were due to: (1) missing dictionary entries (19%), (2) non-compositional translations (13%), and (3) ranking errors (68%).
Table 3. Results of our method and Nagata et al.'s method (the numeric entries are not reproduced here)

We next used Nagata et al.'s method to perform translation.
From Table 3, we can see that the accuracy of Nagata et al.'s method is higher than that of our method, but its coverage is lower. The results indicate that our proposed Back-off strategy for translation is justifiable.
Table 4. Back-off (Ensemble)

In this experiment, we tested the Back-off strategy; Table 4 shows the results.
The Back-off strategy helps to further improve the results, whether EM-NBC-Ensemble or EM-TF-IDF is used.
To test the effectiveness of the use of web data, we conducted another experiment in which we performed translation using non-web data.
The data comprised the Wall Street Journal corpus in English (1987-1992, 500MB) and the People's Daily corpus in Chinese (1982-1998, 700MB).
We followed the Back-off strategy as in Section 4.3 to translate the 1000 Base NPs.
Table 5. Translation results (rows: Web (EM-NBC-Ensemble) and Non-web (EM-NBC-Ensemble); the coverage and accuracy values are not reproduced here)
The results in Table 5 show that using web data yields better results than not using it, even though the sizes of the non-web data we used were considerably large.
For Nagata et al.'s method, we found that it was almost impossible to find partial parallel corpora in the non-web data.
Conclusions
This paper has proposed a new and effective method for Base NP translation that uses web data and the EM Algorithm. Experimental results show that it outperforms baseline methods based on existing techniques, mainly due to the employment of EM. Experimental results also show that using web data is more effective than not using it. Future work includes applying the proposed method to the translation of other types of Base NPs and to other language pairs.
Acknowledgements
We thank Ming Zhou, Chang-Ning Huang, Jianfeng Gao, and Ashley Chang for many helpful discussions on this research project. We also acknowledge Shenjie Li for help with program coding.
