We
present
the
idea
of
estimating
semantic
distance
in
one
,
possibly
resource-poor
,
language
using
a
knowledge
source
in
another
,
possibly
resource-rich
,
language
.
We
do
so
by
creating
cross-lingual
distributional
profiles
of
concepts
,
using
a
bilingual
lexicon
and
a
bootstrapping
algorithm
,
but
without
the
use
of
any
sense-annotated
data
or
word-aligned
corpora
.
The
cross-lingual
measures
of
semantic
distance
are
evaluated
on
two
tasks
:
(
1
)
estimating
semantic
distance
between
words
and
ranking
the
word
pairs
according
to
semantic
distance
,
and
(
2
)
solving
Reader
's
Digest
'
Word
Power
'
problems
.
In
task
(
1
)
,
cross-lingual
measures
are
superior
to
conventional
monolingual
measures
based
on
a
wordnet
.
In
task
(
2
)
,
cross-lingual
measures
are
able
to
solve
more
problems
correctly
,
and
despite
scores
being
affected
by
many
tied
answers
,
their
overall
performance
is
again
better
than
the
best
monolingual
measures
.
1
Introduction
Accurately
estimating
the
semantic
distance
between
concepts
or
between
words
in
context
has
pervasive
applications
in
computational
linguistics
,
including
machine
translation
,
information
retrieval
,
speech
recognition
,
spelling
correction
,
and
text
categorization
(
see
Budanitsky
and
Hirst
(
2006
)
for
discussion
)
,
and
it
is
becoming
clear
that
basing
such
measures
on
a
combination
of
corpus
statistics
with
a
knowledge
source
,
such
as
a
dictionary
,
published
thesaurus
,
or
WordNet
,
can
result
in
higher
accuracies
(
Mohammad
and
Hirst
,
2006b
)
.
This
is
because
such
knowledge
sources
capture
semantic
information
about
concepts
and
,
to
some
extent
,
world
knowledge
.
They
also
act
as
sense
inventories
for
the
words
in
a
language
.
However
,
applying
algorithms
for
semantic
distance
to
most
languages
is
hindered
by
the
lack
of
linguistic
resources
.
In
this
paper
,
we
propose
a
new
method
that
allows
us
to
compute
semantic
distance
in
a
possibly
resource-poor
language
by
seamlessly
combining
its
text
with
a
knowledge
source
in
a
different
,
preferably
resource-rich
,
language
.
We
demonstrate
the
approach
by
combining
German
text
with
an
English
thesaurus
to
create
English-German
distributional
profiles
of
concepts
,
which
in
turn
will
be
used
to
measure
the
semantic
distance
between
German
words
.
Two
classes
of
methods
have
been
used
in
determining
semantic
distance
.
Semantic
measures
of
concept-distance
,
such
as
those
of
Jiang
and
Conrath
(
1997
)
and
Resnik
(
1995
)
,
rely
on
the
structure
of
a
knowledge
source
,
such
as
WordNet
,
to
determine
the
distance
between
two
concepts
defined
in
it
(
see
Budanitsky
and
Hirst
(
2006
)
for
a
survey
)
.
Distributional measures of word-distance [1], such as cosine and α-skew divergence (Lee, 2001), deem two words to be closer or less distant if they occur in similar contexts (see Mohammad and Hirst (2005) for a comprehensive survey).
[Footnote 1: Many distributional approaches represent the sets of contexts of the target words as points in multidimensional co-occurrence space or as co-occurrence distributions. A measure, such as cosine, that captures vector distance or a measure, such as α-skew divergence, that captures distance between distributions is then used to measure distributional distance. We will therefore refer to these measures as distributional measures.]
[Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 571-580, Prague, June 2007. © 2007 Association for Computational Linguistics]
Distributional
measures
rely
simply
on
raw
text
and
possibly
some
shallow
syntactic
processing
.
They
do
not
require
any
other
manually-created
resource
,
and
tend
to
have
a
higher
coverage
.
However
,
by
themselves
they
perform
poorly
when
compared
to
semantic
measures
(
Mohammad
and
Hirst
,
2006b
)
because
when
given
a
target
word
pair
we
usually
need
the
distance
between
their
closest
senses
,
but
distributional
measures
of
word-distance
tend
to
conflate
the
distances
between
all
possible
sense
pairs
.
Latent
semantic
analysis
(
LSA
)
(
Landauer
et
al.
,
1998
)
has
also
been
used
to
measure
distributional
distance
with
encouraging
results
(
Rapp
,
2003
)
.
However
,
it
too
measures
the
distance
between
words
and
not
senses
.
Further
,
the
dimensionality
reduction
inherent
to
LSA
has
the
effect
of
making
the
predominant
sense
more
dominant
while
de-emphasizing
the
other
senses
.
Therefore
,
an
LSA-based
approach
will
also
conflate
information
from
the
different
senses
,
and
even
more
emphasis
will
be
placed
on
the
predominant
senses
.
Given
the
semantically
close
target
nouns
play
and
actor
,
for
example
,
a
distributional
measure
will
give
a
score
that
is
some
sort
of
a
dominance-based
average
of
the
distances
between
their
senses
.
The
noun
play
has
the
predominant
sense
of
'
children
's
recreation
'
(
and
not
'
drama
'
)
,
so
a
distributional
measure
will
tend
to
give
the
target
pair
a
large
(
and
thus
erroneous
)
distance
score
.
Also
,
distributional
word-distance
approaches
need
to
create
large
V
x
V
cooccurrence
and
distance
matrices
,
where
V
is
the
size
of
the
vocabulary
(
usually
at
least
100,000
)
.
Mohammad
and
Hirst
(
2006b
)
proposed
a
way
of
combining
written
text
with
a
published
thesaurus
to
measure
distance
between
concepts
(
or
word
senses
)
using
distributional
measures
,
thereby
eliminating
sense-conflation
and
achieving
results
better
than
the
simple
word-distance
measures
and
indeed
also
most
of
the
WordNet-based
semantic
measures
.
We
called
these
measures
distributional
measures
of
concept-distance
.
[Footnote 2: LSA is especially expensive as singular value decomposition, a key component for dimensionality reduction, requires computationally intensive matrix operations, making it less scalable to large amounts of text (Gorman and Curran, 2006).]
Concept-distance
measures
can
be
used
to
measure
distance
between
a
word
pair
by
choosing
the
distance
between
their
closest
senses
.
Thus
,
even
though
'
children
's
recreation
'
is
the
predominant
sense
of
play
,
the
'
drama
'
sense
is
much
closer
to
actor
and
so
their
distance
will
be
chosen
.
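The closest-sense rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the sense labels, the distance values, and the names `word_distance`, `senses_of`, and `toy_concept_distance` are all hypothetical.

```python
from itertools import product


def word_distance(word1, word2, senses_of, concept_distance):
    """Distance between two words = distance between their closest senses."""
    sense_pairs = product(senses_of[word1], senses_of[word2])
    return min(concept_distance(c1, c2) for c1, c2 in sense_pairs)


# Toy sense inventory and concept distances (hypothetical values).
senses_of = {
    "play": ["children's recreation", "drama"],
    "actor": ["performer"],
}
toy_distances = {
    frozenset(["children's recreation", "performer"]): 0.9,
    frozenset(["drama", "performer"]): 0.2,
}


def toy_concept_distance(c1, c2):
    """Look up a symmetric distance between two concepts."""
    return toy_distances[frozenset([c1, c2])]


# The 'drama' sense is chosen even though it is not the predominant sense.
closest = word_distance("play", "actor", senses_of, toy_concept_distance)
```

Taking the minimum over sense pairs is what keeps the predominant sense of play from inflating the distance to actor.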
These
distributional
concept-distance
approaches
need
to
create
only
V
x
C
cooccurrence
and
C
x
C
distance
matrices
,
where
C
is
the
number
of
categories
or
senses
(
usually
about
1000
)
.
It
should
also
be
noted
that
unlike
the
best
WordNet-based
measures
,
distributional
measures
(
both
word-
and
concept-distance
ones
)
can
be
used
to
estimate
not
just
semantic
similarity
but
also
semantic
relatedness
—
useful
in
many
tasks
including
information
retrieval
.
However
,
the
high-quality
thesauri
and
(
to
a
much
greater
extent
)
WordNet-like
resources
that
these
methods
require
do
not
exist
for
most
of
the
3000-6000
languages
in
existence
today
and
they
are
costly
to
create
.
In
this
paper
,
we
introduce
cross-lingual
distributional
measures
of
concept-distance
,
or
simply
cross-lingual
measures
,
that
determine
the
distance
between
a
word
pair
belonging
to
a
resource-poor
language
using
a
knowledge
source
in
a
resource-rich
language
and
a
bilingual
lexicon
[3]
.
We
will
use
the
cross-lingual
measures
to
calculate
distances
between
German
words
using
an
English
thesaurus
and
a
German
corpus
.
Although
German
is
not
resource-poor
per
se
,
Gurevych
(
2005
)
has
observed
that
the
German
wordnet
GermaNet
(
Kunze
,
2004
)
(
about
60,000
synsets
)
is
less
developed
than
the
English
WordNet
(
Fellbaum
,
1998
)
(
about
117,000
synsets
)
with
respect
to
the
coverage
of
lexical
items
and
lexical
semantic
relations
represented
therein
.
On
the
other
hand
,
substantial
raw
corpora
are
available
for
the
German
language
.
Crucially
for
our
evaluation
,
the
existence
of
GermaNet
allows
comparison
of
our
cross-lingual
approach
with
monolingual
ones
.
2
Monolingual
Distributional
Measures
In
order
to
set
the
context
for
cross-lingual
concept-distance
measures
(
Section
3
)
,
we
first
summarize
monolingual
distributional
approaches
,
with
a
focus
on
distributional
concept-distance
measures
.
[Footnote 3: For most languages that have been the subject of academic study, there exists at least a bilingual lexicon mapping the core vocabulary of that language to a major world language and a corpus of at least a modest size.]
Words
that
occur
in
similar
contexts
tend
to
be
semantically
close
.
In
our
experiments
,
we
defined
the
context
of
a
target
word
,
its
co-occurring
words
,
to
be
±5
words
on
either
side
(
but
not
crossing
sentence
boundaries
)
.
The
set
of
contexts
of
a
target
word
is
usually
represented
by
the
strengths
of
association
of
the
target
with
its
co-occurring
words
,
which
we
refer
to
as
the
distributional
profile
(
DP
)
of
the
word
.
Here
is
a
constructed
example
DP
of
the
word
star
:
star: space 0.28, movie 0.2, famous 0.13, light 0.09, rich 0.04, ...
Simple
counts
are
made
of
how
often
the
target
word
co-occurs
with
other
words
in
text
and
how
often
the
words
occur
individually
.
A
suitable
statistic
,
such
as
pointwise
mutual
information
(
PMI
)
,
is
then
applied
to
these
counts
to
determine
the
strengths
of
association
between
the
target
and
co-occurring
words
.
The
distributional
profiles
of
two
target
words
represent
their
contexts
as
points
in
multidimensional
word-space
.
A
suitable
distributional
measure
(
for
example
,
cosine
)
gives
the
distance
between
the
two
points
,
and
thereby
an
estimate
of
the
semantic
distance
between
the
target
words
.
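As a rough illustration of the pipeline just described (window counts, PMI, cosine), consider the following sketch. It rests on several assumptions that the text does not fix: PMI is estimated from the co-occurrence table's own marginals, only positive PMI values are kept, and distance is taken as one minus cosine similarity. The toy sentences and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=5):
    """Count co-occurrences within +/- `window` tokens, inside each sentence only."""
    pair_counts = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[target][tokens[j]] += 1
    return pair_counts


def pmi_profile(target, pair_counts):
    """Distributional profile of `target`: positive PMI with each co-occurring word."""
    total = sum(sum(row.values()) for row in pair_counts.values())
    row_sum = {w: sum(row.values()) for w, row in pair_counts.items()}
    profile = {}
    for context, n in pair_counts.get(target, {}).items():
        pmi = math.log2(n * total / (row_sum[target] * row_sum[context]))
        if pmi > 0:  # assumption: discard non-positive associations
            profile[context] = pmi
    return profile


def cosine(p, q):
    """Cosine similarity between two sparse profiles stored as dicts."""
    dot = sum(v * q[k] for k, v in p.items() if k in q)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def cosine_distance(p, q):
    return 1.0 - cosine(p, q)


# Toy corpus, already tokenized into sentences (hypothetical data).
sentences = [
    ["star", "space", "light"],
    ["star", "movie", "famous"],
    ["sun", "space", "light"],
]
counts = cooccurrence_counts(sentences)
dp_star = pmi_profile("star", counts)
dp_sun = pmi_profile("sun", counts)
dp_movie = pmi_profile("movie", counts)
```

On this toy corpus, star and sun share the contexts space and light, so their profiles overlap, whereas movie and sun share none.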
In
Mohammad
and
Hirst
(
2006b
)
,
we
show
how
distributional
profiles
of
concepts
(
DPCs
)
can
be
used
to
measure
semantic
distance
.
Below
are
the
DPCs
or
DPs
of
two
senses
of
the
word
star
(
the
senses
or
concepts
themselves
are
glossed
by
a
set
of
near-synonymous
words
,
placed
in
parentheses
)
:
DPs of concepts
'celestial body' (celestial body, ...): ..., constellation 0.11, ...
'celebrity' (celebrity, ...): famous 0.24, movie 0.14, rich 0.14, ...
Thus
the
profiles
of
two
target
concepts
represent
their
contexts
as
points
in
multi-dimensional
wordspace
.
A
suitable
distributional
measure
(
for
example
,
cosine
)
can
then
be
used
to
give
the
distributional
distance
between
the
two
concepts
in
the
same
way
that
distributional
word-distance
is
measured
.
But
to
calculate
the
strength
of
association
of
a
concept
with
co-occurring
words
,
in
order
to
create
DPCs
,
we
must
determine
the
number
of
times
a
word
used
in
that
sense
co-occurs
with
surrounding
words
.
In
Mohammad
and
Hirst
(
2006a
)
,
we
proposed
a
way
to
determine
these
counts
without
the
use
of
sense-annotated
data
.
Briefly
,
a
word-category
co-occurrence
matrix
(
WCCM
)
is
created
having
English
word
types
$w^{en}$
as
one
dimension
and
English
thesaurus
categories
$c^{en}$
as
another
.
We
used
the
Macquarie
Thesaurus
(
Bernard
,
1986
)
both
as
a
very
coarse-grained
sense
inventory
and
a
source
of
possibly
ambiguous
English
words
that
together
unambiguously
represent
each
category
(
concept
)
.
The
WCCM
is
populated
with
co-occurrence
counts
from
a
large
English
corpus
(
we
used
the
British
National
Corpus
(
BNC
)
)
.
A particular cell $m_{ij}$, corresponding to word $w_i^{en}$ and concept $c_j^{en}$, is populated with the number of times $w_i^{en}$ co-occurs (in a window of ±5 words) with any word that has $c_j^{en}$ as one of its senses (i.e., $w_i^{en}$ co-occurs with any word listed under concept $c_j^{en}$ in the thesaurus).
This
matrix
,
created
after
a
first
pass
of
the
corpus
,
is
the
base
word-category
co-occurrence
matrix
(
base
WCCM
)
and
it
captures
strong
associations
between
a
sense
and
co-occurring
words
.
[4]
This
is
similar
to
how
Yarowsky
(
1992
)
identifies
words
that
are
indicative
of
a
particular
sense
of
the
target
.
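The first pass can be sketched as follows. This is a simplified illustration under stated assumptions: the thesaurus is reduced to a hypothetical mapping `categories_of` from a word to the categories that list it, the corpus is a list of tokenized sentences, and the window is clipped at sentence boundaries. Neither the data nor the function name comes from the original work.

```python
from collections import Counter, defaultdict


def build_base_wccm(sentences, categories_of, window=5):
    """Base word-category co-occurrence matrix.

    wccm[word][category] is the number of times `word` co-occurs, within
    +/- `window` tokens, with any word listed under `category`. An ambiguous
    neighbour increments the cells of all of its categories.
    """
    wccm = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                for category in categories_of.get(tokens[j], ()):
                    wccm[word][category] += 1
    return wccm


# Toy thesaurus and corpus (hypothetical data).
categories_of = {
    "star": ["celestial body", "celebrity"],
    "sun": ["celestial body"],
    "planet": ["celestial body"],
}
sentences = [
    ["space", "star"],
    ["space", "sun"],
    ["space", "planet"],
    ["famous", "star"],
]
base_wccm = build_base_wccm(sentences, categories_of)
```

Even though every occurrence of star adds to both of its categories, space accumulates more evidence for 'celestial body' because sun and planet contribute to that category only.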
We
know
that
words
that
occur
close
to
a
target
word
tend
to
be
good
indicators
of
its
intended
sense
.
Therefore
,
we
make
a
second
pass
of
the
corpus
,
using
the
base
WCCM
to
roughly
disambiguate
the
words
in
it
.
For each word, the strength of association of each of the words in its context (±5 words) with each of its senses is summed.
[Footnote 4: From the base WCCM we can determine the number of times a word w and concept c co-occur, the number of times w co-occurs with any concept, and the number of times c co-occurs with any word. A statistic such as PMI can then give the strength of association between w and c.]
The
sense
that
has
the
highest
cumulative
association
is
chosen
as
the
intended
sense
.
A new bootstrapped WCCM is created such that each cell $m_{ij}$, corresponding to word $w_i^{en}$ and concept $c_j^{en}$, is populated with the number of times $w_i^{en}$ co-occurs with any word used in sense $c_j^{en}$
.
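The second pass might look roughly like the sketch below. It is an illustration only: PMI is computed from the base WCCM marginals, unseen word-category pairs are given an association of zero, and ties between senses go to the first-listed sense; these are assumptions, as are all names and the toy data.

```python
import math
from collections import Counter, defaultdict


def pmi_from_wccm(wccm):
    """Return assoc(word, category): PMI estimated from the WCCM's own marginals."""
    total = sum(sum(row.values()) for row in wccm.values())
    word_total = {w: sum(row.values()) for w, row in wccm.items()}
    category_total = Counter()
    for row in wccm.values():
        category_total.update(row)

    def assoc(word, category):
        n = wccm.get(word, {}).get(category, 0)
        if n == 0:
            return 0.0  # assumption: no evidence counts as zero association
        return math.log2(n * total / (word_total[word] * category_total[category]))

    return assoc


def disambiguate(tokens, i, categories_of, assoc, window=5):
    """Choose the sense of tokens[i] with the highest summed association with its context."""
    candidates = categories_of.get(tokens[i], ())
    if not candidates:
        return None
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
    return max(candidates, key=lambda c: sum(assoc(w, c) for w in context))


def build_bootstrapped_wccm(sentences, categories_of, assoc, window=5):
    """Count co-occurrences of each word with the chosen sense of its neighbours only."""
    wccm = defaultdict(Counter)
    for tokens in sentences:
        chosen = [disambiguate(tokens, i, categories_of, assoc, window)
                  for i in range(len(tokens))]
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and chosen[j] is not None:
                    wccm[word][chosen[j]] += 1
    return wccm


# Toy base WCCM, thesaurus, and corpus (hypothetical data).
base = {
    "space": {"celestial body": 3, "celebrity": 1},
    "famous": {"celebrity": 3, "celestial body": 1},
}
categories_of = {"star": ["celestial body", "celebrity"]}
sentences = [["space", "star"], ["famous", "star"]]

assoc = pmi_from_wccm(base)
bootstrapped = build_bootstrapped_wccm(sentences, categories_of, assoc)
```

In the bootstrapped matrix each occurrence of star contributes to one category rather than to both, which removes the noise present in the base counts.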
Mohammad
and
Hirst
(
2006a
)
used
the
DPCs
created
from
the
bootstrapped
WCCM
to
attain
near-upper-bound
results
in
the
task
of
determining
word
sense
dominance
.
Unlike
the
McCarthy
et
al.
(
2004
)
dominance
system
,
our
approach
can
be
applied
to
much
smaller
target
texts
(
a
few
hundred
sentences
)
without
the
need
for
a
large
similarly-sense-distributed
text
[5]
.
In
Mohammad
and
Hirst
(
2006b
)
,
the
DPC-based
monolingual
distributional
measures
of
concept-distance
were
used
to
rank
word
pairs
by
their
semantic
similarity
and
to
correct
real-word
spelling
errors
,
attaining
markedly
better
results
than
monolingual
distributional
measures
of
word-distance
.
In
the
spelling
correction
task
,
the
distributional
concept-distance
measures
performed
better
than
all
WordNet-based
measures
as
well
,
except
for
the
Jiang
and
Conrath
(
1997
)
measure
.
3
Cross-lingual
Distributional
Measures
We
now
describe
how
distributional
measures
of
concept-distance
can
be
used
in
a
cross-lingual
framework
to
determine
the
distance
between
words
in
(
resource-poor
)
language
L1
by
combining
its
text
with
a
thesaurus
in
(
resource-rich
)
language
L2
,
using
an
L1-L2
bilingual
lexicon
.
We
will
compare
this
approach
with
the
best
monolingual
approaches
;
the
smaller
the
loss
in
performance
,
the
more
capable
the
algorithm
is
of
overcoming
ambiguities
in
word
translation
.
An
evaluation
,
therefore
,
requires
an
L1
that
in
actuality
has
adequate
knowledge
sources
.
Therefore
we
chose
German
to
stand
in
as
the
resource-poor
language
L1
and
English
as
the
resource-rich
L2
;
the
monolingual
evaluation
in
German
will
use
GermaNet
.
The
remainder
of
the
paper
describes
our
approach
in
terms
of
German
and
English
,
but
the
algorithm
itself
is
language
independent
.
[Footnote 5: The McCarthy et al. (2004) system needs to first generate a distributional thesaurus from the target text (if it is large enough, i.e., a few million words) or from another large text with a distribution of senses similar to the target text.]
Given a German word $w^{de}$ in context, we use a German-English bilingual lexicon to determine its different possible English translations. Each English translation $w^{en}$ may have one or more possible coarse senses, as listed in an English thesaurus. These English thesaurus concepts ($c^{en}$) will be referred to as cross-lingual candidate senses of the German word $w^{de}$ [6]. Figure 1 depicts examples [7].
As
in
the
monolingual
distributional
measures
,
the
distance
between
two
concepts
is
calculated
by
first
determining
their
DPs
.
However
,
in
the
cross-lingual
approach
,
a
concept
is
now
glossed
by
near-synonymous
words
in
an
English
thesaurus
,
whereas
its
profile
is
made
up
of
the
strengths
of
association
with
co-occurring
German
words
.
Here
are
constructed
example
cross-lingual
DPs
of
the
two
senses
of
star
:
Cross-lingual DPs of concepts
'celestial body' (celestial body, sun, ...): Raum 0.36, Licht 0.27, Konstellation 0.11, ...
'celebrity' (celebrity, hero, ...): berühmt 0.24, Film 0.14, reich 0.14, ...
In
order
to
calculate
the
strength
of
association
,
we
must
first
determine
individual
word
and
concept
counts
,
as
well
as
their
co-occurrence
counts
.
3.2
Cross-lingual
word-category
co-occurrence
matrix
We create a cross-lingual word-category co-occurrence matrix with German word types $w^{de}$ as one dimension and English thesaurus concepts $c^{en}$ as another.
[Footnote 6: Some of the cross-lingual candidate senses of $w^{de}$ might not really be senses of $w^{de}$ (e.g., 'celebrity', 'river bank', and 'judiciary' in Figure 1). However, as substantiated by experiments in Section 4, our algorithm is able to handle the added ambiguity.]
[Footnote 7: Vocabulary of German words needed to understand this discussion: Bank: 1. financial institution, 2. bench (furniture); berühmt: famous; Film: movie (motion picture); Himmelskörper: heavenly body; Konstellation: constellation; Licht: light; Morgensonne: morning sun; Raum: space; reich: rich; Sonne: sun; Star: star (celebrity); Stern: star (celestial body).]
[Figure 1: The cross-lingual candidate senses of German words Stern and Bank.]
[Figure 2: Words having 'celestial body' as one of their cross-lingual candidate senses.]
The
matrix
is
populated
with
co-occurrence
counts
from
a
large
German
corpus
;
we
used
the
newspaper
corpus
,
taz
[8]
(
Sep
1986
to
May
1999
;
240
million
words
)
.
A particular cell $m_{ij}$, corresponding to word $w_i^{de}$ and concept $c_j^{en}$, is populated with the number of times the German word $w_i^{de}$ co-occurs (in a window of ±5 words) with any German word having $c_j^{en}$ as one of its cross-lingual candidate senses. For
example
,
the
Raum
-
'
celestial
body
'
cell
will
have
the
sum
of
the
number
of
times
Raum
co-occurs
with
Himmelskörper
,
Sonne
,
Morgensonne
,
Star
,
Stern
,
and
so
on
(
see
Figure
2
)
.
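The cross-lingual first pass differs from the monolingual one only in how a word's candidate senses are obtained: through the bilingual lexicon rather than directly from the thesaurus. The sketch below illustrates this with a toy lexicon and thesaurus modelled loosely on the Stern and Bank examples; the dictionaries, their contents, and the function names are hypothetical stand-ins, not the BEOLINGUS or Macquarie data.

```python
from collections import Counter, defaultdict


def candidate_senses(german_word, translations, categories_of):
    """Cross-lingual candidate senses: all thesaurus categories of all translations."""
    senses = set()
    for english_word in translations.get(german_word, ()):
        senses.update(categories_of.get(english_word, ()))
    return senses


def build_crosslingual_base_wccm(sentences, translations, categories_of, window=5):
    """wccm[german_word][english_concept] = number of times the German word co-occurs
    with any German word that has the concept as a cross-lingual candidate sense."""
    wccm = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                for concept in candidate_senses(tokens[j], translations, categories_of):
                    wccm[word][concept] += 1
    return wccm


# Toy German-English lexicon and English thesaurus (hypothetical data).
translations = {
    "Stern": ["star"],
    "Sonne": ["sun"],
    "Bank": ["bank", "bench"],
}
categories_of = {
    "star": ["celestial body", "celebrity"],
    "sun": ["celestial body"],
    "bank": ["financial institution", "river bank"],
    "bench": ["furniture", "judiciary"],
}
german_sentences = [["Raum", "Stern"], ["Raum", "Sonne"]]
xl_wccm = build_crosslingual_base_wccm(german_sentences, translations, categories_of)
```

Here Bank acquires four candidate senses, some of which are not real senses of the German word; the bootstrapping pass is what is expected to filter these out.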
We
used
the
Macquarie
Thesaurus
(
Bernard
,
1986
)
(
about
98,000
words
)
for
our
experiments
.
The
possible
German
translations
of
an
English
word
were
taken
from
the
German-English
bilingual
lexicon
BEOLINGUS
[9]
(
about
265,000
entries
)
.
This
base
word-category
co-occurrence
matrix
(
base
WCCM
)
,
created
after
a
first
pass
of
the
corpus
captures
strong
associations
between
a
category
(
concept
)
and
co-occurring
words
.
For
example
,
even
though
we
increment
counts
for
both
Raum
-
'
celestial
body
'
and
Raum
-
'
celebrity
'
for
a
particular
instance
where
Raum
co-occurs
with
Star
,
Raum
will
co-occur
with
a
number
of
words
such
as
Himmelskörper
,
Sonne
,
and
Morgensonne
that
each
have
the
sense
of
celestial
body
in
common
(
see
Figure
2
)
,
whereas
all
their
other
senses
are
likely
different
and
distributed
across
the
set
of
concepts
.
Therefore
,
the
co-occurrence
count
of
Raum
and
'
celestial
body
'
will
be
relatively
higher
than
that
of
Raum
and
'
celebrity
'
.
As
in
the
monolingual
case
,
a
second
pass
of
the
corpus
is
made
to
disambiguate
the
(
German
)
words
in
it
.
For
each
word
,
the
strength
of
association
of
each
of
the
words
in
its
context
(
±5
words
)
with
each
of
its
cross-lingual
candidate
senses
is
summed
.
The
sense
that
has
the
highest
cumulative
association
with
co-occurring
words
is
chosen
as
the
intended
sense
.
A new bootstrapped WCCM is created by populating each cell $m_{ij}$, corresponding to word $w_i^{de}$ and concept $c_j^{en}$, with the number of times the German word $w_i^{de}$ co-occurs with any German word used in cross-lingual sense $c_j^{en}$.
A
statistic
such
as
PMI
is
then
applied
to
these
counts
to
determine
the
strengths
of
association
between
a
target
concept
and
co-occurring
words
,
giving
the
distributional
profile
of
the
concept
.
Following
the
ideas
described
above
,
Mohammad
et
al.
(
2007
)
created
Chinese-English
DPCs
from
Chinese
text
,
a
Chinese-English
bilingual
lexicon
,
and
an
English
thesaurus
.
They
used
these
DPCs
to
implement
an
unsupervised
naive
Bayes
word
sense
classifier
that
placed
first
among
all
unsupervised
systems
taking
part
in
the
Multilingual
Chinese-English
Lexical
Sample
Task
(
task
#5
)
of
SemEval-07
(
Jin
et
al.
,
2007
)
.
4
Evaluation
We
evaluated
the
newly
proposed
cross-lingual
distributional
measures
of
concept-distance
on
the
tasks
of
(
1
)
measuring
semantic
distance
between
German
words
and
ranking
German
word
pairs
according
to
semantic
distance
,
and
(
2
)
solving
German
'
Word
Power
'
questions
from
Reader
's
Digest. In order to compare results with state-of-the-art monolingual approaches we conducted experiments using GermaNet measures as well.
[Table 1: Distance measures used in our experiments. Column groups: (Cross-lingual) Distributional Measures; (Monolingual) GermaNet Measures, subdivided into Information Content-based and Lesk-like.]
[Table 2: Comparison of datasets used for evaluating semantic distance in German. Columns include Language, # subjects, and Correlation.]
The
specific
distributional
measures
[10]
and
GermaNet-based
measures
we
used
are
listed
in
Table
1
.
The
GermaNet
measures
are
of
two
kinds
:
(
1
)
information
content
measures
[11]
,
and
(
2
)
Lesk-like
measures
that
rely
on
n-gram
overlaps
in
the
glosses
of
the
target
senses
,
proposed
by
Gurevych
(
2005
)
[12]
.
The
cross-lingual
measures
combined
the
German
newspaper
corpus
taz
with
the
English
Macquarie
Thesaurus
using
the
German-English
bilingual
lexicon
BEOLINGUS
.
Multi-word
expressions
in
the
thesaurus
and
the
bilingual
lexicon
were
ignored
.
We
used
a
context
of
±5
words
on
either
side
of
the
target
word
for
creating
the
base
and
bootstrapped
WCCMs
.
No
syntactic
pre-processing
was
done
,
nor
were
the
words
stemmed
,
lemmatized
,
or
part-of-speech
tagged
.
4.1
Measuring
distance
in
word
pairs
A
direct
approach
to
evaluate
distance
measures
is
to
compare
them
with
human
judgments
.
Gurevych (2005) and Zesch et al. (2007) asked native German speakers to mark two different sets of German word pairs with distance values.
[Footnote 10: JSD and ASD calculate the difference in distributions of words that co-occur with the targets. Lindist (distributional measure) and LinGN (GermaNet measure) follow from Lin's (1998b) information-theoretic definition of similarity.]
[Footnote 11: Information content measures rely on finding the lowest common subsumer (lcs) of the target synsets in a hypernym hierarchy and using corpus counts to determine how specific or general this concept is. In general, the more specific the lcs is and the smaller the difference of its specificity with that of the target concepts, the closer the target concepts are.]
[Footnote 12: As GermaNet does not have glosses for synsets, Gurevych (2005) proposed a way of creating a bag-of-words-type pseudo-gloss for a synset by including the words in the synset and in synsets close to it in the network.]
Set
1
(
Gur65
)
consists
of
a
German
translation
of
the
English
Rubenstein
and
Goodenough
(
1965
)
dataset
.
It
has
65
noun-noun
word
pairs
.
Set
2
(
Gur350
)
is
a
larger
dataset
containing
350
word
pairs
made
up
of
nouns
,
verbs
,
and
adjectives
.
The
semantically
close
word
pairs
in
Gur65
are
mostly
synonyms
or
hypernyms
(
hyponyms
)
of
each
other
,
whereas
those
in
Gur350
have
both
classical
and
non-classical
relations
(
Morris
and
Hirst
,
2004
)
with
each
other
.
Details
of
these
semantic
distance
benchmarks
[13]
are
summarized
in
Table
2
.
Inter-subject
correlations
are
indicative
of
the
degree
of
ease
in
annotating
the
datasets
.
Word-pair
distances
determined
using
different
distance
measures
are
compared
in
two
ways
with
the
two
human-created
benchmarks
.
The
rank
ordering
of
the
pairs
from
closest
to
most
distant
is
evaluated
with
Spearman
's
rank
order
correlation
ρ
;
the
distance
judgments
themselves
are
evaluated
with
Pearson
's
correlation
coefficient
r.
The
higher
the
correlation
,
the
more
accurate
the
measure
is
.
Spearman
's
correlation
ignores
actual
distance
values
after
a
list
is
ranked
—
only
the
ranks
of
the
two
sets
of
word
pairs
are
compared
to
determine
correlation
.
On
the
other
hand
,
Pearson
's
coefficient
takes
into
account
actual
distance
values
.
So even if two lists are ranked the same, but one has distances between consecutively-ranked word-pairs more in line with human-annotations of distance than the other, then Pearson's coefficient will capture this difference.
[Footnote 13: The datasets are publicly available at: http://www.ukp.tu-darmstadt.de/data/semRelDatasets]
However
,
this
makes
Pearson
's
coefficient
sensitive
to
outlier
data
points
,
and
so
one
must
interpret
the
Pearson
correlations
with
caution
.
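The contrast between the two coefficients can be made concrete with a small, dependency-free sketch. The score lists below are invented for illustration; Spearman's coefficient is computed here as Pearson's coefficient over average ranks, which is the standard definition when ties may occur.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical human judgments and two measures that rank the pairs identically.
human = [1.0, 2.0, 3.0, 4.0]
measure_a = [0.1, 0.2, 0.3, 0.4]    # evenly spaced, like the human scores
measure_b = [0.1, 0.11, 0.12, 5.0]  # same ranking, but one outlying value

rho_a, rho_b = spearman(human, measure_a), spearman(human, measure_b)
r_a, r_b = pearson(human, measure_a), pearson(human, measure_b)
```

Both measures obtain a perfect rank correlation, but the outlier in `measure_b` pulls its Pearson coefficient well below that of `measure_a`, which is the sensitivity the text cautions about.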
Table
3
shows
the
results
.
[14]
Observe
that
on
both
datasets
and
by
both
measures
of
correlation
,
cross-lingual
measures
of
concept-distance
perform
not
just
as
well
as
the
best
monolingual
measures
,
but
in
fact
better
.
In
general
,
the
correlations
are
lower
for
Gur350
as
it
contains
cross-PoS
word
pairs
and
non-classical
relations
,
making
it
harder
to
judge
even
by
humans
(
as
shown
by
the
inter-annotator
correlations
for
the
datasets
in
Table
2
)
.
Considering
Spearman
's
rank
correlation
,
α-skew
divergence
and
Jensen-Shannon
divergence
perform
best
on
both
datasets
.
The
correlations
of
cosine
and
Lindist
are
not
far
behind
.
Amongst
the
monolingual
GermaNet
measures
,
radial
pseudo-gloss
performs
best
.
Considering
Pearson
's
correlation
,
Lindist
performs
best
overall
and
radial
pseudo-gloss
does
best
amongst
the
monolingual
measures
.
Thus
,
we
see
that
on
both
datasets
and
as
per
both
measures
of
correlation
,
the
cross-lingual
measures
perform
not
just
as
well
as
the
best
monolingual
measures
,
but
indeed
slightly
better
.
4.2
Solving
word
choice
problems
from
Reader
's
Digest
Issues
of
the
German
edition
of
Reader
's
Digest
include
a
word
choice
quiz
called
'
Word
Power
'
.
Each
question
has
one
target
word
and
four
alternative
words
or
phrases
;
the
objective
is
to
pick
the
alternative
that
is
most
closely
related
to
the
target
.
The
correct
answer
may
be
a
near-synonym
of
the
target
or
it
may
be
related
to
the
target
by
some
other
classical
or
non-classical
relation
(
usually
the
former
)
.
For
example
:
[15]
Duplikat (duplicate)
a. Einzelstück (single copy)
b. Doppelkinn (double chin)
c. Nachbildung (replica)
d. Zweitschrift (copy)
Our approach to evaluating distance measures follows that of Jarmasz and Szpakowicz (2003), who evaluated semantic similarity measures through their ability to solve synonym problems (80 TOEFL (Landauer and Dumais, 1997), 50 ESL (Turney, 2001), and 300 (English) Reader's Digest Word Power questions).
[Footnote 14: In Table 3, all values are statistically significant at the 0.01 level (2-tailed), except for the one in italic (0.212), which is significant at the 0.05 level (2-tailed).]
[Footnote 15: English translations are in parentheses.]
Turney
(
2006
)
used
a
similar
approach
to
evaluate
the
identification
of
semantic
relations
,
with
374
college-level
multiple-choice
word
analogy
questions
.
The
Reader
's
Digest
Word
Power
(
RDWP
)
benchmark
for
German
consists
of
1072
of
these
word-choice
problems
collected
from
the
January
2001
to
December
2005
issues
of
the
German-language
edition
(
Wallace
and
Wallace
,
2005
)
.
We
discarded
44
problems
that
had
more
than
one
correct
answer
,
and
20
problems
that
used
a
phrase
instead
of
a
single
term
as
the
target
.
The
remaining
1008
problems
form
our
evaluation
dataset
,
which
is
significantly
larger
than
any
of
the
previous
datasets
employed
in
a
similar
evaluation
.
We
evaluate
the
various
cross-lingual
and
monolingual
distance
measures
by
their
ability
to
choose
the
correct
answer
.
The
distance
between
the
target
and
each
of
the
alternatives
is
computed
by
a
measure
,
and
the
alternative
that
is
closest
is
chosen
.
If
two
or
more
alternatives
are
equally
close
to
the
target
,
then
the
alternatives
are
said
to
be
tied
.
If
one
of
the
tied
alternatives
is
the
correct
answer
,
then
the
problem
is
counted
as
correctly
solved
,
but
the
corresponding
score
is
reduced
.
We
assign
a
score
of
0.5
,
0.33
,
and
0.25
for
2
,
3
,
and
4
tied
alternatives
,
respectively
(
in
effect
approximating
the
score
obtained
by
randomly
guessing
one
of
the
tied
alternatives
)
.
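The tie-aware scoring rule can be written down compactly. The sketch below assumes that ties are detected by exact equality of distance values and uses 1/k for k tied alternatives, which matches the stated 0.5, 0.33, and 0.25 up to rounding; the function name and the distance values are hypothetical.

```python
def score_question(distances, correct):
    """Score one word-choice question.

    `distances` maps each alternative to its distance from the target.
    Returns 1/k if the correct answer is among the k closest (tied)
    alternatives, and 0 otherwise.
    """
    best = min(distances.values())
    tied = [alt for alt, d in distances.items() if d == best]
    return 1.0 / len(tied) if correct in tied else 0.0


# Hypothetical distances for the Duplikat question; 'Nachbildung' taken as correct.
untied = {"Einzelstück": 0.9, "Doppelkinn": 0.8, "Nachbildung": 0.1, "Zweitschrift": 0.3}
two_way_tie = {"Einzelstück": 0.9, "Doppelkinn": 0.8, "Nachbildung": 0.1, "Zweitschrift": 0.1}
wrong = {"Einzelstück": 0.05, "Doppelkinn": 0.8, "Nachbildung": 0.1, "Zweitschrift": 0.3}

scores = [score_question(d, "Nachbildung") for d in (untied, two_way_tie, wrong)]
```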
If
more
than
one
alternative
has
a
sense
in
common
with
the
target
,
then
the
thesaurus-based
cross-lingual
measures
will
mark
them
each
as
the
closest
sense
.
However
,
if
one
or
more
of
these
tied
alternatives
is
in
the
same
semicolon
group
of
the
thesaurus
[16]
as
the
target
,
then
only
these
are
chosen
as
the
closest
senses
.
The
German
RDWP
dataset
contains
many
phrases
that
cannot
be
found
in
the
knowledge
sources
(
GermaNet
or
Macquarie
Thesaurus
via
translation
list
)
.
In these cases, we remove stopwords (prepositions, articles, etc.) and split the phrase into component words.
[Footnote 16: Words in a thesaurus category are further partitioned into different paragraphs and each paragraph into semicolon groups. Words within a semicolon group are more closely related than those in semicolon groups of the same paragraph or category.]
[Table 3: Correlations of distance measures with human judgments.]
[Table 4: Performance of distance measures on word choice problems (Reader's Digest Word Power benchmark). Columns: Measure, Att. (Attempted), Cor. (Correct), Ties, Score, P, R, F; rows grouped into Monolingual and Cross-lingual.]
As
German
words
in
a
phrase
can
be
highly
inflected
,
we
lemmatize
all
components
.
For
example
,
the
target
'
imaginär' (imaginary) has 'nur in der Vorstellung vorhanden' ('exists only in the imagination')
as
one
of
its
alternatives
.
The
phrase
is
split
into
its
component
words
nur
,
Vorstellung
,
and
vorhanden
.
We
compute
semantic
distance
between
the
target
and
each
phrasal
component
and
select
the
minimum
value
as
the
distance
between
target
and
potential
answer
.
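A minimal sketch of this phrase-handling step follows. The stopword list, the identity lemmatizer, and the toy distance table are placeholders, since the text does not name the lemmatizer or stopword list actually used; `phrase_distance` is likewise a hypothetical name.

```python
def phrase_distance(target, phrase, word_distance, stopwords, lemmatize=lambda w: w):
    """Distance between a target word and a phrasal alternative.

    Stopwords are removed, the remaining components are lemmatized, and the
    minimum target-component distance is returned. Components for which no
    distance is available (None) are skipped.
    """
    components = [lemmatize(w) for w in phrase.split() if w.lower() not in stopwords]
    distances = [word_distance(target, w) for w in components]
    distances = [d for d in distances if d is not None]
    return min(distances) if distances else None


# Toy stopword list and distances (hypothetical values).
stopwords = {"in", "der"}
toy_table = {"nur": 0.9, "Vorstellung": 0.2, "vorhanden": 0.7}


def toy_word_distance(target, word):
    return toy_table.get(word)


d = phrase_distance("imaginär", "nur in der Vorstellung vorhanden",
                    toy_word_distance, stopwords)
```

With these toy values the component Vorstellung determines the distance between the target and the phrase.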
Table
4
presents
the
results
obtained
on
the
German
RDWP
benchmark
for
both
monolingual
and
cross-lingual
measures
.
Only
those
questions
for
which
the
measures
have
some
distance
information
are
attempted
;
the
column
'
Att
.
'
shows
the
number
of
questions
attempted
by
each
measure
,
which
is
the
maximum
score
that
the
measure
can
hope
to
get
.
Observe
that
the
thesaurus-based
cross-lingual
measures
have
a
much
larger
coverage
than
the
GermaNet-based
monolingual
measures
.
The
cross-lingual
measures
have
a
much
larger
number
of
correct
answers
too
(
column
'
Cor
.
'
)
,
but
this
number
is
bloated
due
to
the
large
number
of
ties
.
[17]
'
Score
'
is
the
score
each
measure
gets
after
it
is
penalized
for
the
ties
.
The
cross-lingual
measures
Cos
,
JSD
,
and
Lindist
obtain
the
highest
scores
.
But 'Score' by itself does not present the complete picture either as, given the scoring scheme, a measure that attempts more questions may get a higher score just from random guessing.
[Footnote 17: We see more ties when using the cross-lingual measures because they rely on the Macquarie Thesaurus, a very coarse-grained sense inventory (around 800 categories), whereas the monolingual measures operate on the fine-grained GermaNet.]
We
therefore
present
precision
,
recall
,
and
F-scores
(
P
=
Score
/
Att
;
R
=
Score
/
1008
;
F
=
2
x
P
x
R
/
(
P
+
R
)
)
.
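The three quantities follow directly from the definitions just given. The helper below is a small sketch; the example numbers are invented and are not taken from Table 4.

```python
def precision_recall_f(score, attempted, total=1008):
    """P = Score / Att, R = Score / total, F = 2PR / (P + R)."""
    precision = score / attempted if attempted else 0.0
    recall = score / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical measure: tie-penalized score of 300 on 600 attempted questions out of 1000.
p, r, f = precision_recall_f(score=300, attempted=600, total=1000)
```

A measure that attempts more questions raises recall but not necessarily precision, which is why the F-score is used for the overall comparison.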
Observe
that
the
cross-lingual
measures
have
a
higher
coverage
(
recall
)
than
the
monolingual
measures
but
lower
precision
.
The
F
scores
show
that
the
best
cross-lingual
measures
do
slightly
better
than
the
best
monolingual
ones
,
despite
the
large
number
of
ties
.
The
measures
of
Cos
,
JSD
,
and
Lindist
remain
the
best
cross-lingual
measures
,
whereas
HPG
and
RPG
are
the
best
monolingual
ones
.
5
Conclusion
We
have
proposed
a
new
method
to
determine
semantic
distance
in
a
possibly
resource-poor
language
by
combining
its
text
with
a
knowledge
source
in
a
different
,
preferably
resource-rich
,
language
.
Specifically
,
we
combined
German
text
with
an
English
thesaurus
to
create
cross-lingual
distributional
profiles
of
concepts
—
the
strengths
of
association
between
English
thesaurus
senses
(
concepts
)
of
German
words
and
co-occurring
German
words
—
using
a
German-English
bilingual
lexicon
and
a
bootstrapping
algorithm
designed
to
overcome
ambiguities
of
word-senses
and
translations
.
Notably
,
we
do
so
without
the
use
of
sense-annotated
text
or
word-aligned
parallel
corpora
.
We
did
not
parse
or
chunk
the
text
,
nor
did
we
stem
,
lemmatize
,
or
part-of-speech-tag
the
words
.
We
used
the
cross-lingual
DPCs
to
estimate
semantic
distance
by
developing
new
cross-lingual
distributional
measures
of
concept-distance
.
These
measures
are
like
the
distributional
measures
of
concept-distance
(
Mohammad
and
Hirst
,
2006a
,
2006b
)
,
except
they
can
determine
distance
between
words
in
one
language
using
a
thesaurus
in
a
different
language
.
We
evaluated
the
cross-lingual
measures
against
the
best
monolingual
ones
operating
on
a
WordNet-like
resource
,
GermaNet
,
through
an
extensive
set
of
experiments
on
two
different
German
semantic
distance
benchmarks
.
In
the
process
,
we
compiled
a
large
German
benchmark
of
Reader
's
Digest
word
choice
problems
suitable
for
evaluating
semantic-relatedness
measures
.
Most
previous
semantic
distance
benchmarks
are
either
much
smaller
or
cater
primarily
to
semantic
similarity
measures
.
Even
with
the
added
ambiguity
of
translating
words
from
one
language
to
another
,
the
cross-lingual
measures
performed
better
than
the
best
monolingual
measures
on
both
the
word-pair
task
and
the
Reader
's
Digest
word-choice
task
.
Further
,
in
the
word-choice
task
,
the
cross-lingual
measures
achieved
a
significantly
higher
coverage
than
the
monolingual
measure
.
The
richness
of
English
resources
seems
to
have
a
major
impact
,
even
though
German
,
with
GermaNet
,
a
well-established
resource
,
is
in
a
better
position
than
most
other
languages
.
This
is
indeed
promising
,
because
achieving
broad
coverage
for
resource-poor
languages
remains
an
important
goal
as
we
integrate
state-of-the-art
approaches
in
natural
language
processing
into
real-life
applications
.
These
results
show
that
our
algorithm
can
successfully
combine
German
text
with
an
English
thesaurus
using
a
bilingual
German-English
lexicon
to
obtain
state-of-the-art
results
in
measuring
semantic
distance
.
These
results
also
support
the
broader
and
far-reaching
claim
that
natural
language
problems
in
a
resource-poor
language
can
be
solved
using
a
knowledge
source
in
a
resource-rich
language
(
e.g.
,
Cucerzan
and
Yarowsky
's
(
2002
)
cross-lingual
PoS
tagger
)
.
Our
future
work
will
explore
other
tasks
such
as
information
retrieval
and
text
categorization
.
Cross-lingual
DPCs
also
have
tremendous
potential
in
tasks
inherently
involving
more
than
one
language
,
such
as
machine
translation
and
multi-language
multi-document
summarization
.
We
believe
that
the
future
of
natural
language
processing
lies
not
in
standalone
monolingual
systems
but
in
those
that
are
powered
by
automatically
created
multilingual
networks
of
information
.
Acknowledgments
We
thank
Philip
Resnik
,
Michael
Demko
,
Suzanne
Stevenson
,
Frank
Rudicz
,
Afsaneh
Fazly
,
and
Afra
Alishahi
for
helpful
discussions
.
This
research
is
financially
supported
by
the
Natural
Sciences
and
Engineering
Research
Council
of
Canada
,
the
University
of
Toronto
,
the
German
Research
Foundation
under
the
grant
"
Semantic
Information
Retrieval
"
(
SIR
)
,
GU
798
/
1-2
.
