Many
systems
for
tasks
such
as
question
answering
,
multi-document
summarization
,
and
information
retrieval
need
robust
numerical
measures
of
lexical
relatedness
.
Standard
thesaurus-based
measures
of
word
pair
similarity
are
based
on
only
a
single
path
between
those
words
in
the
thesaurus
graph
.
By
contrast
,
we
propose
a
new
model
of
lexical
semantic
relatedness
that
incorporates
information
from
every
explicit
or
implicit
path
connecting
the
two
words
in
the
entire
graph
.
Our
model
uses
a
random
walk
over
nodes
and
edges
derived
from
WordNet
links
and
corpus
statistics
.
We
treat
the
graph
as
a
Markov
chain
and
compute
a
word-specific
stationary
distribution
via
a
generalized
PageRank
algorithm
.
Semantic
relatedness
of
a
word
pair
is
scored
by
a
novel
divergence
measure
,
ZKL
,
that
outperforms
existing
measures
on
certain
classes
of
distributions
.
In
our
experiments
,
the
resulting
relatedness
measure
is
the
WordNet-based
measure
most
highly
correlated
with
human
similarity
judgments
by
rank
ordering
at
ρ = .90
.
1
Introduction
Several
kinds
of
Natural
Language
Processing
systems
need
measures
of
semantic
relatedness
for
arbitrary
word
pairs
.
For
example
,
document
summarization
and
question
answering
systems
often
use
similarity
scores
to
evaluate
candidate
sentence
alignments
,
and
information
retrieval
systems
use
relatedness
scores
for
query
expansion
.
Several
popular
algorithms
calculate
scores
from
information
contained
in
WordNet
(
Fellbaum
,
1998
)
,
an
electronic
dictionary
where
word
senses
are
explicitly
connected
by
zero
or
more
semantic
relationships
.
The
central
challenge
of
these
algorithms
is
to
compute
reasonable
relatedness
scores
for
arbitrary
word
pairs
given
that
few
pairs
are
directly
connected
.
Most
pairs
in
WordNet
share
no
direct
semantic
link
,
and
for
some
the
shortest
connecting
path
can
be
surprising
—
even
pairs
that
seem
intuitively
related
,
such as
"
furnace
"
and
"
stove
"
share
a
lowest
common
ancestor
in
the
hypernymy
taxonomy
(
is-a
links
)
all
the
way
up
at
"
artifact
"
(
a
man-made
object
)
.
Several
existing
algorithms
compute
relatedness
only
by
traversing
the
hypernymy
taxonomy
and
find
that
"
furnace
"
and
"
stove
"
are
relatively
unrelated
.
However
,
WordNet
provides
other
types
of
semantic
links
in
addition
to
hypernymy
,
such
as
meronymy
(
part
/
whole
relationships
)
,
antonymy
,
and
verb
entailment
,
as
well
as
implicit
links
defined
by
overlap
in
the
text
of
definitional
glosses
.
These
links
can
provide
valuable
relatedness
information
.
If
we
assume
that
relatedness
is
transitive
across
a
wide
variety
of
such
links
,
then
it
is
natural
to
follow
paths
such
as
furnace–crematory–gas oven–oven–kitchen appliance–stove
and
find
a
higher
degree
of
relatedness
between
"
furnace
"
and
"
stove
.
"
This
paper
presents
the
application
of
random
walk
Markov
chain
theory
to
measuring
lexical
semantic
relatedness
.
A
graph
of
words
and
concepts
is
constructed
from
WordNet
.
The
random
walk
model
posits
the
existence
of
a
particle
that
roams
this
graph
by
stochastically
following
local
semantic
relational
links
.
The
particle
is
biased
toward
exploring
the
neighborhood
around
a
target
word
,
and
is
allowed
to
roam
until
the
proportion
of
time
it
visits
each
node
in
the
limit
converges
to
a
stationary
distribution
.
In
this
way
we
can
compute
distinct
,
word-specific
probability
distributions
over
how
often
a
particle
visits
all
other
nodes
in
the
graph
when
"
starting
"
from
a
specific
word
.
We
compute
the
relatedness
of
two
words
as
the
similarity
of
their
stationary
distributions
.
The
random
walk
brings
with
it
two
distinct
advantages
.
First
,
it
enables
the
similarity
measure
to
have
a
principled
means
of
combination
of
multiple
types
of
edges
from
WordNet
.
Second
,
by
traversing
all
links
,
the
walk
aggregates
local
similarity
statistics
across
the
entire
graph
.
The
similarity
scores
produced
by
our
method
are
,
to
our
knowledge
,
the
WordNet-based
scores
most
highly
correlated
with
human
judgments
.
Proceedings
of
the
2007
Joint
Conference
on
Empirical
Methods
in
Natural
Language
Processing
and
Computational
Natural
Language
Learning
,
pp.
581-589
,
Prague
,
June
2007
.
©
2007
Association
for
Computational
Linguistics
2
Related
work
Budanitsky
and
Hirst
(
2006
)
provide
a
survey
of
many
WordNet-based
measures
of
lexical
similarity
based
on
paths
in
the
hypernym
taxonomy
.
As
an
example
,
one
of
the
best
performing
is
the
measure
proposed
by
Jiang
and
Conrath
(
1997
)
(
similar
to
the
one
proposed
by
(
Lin
,
1991
)
)
,
which
finds
the
shortest
path
in
the
taxonomic
hierarchy
between
two
candidate
words
before
computing
similarity
as
a
function
of
the
information
content
of
the
two
words
and
their
lowest
common
subsumer
in
the
hierarchy
.
We
note
the
distinction
between
word
similarity
and
word
relatedness
.
Similarity
is
a
special
case
of
relatedness
in
that
related
words
such
as
"
cat
"
and
"
fur
"
share
some
semantic
relationships
(
such
as
meronymy
)
,
but
do
not
express
the
same
likeness
of
form
as
would
similar
words
such
as
"
cat
"
and
"
lion
.
"
The
Jiang-Conrath
measure
and
most
other
measures
that
primarily
make
use
of
hypernymy
(
is-a
links
)
in
the
WordNet
graph
are
better
categorized
as
measures
of
similarity
than
of
relatedness
.
Other
measures
have
been
proposed
that
utilize
the
text
in
WordNet
's
definitional
glosses
,
such
as
Extended
Lesk
(
Banerjee
and
Pedersen
,
2003
)
and
later
the
Gloss
Vectors
(
Patwardhan
and
Pedersen
,
2006
)
method
.
These
approaches
are
primarily
based
on
comparing
the
"
bag
of
words
"
of
two
synsets
'
gloss
text
concatenated
with
the
text
of
neighboring
words
'
glosses
in
the
taxonomy
.
As
a
result
,
these
gloss-based
methods
measure
relatedness
.
Our
model
captures
some
of
this
relatedness
information
by
including
weighted
links
based
on
gloss
text
.
A
variety
of
other
measures
of
semantic
relatedness
have
been
proposed
,
including
distributional
similarity
measures
based
on
co-occurrence
in
a
body
of
text
—
see
(
Weeds
and
Weir
,
2005
)
for
a
survey
.
Other
measures
make
use
of
alternative
structured
information
resources
than
WordNet
,
such
as
Roget
's
thesaurus
(
Jarmasz
and
Szpakowicz
,
2003
)
.
More
recently
,
measures
incorporating
information
from
Wikipedia
(
Gabrilovich
and
Markovitch
,
2007
;
Strube
and
Ponzetto
,
2006
)
have
reported
stronger
results
on
some
tasks
than
have
been
achieved
by
existing
measures
based
on
shallower
lexical
resources
.
The
results
of
our
algorithm
are
competitive
with
some
Wikipedia
algorithms
while
using
only
WordNet
2.1
as
the
underlying
lexical
resource
.
The
approach
presented
here
is
generalizable
to
construction
from
any
underlying
semantic
resource
.
PageRank
is
the
most
well-known
example
of
a
random
walk
Markov
chain
—
see
(
Berkhin
,
2005
)
for
a
survey
.
It
uses
the
local
hyperlink
structure
of
the
web
to
define
a
graph
which
it
walks
to
aggregate
popularity
information
for
different
pages
.
Recent
work
has
applied
random
walks
to
NLP
tasks
such
as
PP
attachment
(
Toutanova
et
al.
,
2004
)
,
word
sense
disambiguation
(
Mihalcea
,
2005
;
Tarau
et
al.
,
2005
)
,
and
query
expansion
(
Collins-Thompson
and
Callan
,
2005
)
.
However
,
to
our
knowledge
,
the
literature
in
NLP
has
only
considered
using
one
stationary
distribution
per
specially-constructed
graph
as
a
probability
estimator
.
In
this
paper
,
we
introduce
a
measure
of
semantic
relatedness
based
on
the
divergence
of
the
distinct
stationary
distributions
resulting
from
random
walks
centered
at
different
positions
in
the
word
graph
.
We
believe
we
are
the
first
to
define
such
a
measure
.
3
Random
walks
on
WordNet
Our
model
is
based
on
a
random
walk
of
a
particle
through
a
simple
directed
graph
G
=
(
V
,
E
)
whose
nodes
V
and
edges
E
are
extracted
from
WordNet
version
2.1
.
Formally, we define the probability $n_i^{(t)}$ of finding the particle at node $n_i \in V$ at time $t$ as the sum of all ways in which the particle could have reached $n_i$ from any other node at the previous time-step:

$$n_i^{(t)} = \sum_{n_j \in V} n_j^{(t-1)} \, P(n_i \mid n_j)$$

where $P(n_i \mid n_j)$ is the conditional probability of moving to $n_i$ given that the particle is at $n_j$. In particular, we construct the transition distribution such that $P(n_i \mid n_j) > 0$ whenever WordNet specifies a local link relationship of the form $j \to i$.
Note
that
this
random
walk
is
a
Markov
chain
because
the
transition
probabilities
at
time
t
are
independent
of
the
particle
's
past
trajectory
.
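The propagation rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the three-node graph, its node names, and the `step` helper are hypothetical and are not taken from WordNet or from the authors' implementation.

```python
from collections import defaultdict


def step(prob, transitions):
    """Advance the walk one time-step.

    prob[j] is the probability of being at node j at time t-1, and
    transitions[j][i] is P(i | j). The result maps each node i to
    sum_j prob[j] * P(i | j).
    """
    new = defaultdict(float)
    for j, mass in prob.items():
        for i, p_ij in transitions.get(j, {}).items():
            new[i] += mass * p_ij
    return dict(new)


# Hypothetical toy chain a <-> b <-> c (not real WordNet nodes).
transitions = {
    "a": {"b": 1.0},
    "b": {"a": 0.5, "c": 0.5},
    "c": {"b": 1.0},
}

after_one = step({"a": 1.0}, transitions)   # all mass moves to b
after_two = step(after_one, transitions)    # mass splits between a and c
```

Because `step` depends only on the current distribution and the fixed transition table, the process has the Markov property noted in the text.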
The
subsections
that
follow
present
the
construction
of
the
graph
for
our
random
walk
from
WordNet
and
the
mathematics
of
computing
the
stationary
distribution
for
a
given
word
.
3.1
Graph
Construction
WordNet
is
itself
a
graph
over
synsets
.
A
synset
is
best
thought
of
as
a
concept
evoked
by
one
sense
of
one
or
more
words
.
For
instance
,
different
senses
of
the
word
"
bank
"
take
part
in
different
synsets
(
e.g.
a
river
bank
versus
a
financial
institution
)
,
and
a
single
synset
can
be
represented
by
multiple
synonymous
words
,
such
as
"
middle
"
and
"
center
.
"
WordNet
explicitly
marks
semantic
relationships
between
synsets
,
but
we
are
additionally
interested
in
representing
relatedness
between
words
.
We
therefore
extract
the
following
types
of
nodes
from
WordNet
:
Synset
Each
WordNet
synset
has
a
corresponding
node
.
For
example
,
one
node
corresponds
to
the
synset
referred
to
by
"
dog#n#3
,
"
the
third
sense
of
dog
as
noun
,
whose
meaning
is
"
an
informal
term
for
a
man
.
"
There
are
117,597
Synset
nodes
.
TokenPOS
One
node
is
allocated
to
every
word
coupled
with
a
part
of
speech
,
such
as
"
dog#n
"
meaning
dog
as
a
noun
.
These
nodes
link
to
all
the
synsets
they
participate
in
,
so
that
"
dog#n
"
links to
the
Synset
nodes
for
canine
,
hound
,
hot
dog
,
etc.
Collocations
—
multi-word
expressions
such
as
"
hot
dog
"
—
that
take
part
in
a synset
are
also
represented
by
these
nodes
.
There
are
156,588
TokenPOS
nodes
.
Token
Every
TokenPOS
is
connected
to
a
Token
node
corresponding
to
the
word
when
no
part
of
speech
information
is
present
.
For
example
,
"
dog
"
links
to
"
dog#n
"
and
"
dog#v
"
(
meaning
"
to
chase
"
)
.
There
are
148,646
Token
nodes
.
Synset
nodes
are
connected
with
edges
corresponding
to
many
of
the
relationship
types
in
WordNet
.
We
use
these
WordNet
relationships
to
form
edges
:
hypernym
/
hyponym
,
instance
/
instance
of
,
all
holonym
/
meronym
links
,
antonym
,
entails
/
entailed
by
,
adjective
satellite
,
causes
/
caused
by
,
participle
,
pertains
to
,
derives
/
derived
from
,
attribute
/
has
attribute
,
and
topical
(
but
not
regional
or
usage
)
domain
links
.
By
construction
,
each
edge
created
from
a
WordNet
relationship
is
guaranteed
to
have
a
corresponding
edge
in
the
opposite
direction
.
Edges
that
connect
a
TokenPOS
to
the
Synsets
using
it
are
weighted
based
on
a
Bayesian
estimate
drawn
from
the
SemCor
frequency
counts
included
in
WordNet
but
with
a
non-uniform
Dirichlet
prior
.
Our
edge
weights
are
the
SemCor
frequency
counts
for
each
target
Synset
,
with
pseudo-counts of 0.1 for all Synsets, 1 for the first sense of each word, and 0.1 for the first word in each Synset.
Intuitively
,
this
causes
the
particle
to
have
a
higher
probability
of
moving
to
more
common
senses
of
a
TokenPOS
;
for
example
,
the
edges
from
"
dog#n
"
to
"
dog#n#1
"
(
canine
)
and
"
dog#n#5
"
(
hotdog
)
have
un-normalized
weights
of
43.2
and
0.1
,
respectively
.
The
edges
connecting
a
Token
to
the
TokenPOS
nodes
in
which
it
can
occur
are
also
weighted
by
the
sum
of
the
weights
of
the
outgoing
TokenPOS →
Synset
links
.
Hence
a
walk
starting
at
a
common
word
like
"
cat
"
is
far
more
likely
to
follow
a
link
to
"
cat#n
"
than
to
rarities
like
"
cat#v
"
(
to
vomit
)
.
These
edges
are
uni-directional
;
no
edges
are
created
from
a
Synset
to
a
TokenPOS
that
can
represent
the
Synset
.
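The weighting scheme for TokenPOS → Synset and Token → TokenPOS edges can be sketched as follows. The function names are hypothetical, and the SemCor count of 42 for "dog#n#1" is an assumption chosen only because it reproduces the un-normalized weight of 43.2 quoted above; the real count is not given in the text.

```python
def tokenpos_to_synset_weight(semcor_count, is_first_sense, is_first_word):
    """Un-normalized weight of a TokenPOS -> Synset edge.

    Pseudo-counts follow the text: 0.1 for every Synset, 1 for the first
    sense of the word, and 0.1 when the word is first in the Synset.
    """
    weight = semcor_count + 0.1
    if is_first_sense:
        weight += 1.0
    if is_first_word:
        weight += 0.1
    return weight


def token_to_tokenpos_weight(synset_weights):
    """A Token -> TokenPOS edge is weighted by the sum of the TokenPOS's
    outgoing TokenPOS -> Synset weights."""
    return sum(synset_weights)


def normalize(weights):
    """Turn raw outgoing weights into a probability distribution."""
    total = sum(weights.values())
    return {target: w / total for target, w in weights.items()}


# Only two of the senses of "dog#n" are shown; the count 42 is assumed.
dog_n = {
    "dog#n#1": tokenpos_to_synset_weight(42, True, True),    # canine
    "dog#n#5": tokenpos_to_synset_weight(0, False, False),   # hotdog
}
probs = normalize(dog_n)
dog_token_edge = token_to_tokenpos_weight(dog_n.values())
```

With these weights a particle at "dog#n" moves to the canine sense almost every time, which is the behavior the text describes.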
In
order
for
our
graph
construction
to
incorporate
textual
gloss-based
information
,
we
also
create
unidirectional
edges
from
Synset
nodes
to
the
TokenPOS
nodes
for
the
words
and
collocations
used
in
that
synset
's
gloss
definition
.
This
requires
part-of-speech
tagging
the
glosses
,
for
which
we
use
the
Stanford
maximum
entropy
tagger
(
Toutanova
et
al.
,
2003
)
.
It
is
important
to
correctly
weight
these
edges
,
because
high-frequency
stop-words
such
as
"
by
"
and
"
he
"
do
not
convey
much
information
and
might
serve
only
to
smear
the
probability
mass
across
the
whole
graph
.
Gloss-based
links
to
these
nodes
should
therefore
be
down-weighted
or
removed
.
On
the
other
hand
,
up-weighting
extremely
rare
words
such
as
by
tf-idf
scoring
might
also
be
inappropriate
because
such
rare
words
would
get
extremely
high
scores
,
which
is
an
undesirable
trait
in
similarity
search
.
Haveliwala et al. (2002)
and
others
have
shown
that
a
"
nonmonotonic
document
frequency
"
(
NMDF
)
weighting
can
be
more
effective
in
such
a
setting
.
Because
the
frequency
of
words
in
the
glosses
is
distributed
by
a
power-law
,
we
weight
each
word
by
its
distance
from
the
mean
word
count
in
log
space
.
Formally, the weight $w_i$ for a word appearing $r_i$ times is

$$w_i = \exp\left(-\frac{(\log r_i - \mu)^2}{2\sigma^2}\right)$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the logs of all word counts.
This
is
a
smooth
approximation
to
the
high
and
low
frequency
stop
lists
used
effectively
by
other
measures
such
as
(
Patwardhan
and
Pedersen
,
2006
)
.
We
believe
that
because
non-monotonic
frequency
scaling
has
no
parameters
and
is
data-driven
,
it
could
stand
to
be
more
widely
adopted
among
gloss-based
lexical
similarity
measures
.
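A minimal sketch of non-monotonic document frequency weighting, assuming the weight is a Gaussian bump centered on the mean log count (one reading of "distance from the mean word count in log space"). The word counts below are invented for illustration and `nmdf_weights` is a hypothetical helper name.

```python
import math


def nmdf_weights(counts):
    """Weight each word by how close its log count is to the mean log count.

    Assumed form: w = exp(-(log r - mu)^2 / (2 * sigma^2)), where mu and
    sigma are the mean and standard deviation of the log counts.
    """
    logs = {word: math.log(r) for word, r in counts.items()}
    mu = sum(logs.values()) / len(logs)
    var = sum((x - mu) ** 2 for x in logs.values()) / len(logs)
    if var == 0.0:
        # Every word has the same count, so nothing is down-weighted.
        return {word: 1.0 for word in counts}
    return {word: math.exp(-((x - mu) ** 2) / (2.0 * var))
            for word, x in logs.items()}


# Invented gloss counts: a stop word, a mid-frequency word, a very rare word.
counts = {"by": 10000, "skilled": 100, "cagliostro": 1}
weights = nmdf_weights(counts)
```

Both the very frequent and the very rare word receive lower weight than the mid-frequency word, which is the smooth analogue of applying high- and low-frequency stop lists.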
We
also
add
bi-directional
edges
between
Synsets
whose
word
senses
overlap
with
a
common
TokenPOS
.
These
edges
have
raw
weights
given
by
the
number
of
TokenPOS
nodes
shared
by
the
Synsets
.
The
intuition
behind
adding
these
edges
is
that
WordNet
often
divides
the
meanings
of
words
into
fine-grained
senses
with
similar
meanings
,
so
there
is
likely
to
be
some
semantic
relationship
between
Synsets
sharing
a
common
TokenPOS
.
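The Synset overlap edges can be built by inverting the Synset membership table. The sketch below assumes a hypothetical mapping from Synset identifiers to the TokenPOS nodes that express them; the identifiers `s1` to `s3` and their members are made up and do not reflect actual WordNet contents.

```python
from collections import defaultdict
from itertools import combinations


def overlap_edges(synset_members):
    """Return bi-directional Synset-Synset edges.

    The raw weight of an edge is the number of TokenPOS nodes shared
    by the two Synsets.
    """
    synsets_by_tokenpos = defaultdict(set)
    for synset, members in synset_members.items():
        for tokenpos in members:
            synsets_by_tokenpos[tokenpos].add(synset)

    weights = defaultdict(int)
    for synsets in synsets_by_tokenpos.values():
        for a, b in combinations(sorted(synsets), 2):
            weights[(a, b)] += 1   # edge a -> b
            weights[(b, a)] += 1   # matching edge b -> a
    return dict(weights)


# Hypothetical membership table, for illustration only.
synset_members = {
    "s1": {"middle#n", "center#n"},
    "s2": {"middle#n", "center#n", "heart#n"},
    "s3": {"heart#n"},
}
edges = overlap_edges(synset_members)
```

Grouping by TokenPOS first avoids comparing every pair of Synsets, which matters for a graph with more than a hundred thousand Synset nodes.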
The
final
graph
has
422,831
nodes
and
5,133,281
edges
.
This
graph
is
very
sparse
;
fewer
than
1
in
10,000
node
pairs
are
directly
connected
.
When
only
the
unweighted
WordNet
relationship
edges
are
considered
,
the
largest
degree
of
any
node
is
"
city#n#1
"
with
667
edges
(
mostly
connecting
to
particular
cities
)
,
followed
by
"
law#n#2
"
with
602
edges
(
mostly
connecting
to
a
large
number
of
domain
terms
such
as
"
dissenting
opinion
"
and
"
freedom
of
speech
"
)
,
and
each
node
is
on
average
connected
to
1.7
other
nodes
.
When
the
gloss-based
edges
are
considered
separately
,
the
highest
degree
nodes
are
those
with
the
longest
definitions
;
the
maximum
out-degree
is
56
and
the
average
out-degree
is
6.2
.
For
the
edges
linking
TokenPOS
nodes
to
the
Synsets
in
which
they
participate
,
TokenPOS
nodes
with
many
senses
are
the
most
connected
;
"
break#v
"
with
59
outgoing
edges
and
"
make#v
"
with
49
outgoing
edges
have
the
highest
out-degrees
,
with
the
average
out-degree
being
1.3
.
3.2
Computing
the
stationary
distribution
Each of the $K$ edge types presented above can be represented as a separate transition matrix $E_k \in \mathbb{R}^{N \times N}$, where $N$ is the total number of nodes. For each matrix, column $j$ contains a normalized outgoing probability distribution,¹ so the weight in cell $(i, j)$ contains $P_k(n_i \mid n_j)$, the conditional probability of moving from node $n_j$ to node $n_i$ in edge type $k$.
For
many
of
the
edge
types
,
this
is
either
0
or
1
,
but
for
the
weighted
edges
,
these
are
real
valued
.
The full transition matrix $M$ is then the column-normalized sum of all of the edge types:

$$M_{ij} = \frac{\sum_{k=1}^{K} (E_k)_{ij}}{\sum_{i'} \sum_{k=1}^{K} (E_k)_{i'j}}$$

$M$ is
a
distillation
of
relevant
relatedness
information
about
all
nodes
extracted
from
WordNet
and
is
not
tailored
for
computing
a
stationary
distribution
for
any
specific
word
.
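Summing the edge-type matrices and column-normalizing the result can be sketched with a sparse dictionary layout. The layout (`matrix[j][i]` holding the weight of the move from `j` to `i`), the helper name, and the two toy edge types are assumptions made for illustration.

```python
from collections import defaultdict


def combine_edge_types(edge_matrices):
    """Sum sparse edge-type matrices and normalize each column to sum to one.

    Each matrix is a dict of columns: matrix[j][i] is the weight of the
    move from node j to node i. Columns with no outgoing mass are omitted.
    """
    total = defaultdict(lambda: defaultdict(float))
    for matrix in edge_matrices:
        for j, column in matrix.items():
            for i, weight in column.items():
                total[j][i] += weight

    combined = {}
    for j, column in total.items():
        column_sum = sum(column.values())
        if column_sum > 0.0:
            combined[j] = {i: w / column_sum for i, w in column.items()}
    return combined


# Two hypothetical edge types over made-up nodes.
hypernym = {"dog": {"canine": 1.0}, "canine": {"dog": 1.0}}
gloss = {"dog": {"bark": 0.5, "canine": 0.5}}
M = combine_edge_types([hypernym, gloss])
```

Storing only the nonzero cells keeps the memory and the per-step cost proportional to the number of edges rather than to the square of the number of nodes.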
In order to compute the stationary distribution $v_{dog\#n}$ for a walk centered around the TokenPOS "dog#n," we first define an initial distribution $v^{(0)}_{dog\#n}$ that places all the probability mass in the single vector entry corresponding to "dog#n." Then at every step of the walk, we will return to $v^{(0)}$ with probability $p$.
Intuitively
,
this
return
probability
captures
the
notion
that
nodes
close
to
"
dog#n
"
should
be
given
higher
weight
,
and
also
guarantees
that
the
stationary
distribution
exists
and
is
unique
(
Bremaud
,
1999
)
.
The stationary distribution $v$ is computed via an iterative update algorithm:

$$v^{(t)} = p \, v^{(0)} + (1 - p) \, M v^{(t-1)}$$

Because the walk may return to the initial distribution $v^{(0)}$ at any step with probability $p$, we found that $v^{(t)}$ converges to its unique stationary distribution $v^{(\infty)}$ in a number of steps roughly proportional to $p^{-1}$.
We
experimented
with
a
range
of
return
probabilities
and
found
that
our
results
were
relatively
insensitive
to
this
parameter
.
Our
convergence
criterion
was
which
,
for
our
graph
with
a
return
probability
of
$p = 0.1$
,
was
met
after
about
two
dozen
iterations
.
This
computation
takes
under
two
seconds
on
a
modern
desktop
machine
.
Note
that
because
M
is
sparse
,
each
iteration
of
the
above
computation
is
linear
in
the
total
number
of
nonzero
entries
in
$M$
,
i.e.
linear
in
the
total
number
of
edges
.
Introducing
an
edge
type
that
is
dense
would
dramatically
increase
running
time
.
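The word-specific walk with a return probability can be sketched as a power iteration over a sparse transition table. The stopping rule used here (an L1 change below `tol`) and the tolerance value are assumptions, since the paper's own criterion is not reproduced in this text; the toy graph and the function name are likewise hypothetical.

```python
def stationary_distribution(M, start, restart=0.1, tol=1e-8, max_iter=1000):
    """Iterate v <- restart * v0 + (1 - restart) * M v until it stabilizes.

    M[j][i] is the probability of stepping from node j to node i, and
    v0 places all probability mass on the `start` node.
    """
    v0 = {start: 1.0}
    v = dict(v0)
    for _ in range(max_iter):
        nxt = {node: restart * mass for node, mass in v0.items()}
        for j, mass in v.items():
            for i, p_ij in M.get(j, {}).items():
                nxt[i] = nxt.get(i, 0.0) + (1.0 - restart) * mass * p_ij
        change = sum(abs(nxt.get(n, 0.0) - v.get(n, 0.0))
                     for n in set(nxt) | set(v))
        v = nxt
        if change < tol:   # assumed convergence test
            break
    return v


# Hypothetical toy chain a <-> b <-> c; columns already sum to one.
M = {
    "a": {"b": 1.0},
    "b": {"a": 0.5, "c": 0.5},
    "c": {"b": 1.0},
}
v = stationary_distribution(M, "a")
```

Each pass touches every stored edge once, so the cost per iteration is linear in the number of nonzero entries, as the text notes. In this toy chain the start node "a" ends up with more mass than "c", illustrating the bias toward the neighborhood of the target word.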
For this paper, we consider three model variants that differ based on which subset of the edge types is included in the transition matrix $M$.

¹ The frequency-count derived edges are normalized by the largest column sum. This effectively preserves relative term frequency information across the graph and causes some columns to sum to less than one. We interpret this lost mass as a link to "nowhere."
MarkovLink
This
variant
includes
the
explicit
WordNet
relations
such
as
hypernymy
and
the
edges
representing
overlap
between
the
TokenPOS
nodes
contained
in
Synsets
.
A
particle
walking
through
this
graph
reaches
only
Synset
nodes
and
can
step
from
one
Synset
to
another
whenever
WordNet
specifies
a
relationship
between
the
Synsets
or
when
the
Synsets
share
a
common
word
.
There
is
a
single
connected
component
in
this
model
variant
.
This
model
is
loosely
analogous
to
a
smoothed
version
of
the
path-based
WordNet
measures
surveyed
in
(
Budanitsky
and
Hirst
,
2006
)
but
differs
in
that
it
integrates
multiple
link
types
and
aggregates
relatedness
information
across
all
paths
in
the
graph
.
MarkovGloss
This
variant
includes
only
the
weighted
uni-directional
edges
linking
Synsets
to
the
TokenPOS
nodes
contained
in
their
gloss
definitions
,
and
the
edges
from
a
TokenPOS
node
to
the
Synsets
containing
it
.
The
intuition
behind
this
model
variant
is
that
the
particle
can
move
as
if
it
were
recursively
looking
up
words
in
a
dictionary
,
stepping
from
Synsets
to
the
Synsets
used
to
define
them
.
Because
WordNet
's
gloss
definitions
are
not
sense-tagged
,
the
particle
must
make
an
intermediate
step
to
a
TokenPOS
contained
in
the
gloss
definition
and
then
to
a
Synset
representing
a
particular
sense
of
that
TokenPOS
.
The
availability
of
sense-tagged
glosses
would
eliminate
the
noise
introduced
by
this
intermediate
step
.
The
particle
can
reach
both
Synsets
and
TokenPOS
nodes
in
this
variant
,
but
some
parts
of
the
graph
are
not
reachable
from
other
parts
.
This
model
incorporates
much
of
the
same
information
as
the
gloss-based
WordNet
measures
(
Banerjee
and
Pedersen
,
2003
;
Patwardhan
and
Pedersen
,
2006
)
but
differs
in
that
it
considers
many
more
glosses
than
just
those
in
the
immediate
neighborhoods
of
the
candidate
words
.
MarkovJoined
This
variant
is
the
natural
combination
of
the
above
two
;
we
construct
the
graph
containing
WordNet
relation
edges
,
Synset
overlap
edges
,
and
gloss-based
Synset
to
TokenPOS
edges
.
Many
of
the
characteristics
of
the
model
variants
can
be
understood
in
terms
of
how
much
probability
mass
they
assign
to
each
node
for
a
particular
word-specific
stationary
distribution
.
Table
1
shows
the
highest
scoring
nodes
in
the
word-specific
stationary
distributions
centered
around
the
Token
node
for
"
wizard
,
"
as
computed
by
the
MarkovLink
and
MarkovGloss
variants
.
Table 1: Highest scoring nodes in the stationary distributions for "wizard#n" as generated by the MarkovLink model and the MarkovGloss model with return probability 0.1. [Column headers: MarkovLink, MarkovGloss, Probability. Node labels that survive: wizard#n, wizard#a, dazzlingly#r, sorcery#n, occultist#n#1, Cagliostro#n#1, dazzlingly#r#1, breeze_through#v#1, dazzle#n, beholder#n, dazzle#v.]

In both variants, the "wizard" Token's only neighbors are the "wizard#n" and "wizard#a" TokenPOS nodes, and "wizard#n" has a higher probability mass because of its higher SemCor usage counts.
Likewise
,
the
only
possible
steps
permitted
in
either
variant
from
"
wizard#n
"
and
"
wizard#a
"
are
to
the
Synsets
that
can
be
expressed
with
those
nodes
:
"
ace#n#3
,
"
"
sorcerer#n#1
,
"
and
"
charming#a#1
.
"
Again
,
the
amount
of
mass
given
to
these
nodes
depends
on
the
strength
of
these
edge
weights
,
which
is
determined
by
the
SemCor
usage
counts
.
The
highest
probability
nodes
in
the
table
are
common
because
both
model
variants
share
the
same
initial
links
.
However
,
the
orders
of
the
remaining
nodes
in
the
stationary
distributions
are
different
.
In
the
MarkovLink
variant
,
the
random
walk
can
only
proceed
to
other
Synsets
using
WordNet
relationship
edges
;
"
track
star#n#1
"
and
"
expert#n#1
"
are
first
reached
by
following
hyponym
and
hypernym
edges
from
"
ace#n#1
,
"
and
"
occultist#n#1
"
and
"
Cagliostro#n#1
"
are
first
reached
with
hypernym
and
instance
edges
from
"
sorcerer#n#1
.
"
The
node
"
breeze
through#v#1
"
is
reached
through
a
path
following
derivational
links
with
"
ace#n
"
and
"
ace#v
.
"
The
MarkovGloss
variant
in
table
1
shows
how
information
can
be
extracted
solely
from
the
textual
glosses
.
Once
the
random
walk
reaches
the
first
Synset
nodes
,
it
can
step
to
the
TokenPOS
nodes
in
their
glosses
;
for
example
,
"
ace#n#1
"
has
the
gloss
"
someone
who
is
dazzlingly
skilled
in
any
field
.
"
Links
to
TokenPOS
nodes
that
are
very
common
in
glosses
are
down-weighted
with
NMDF
weighting
,
so
"
someone#n
"
receives
little
mass
while
"
dazzlingly#r
"
receives
more
.
From there, the random walk can step to another Synset such as "dazzlingly#r#1," and then on to other TokenPOS nodes used in its definition: "in a manner or to a degree that dazzles the beholder."

Figure 1: Example stationary distributions plotted against each other for similar (top) and dissimilar (bottom) word pairs, using the MarkovLink (left) and MarkovGloss (right) model variants.
Figure
1
demonstrates
how
two
word-specific
stationary
distributions
are
more
highly
correlated
if
the
words
are
related
.
In
both
model
variants
,
random
walks
for
related
words
are
more
likely
to
visit
the
same
parts
of
the
graph
,
and
so
assign
higher
probability
to
the
same
nodes
.
Figure
1
also
shows
that
the
MarkovGloss
variant
produces
distributions
with
a
much
wider
range
of
probabilities
than
the
MarkovLink
,
which
might
be
a
source
of
difficulty
in
integrating
the
two
model
variants
.
Figure
2
shows
the
correlation
between
the
stationary
distributions
produced
by
the
two
model
variants
for
the
same
word
.
The
log-log
scale
makes
it
possible
to
see
the
entire
range
of
probabilities
on
the
same
axes
,
and
shows
that
distributions
produced
by
these
two
model
variants
share
many
of
the
same
highest-probability
words
.
A
noteworthy
property
of
the
constructed
graphs
is
that
word
relatedness
can
be
computed
directly
by
comparing
walks
that
start
at
Token
nodes
.
By
contrast
,
existing
WordNet-based
measures
require
independent
similarity
judgments
for
all
word
senses
relevant
to
a
target
word
pair
(
of
which
the
maximum
relatedness
value
is
usually
taken
)
.
Our
algorithm
lends
itself
to
comparisons
between
walks
centered
at
a
Synset
node
,
or
a
TokenPOS
node
,
or
a
Token
node
,
or
any
mixed
distribution
thereof
.
And
because
the
Synset
nodes
are
strongly
connected
,
the
model
also
admits
direct
comparison
across
parts
of
speech
.
Figure
2
:
Correlation
of
the
stationary
distributions
for
"
wizard#n
,
"
produced
by
the
MarkovLink
variant
(
x-axis
)
and
the
MarkovGloss
variant
(
y-axis
)
.
4
Similarity
judgments
We
have
shown
how
to
compute
the
word-specific
stationary
distribution
from
any
starting
distribution
in
the
graph
.
Now
consider
the
task
of
deciding
similarity
between
two
words
.
Intuitively
,
if
the
random
walk
starting
at
the
first
word
's
node
and
the
random
walk
starting
at
the
second
word
's
node
tend
to
visit
the
same
nodes
,
we
would
like
to
consider
them
semantically
related
.
Formally
,
we
measure
the
divergence
of
their
respective
stationary
distributions
,
p
and
q.
A
wide
literature
exists
on
similarity
measures
between
probability
distributions
.
One
standard
choice
is
to
consider
p
and
q
to
be
vectors
and
measure
the
cosine
of
the
angle
between
them
,
which
is
rank
equivalent
to
Euclidean
distance
.
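Cosine similarity between two sparse stationary distributions can be sketched as below. Representing each distribution as a dictionary from node name to probability is an assumption of this example, and the node names are invented.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(mass * q.get(node, 0.0) for node, mass in p.items())
    norm_p = math.sqrt(sum(mass * mass for mass in p.values()))
    norm_q = math.sqrt(sum(mass * mass for mass in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0   # convention chosen here for empty vectors
    return dot / (norm_p * norm_q)


# Invented distributions over three nodes.
p = {"n1": 0.7, "n2": 0.3}
q = {"n2": 0.4, "n3": 0.6}
same = cosine_similarity(p, p)
disjoint = cosine_similarity({"n1": 1.0}, {"n3": 1.0})
partial = cosine_similarity(p, q)
```

Unlike KL-based measures, cosine is defined even when the two distributions place mass on entirely different nodes, in which case it is simply zero.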
Because
p
and
q
are
probability
distributions
,
we
would
also
expect
a
strong
contender
from
the
information-theoretic
measures
based
on
Kullback-Leibler
divergence
,
defined as:

$$D_{KL}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}$$

Unfortunately, KL divergence is undefined if any $q_i$ is zero because those terms in the sum will have infinite weight.
Several
modifications
to
avoid
this
issue
have
been
proposed
in
the
literature
.
One
is
Jensen-Shannon
divergence
(
Lin
,
1991
)
,
a
symmetric
measure
based
on
KL-divergence
defined
as
the
average
of
the
KL
divergences
of
each
distribution
to
their
average
distribution
.
Jensen-Shannon
is
well
defined
for
all
distributions
because
the
average
of
pi
and
qi
is
non-zero
whenever
either
number
is
.
These
measures
and
others
are
surveyed
in
(
Lee
,
2001
)
,
who
finds
that
Jensen-Shannon
is
outperformed
by
the
Skew
divergence
measure
introduced
by
Lee
in
(
1999
)
.
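KL divergence and its Jensen-Shannon symmetrization can be sketched as follows, using dictionaries as sparse distributions (an assumption of this example). Returning infinity when $q_i = 0$ and $p_i > 0$ is a convention adopted here to make the problem visible; the two toy distributions are invented.

```python
import math


def kl_divergence(p, q):
    """sum_i p_i * log(p_i / q_i), or infinity if some q_i is 0 where p_i > 0."""
    total = 0.0
    for node, p_i in p.items():
        if p_i == 0.0:
            continue   # 0 * log(0 / q) is treated as 0
        q_i = q.get(node, 0.0)
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


def jensen_shannon(p, q):
    """Average of the KL divergences of p and q to their average distribution."""
    nodes = set(p) | set(q)
    avg = {n: 0.5 * (p.get(n, 0.0) + q.get(n, 0.0)) for n in nodes}
    return 0.5 * kl_divergence(p, avg) + 0.5 * kl_divergence(q, avg)


p = {"a": 0.5, "b": 0.5}
q = {"a": 1.0}              # q has a zero where p has mass
kl_pq = kl_divergence(p, q)
js_pq = jensen_shannon(p, q)
js_qp = jensen_shannon(q, p)
```

The example pair breaks plain KL divergence but still gets a finite, symmetric Jensen-Shannon score bounded by log 2.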
The skew divergence² accounts for zeros in $q$ by mixing in a small amount of $p$:

$$s_\alpha(p, q) = D_{KL}(p \,\|\, \alpha q + (1 - \alpha) p)$$

Lee found that as $\alpha \to 1$, the performance of skew divergence on natural language tasks improves.
In
particular
,
it
outperforms
most
other
models
and
even
beats
pure
KL
divergence
modified
to
avoid
zeros
with
sophisticated
smoothing
models
.
In
exploring
the
performance
of
divergence
measures
on
our
model
's
stationary
distributions
,
we
observed
the
same
phenomenon
.
Note that in the limit as $\alpha \to 1$, alpha skew is identically KL-divergence.
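Skew divergence can be sketched by computing KL divergence from $p$ to the mixture $\alpha q + (1 - \alpha) p$. The argument order follows this paper's convention rather than Lee's original one, the dictionary representation is an assumption, and the example distributions are invented.

```python
import math


def skew_divergence(p, q, alpha=0.99):
    """KL(p || alpha * q + (1 - alpha) * p), for 0 <= alpha < 1."""
    total = 0.0
    for node, p_i in p.items():
        if p_i == 0.0:
            continue
        # The mixture is positive whenever p_i is, so the log is defined.
        mixed = alpha * q.get(node, 0.0) + (1.0 - alpha) * p_i
        total += p_i * math.log(p_i / mixed)
    return total


p = {"a": 0.5, "b": 0.5}
q_with_zero = {"a": 1.0}
q_full = {"a": 0.9, "b": 0.1}

loose = skew_divergence(p, q_with_zero, alpha=0.99)
tight = skew_divergence(p, q_with_zero, alpha=0.999)

# Exact KL(p || q_full), computed by hand for comparison.
kl_exact = 0.5 * math.log(0.5 / 0.9) + 0.5 * math.log(0.5 / 0.1)
near_kl = skew_divergence(p, q_full, alpha=0.999999)
```

When $q$ has no zeros the value approaches KL divergence as $\alpha \to 1$; when $q$ does have a zero, the penalty for that term keeps growing with $\alpha$ instead of converging.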
In
this
section
we
introduce
a
novel
measure
of
distributional
divergence
based
on
a
reinterpretation
of
the
skew
divergence
.
Skew
divergence
avoids
zeros
in
q
by
mixing
in
some
of
p
,
but
its
performance
on
many
natural
language
tasks
improves
as
it
better
approximates
KL
divergence
.
We
propose
an
alternative
approximation
to
KL
divergence
called
Zero-KL
divergence
,
or
ZKL
.
When $q_i$ is non-zero, we use exactly the term from KL divergence. When $q_i = 0$, we have a problem: in the limit as $\alpha \to 1$, the corresponding term approaches infinity. We let ZKL use the skew divergence value for these terms: $p_i \log \frac{p_i}{\alpha q_i + (1 - \alpha) p_i}$. Because $q_i = 0$, this simplifies to $p_i \log \frac{1}{1 - \alpha}$, i.e. $\gamma p_i$ with $\gamma = \log \frac{1}{1 - \alpha}$. We define the Zero-KL divergence with respect to $\gamma$:

$$ZKL_\gamma(p, q) = \sum_i \begin{cases} p_i \log \frac{p_i}{q_i} & \text{if } q_i \neq 0 \\ p_i \gamma & \text{if } q_i = 0 \end{cases}$$

Note that this is exactly KL-divergence when KL-divergence is defined and, like skew divergence, approximates KL divergence in the limit as $\gamma \to \infty$.

² In Lee's (1999) original presentation, skew divergence is defined not as $s_\alpha(p, q)$ but rather as $s_\alpha(q, p)$. We reverse the argument order for consistency with the other measures discussed here.
A similar analysis of the skew divergence terms for when $0 < q_i \ll p_i$ (and in particular with $q_i$ less than $p_i$ by more than a factor of $2^{-\gamma}$) shows that such a term in the skew divergence sum is again approximated by $\gamma p_i$.
ZKL
does
not
have
this
property
.
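A minimal sketch of ZKL as described above: ordinary KL terms where $q_i$ is non-zero and a fixed penalty of $\gamma p_i$ where $q_i = 0$. Base-2 logarithms are an assumption made here so that $\gamma$ lines up with the factor $2^{-\gamma}$ mentioned in the text; the example distributions and the function name are invented.

```python
import math


def zkl_divergence(p, q, gamma=2.0):
    """Zero-KL divergence: KL terms where q_i > 0, gamma * p_i where q_i == 0."""
    total = 0.0
    for node, p_i in p.items():
        if p_i == 0.0:
            continue
        q_i = q.get(node, 0.0)
        if q_i > 0.0:
            total += p_i * math.log2(p_i / q_i)   # base 2 is an assumption
        else:
            total += p_i * gamma
    return total


p = {"a": 0.5, "b": 0.5}
q_with_zero = {"a": 1.0}
q_full = {"a": 0.25, "b": 0.75}

# 0.5 * log2(0.5) + 0.5 * 2.0 = -0.5 + 1.0
z_zero = zkl_divergence(p, q_with_zero, gamma=2.0)

# With no zeros in q, ZKL matches KL (in bits) regardless of gamma.
z_full = zkl_divergence(p, q_full, gamma=40.0)
kl_bits = 0.5 * math.log2(0.5 / 0.25) + 0.5 * math.log2(0.5 / 0.75)
```

Only the zero terms depend on $\gamma$, so the parameter controls how heavily a node visited by one walk but never by the other is penalized.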
Because
ZKL
is
a
better
approximation
to
KL
divergence
and
because
they
have
the
same
behavior
in
the
limit
,
we
expect
ZKL
's
performance
to
dominate
that
of
skew
divergence
in
many
distributions
.
However
,
if
there
is
a
wide
range
in
the
exponent
of
noisy
terms
,
the
maximum
possible
penalty
to
such
terms
ascribed
by
skew
divergence
may
be
beneficial
.
Figure
3
shows
the
relative
performance
of
ZKL
versus
Jensen-Shannon
,
skew
divergence
,
cosine
similarity
,
and
the
Jaccard
score
(
a
measure
from
information
retrieval
)
for
correlations
with
human
judgment
on
the
MarkovLink
model
.
ZKL
consistently
outperforms
the
other
measures
on
distributions
resulting
from
this
model
,
but
ZKL
is
not
optimal
on
distributions
generated
by
our
other
models
.
The
next
section
explores
this
topic
in
more
detail
.
5
Evaluation
Traditionally
,
there
have
been
two
primary
types
of
evaluation
for
measures
of
semantic
relatedness
:
one
is
correlation
to
human
judgment
,
the
other
is
the
relative
performance
gains
of
a
task-driven
system
when
it
uses
the
measure
.
The
evaluation
here
focuses
on
correlation
with
human
judgments
of
relatedness
.
For
consistency
with
previous
literature
,
we
use
rank
correlation
(
Spearman
's
ρ
coefficient
)
rather
than
linear
correlation
when
comparing
sets
of
relatedness
judgments
because
the
rank
correlation
captures
information
about
the
relative
ordering
of
the
scores
.
However
,
it
is
worth
noting
that
many
applications
that
make
use
of
lexical
relatedness
scores
(
e.g.
as
features
to
a
machine
learning
algorithm
)
would
better
be
served
by
scores
on
a
linear
scale
with
human
judgments
.
Rubenstein
and
Goodenough
(
1965
)
solicited
human
judgments
of
semantic
similarity
for
65
pairs
of
common
nouns
on
a
scale
of
zero
to
four
.
Miller
and
Charles
(
1991
)
repeated
their
experiment
on
a
subset
of
29
noun
pairs
(
out
of
30
total
)
and
found
that
although
individuals
varied
among
their
judgments
,
in
aggregate
the
scores
were
highly
correlated
with
those
found
by
Rubenstein
and
Goodenough
(
at
ρ = .944
by
our
calculation
)
.
Resnik
(
1999
)
replicated
the
Miller
and
Charles
experiment
and
reported
that
the
average
per-subject
linear
correlation
on
the
dataset
was
around
r
=
0.90
,
providing
a
rough
upper
bound
on
any
system
's
linear
correlation
performance
with
respect
to
the
Miller
and
Charles
data
.
Figure
3
shows
that
the
ZKL
measure
on
the
MarkovLink
model
has
linear
correlation
coefficient
r = .903
—
at
the
limit
of
human
inter-annotator
agreement
.
Recently, a larger set of word relatedness judgments was obtained by Finkelstein et al. (2002) in the WordSimilarity-353 (WS-353) collection. Despite the collection's name,
the
study
instructed
participants
to
score
word
pairs
for
relatedness
(
on
a
scale
of
0
to
10
)
,
which
is
in
contrast
to
the
similarity
judgments
requested
of
the
Miller
and
Charles
(
MC
)
and
Rubenstein
and
Goodenough
(
RG
)
participants
.
For
this
reason
,
the
WordSimilarity-353
data
contains
many
pairs
that
are
not
semantically
similar
but
still
receive
high
scores
,
such
as
"
computer-software
"
at
8.81
.
WS-353
contains
pairs
that
include
non-nouns
,
such
as
"
eat-drink
,
"
one
proper
noun
not
appearing
in
WordNet
(
"
Maradona-football
"
)
,
and
some
pairs
potentially
subject
to
political
bias
.
Again
,
the
aggregate
human
judgments
correlate
well
with
earlier
data
sets
where
they
overlap
—
the
30
judgments
that
WordSimilarity-353
shares
with
the
Miller
and
Charles
data
have
ρ = .939
and
the
29
shared
with
Rubenstein
and
Goodenough
have
ρ = .904
(
by
our
calculations
)
.
We
generated
similarity
scores
for
word
pairs
in
all
three
data
sets
using
the
three
variants
of
our
walk
model
(
MarkovLink
,
MarkovGloss
,
MarkovJoined
)
and
with
multiple
distributional
distance
measures
.
We
used
the
WordNet::Similarity
package
(
Pedersen
et
al.
,
2004
)
to
compute
baseline
scores
for
several
existing
measures
,
noting
that
one
word
pair
was
not
processed
in
WS-353
because
one
of
the
words
was
missing
from
WordNet
.
The
results
are
summarized
in
Table
2
.
These
numbers
differ
slightly
from
previously
reported
scores
due
to
variations
in
the
exact
experimental
setup
,
WordNet
version
,
and
the
method
of
breaking
ties
when
computing
ρ
(
here
we
break
ties
using
the
product-moment
formulation
of
Spearman
's
rank
correlation
coefficient
)
.
It
is
worth
noting
that
in
their
experiments
,
(
Patwardhan
and
Pedersen
,
2006
)
report
that
the
Vector
method
has
rank
correlation
coefficients
of
.91 and .90
for
MC
and
RG
,
respectively
,
which
are
also
top
performing
values
.
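The rank correlation used in this evaluation can be sketched as Pearson's product-moment correlation applied to ranks. Assigning tied items the average of their ranks is an assumption about how ties are handled; the scores below are invented and are not taken from any of the data sets.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented human and system scores for four word pairs (one tie).
human = [3.9, 3.5, 0.1, 0.1]
system = [0.9, 0.8, 0.2, 0.1]
rho = spearman_rho(human, system)
```

Because only ranks enter the computation, any monotonic rescaling of the system's scores leaves the result unchanged, which is why rank correlation suits measures whose scores are not on a linear scale with human judgments.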
In
our
experiments
,
the
MarkovLink
model
with
ZKL
distance
measure
was
the
best
performing
model
overall
.
MarkovGloss
and
MarkovJoined
were
also
strong
contenders
but
with
the
cosine
measure
instead
of
ZKL
.
One
reason
for
this
distinction
is
that
the
stationary
distributions
resulting
from
the
MarkovLink
model
are
nonzero
for
all
but
the
initial
word
nodes
(
i.e.
non-zero
for
all
Synset
nodes
)
.
Consequently
,
ZKL
's
re-estimate
for
the
zero
terms
adds
little
information
.
By contrast, the MarkovGloss and MarkovJoined models include links that traverse from Synset nodes to TokenPOS nodes, resulting in a final stationary distribution with more (and more meaningful) zero/non-zero pairs. Hence the proper setting of gamma (or alpha for skew divergence) is of greater importance. ZKL's performance improves with tuning of gamma, but cosine similarity remained the more robust measure for these distributions.

Figure 3: Correlation with the Miller & Charles data sets by linear correlation (left) and rank correlation (right) for the MarkovLink model. All data points were based on one set of stationary distributions over the graph; only the divergence measure between those distributions is varied. Note that $ZKL_\gamma$ dominates both graphs but skew divergence does well for increasing $\alpha$ (computed as $1 - 2^{-\gamma}$). Gamma is swept over the range 0 to 1, then 1 through 20, then 20 through 40 at equal resolutions.

Table 2: Spearman's ρ rank correlation coefficients with human judgments using $\gamma = 2.0$ for ZKL. Note that Figure 3 demonstrates ZKL's insensitivity with regard to the parameter setting for the MarkovLink model. [Rows: MarkovLink (ZKL), MarkovGloss (cosine), MarkovJoined (cosine), Gloss Vectors, Extended Lesk, Jiang-Conrath; numeric cells missing.]
6
Conclusion
In
this
paper
,
we
have
introduced
a
new
measure
of
lexical
relatedness
based
on
the
divergence
of
the
stationary
distributions
computed
from
random
walks
over
graphs
extracted from
WordNet
.
We
have
explored
the
structural
properties
of
extracted
semantic
graphs
and
characterized
the
distinctly
different
types
of
stationary
distributions
that
result
.
We
explored
several
distance
measures
on
these
distributions
,
including
ZKL
,
a
novel
variant
of
KL-divergence
.
Our
best
relatedness
measure
is
at
the
limit
of
human
inter-annotator
agreement
and
is
one
of
the
strongest
measures
of
semantic
relatedness
that
uses
only
WordNet
as
its
underlying
lexical
resource
.
In
future
work
,
we
hope
to
integrate
other
lexical
resources
such
as
Wikipedia
into
the
walk
.
Incorporating
more
types
of
links
from
more
resources
will
underline
the
importance
of
determining
appropriate
relative
weights
for
all
of
the
types
of
edges
in
the
walk
's
matrix
.
Even
for
WordNet
,
we
believe
that
certain
link
types
,
such
as
antonyms
,
may
be
more
or
less
appropriate
for
certain
tasks
and
should be
weighted
accordingly
.
And
while
our
measure
of
lexical
relatedness
correlates
well
with
human
judgments
,
we
hope
to
show
performance
gains
in
a
real-world
task
from
the
use
of
our
measure
.
Acknowledgments
Thanks
to
Christopher
D.
Manning
and
Dan
Jurafsky
for
their
helpful
comments
and
suggestions
.
We
are
also
grateful
to
Siddharth
Patwardhan
and
Ted
Pedersen
for
assistance
in
comparing
against
their
system
.
Thanks
to
Sushant
Prakash
,
Rion
Snow
,
and
Varun
Ganapathi
for
their
advice
on
pursuing
some
of
the
ideas
in
this
paper
,
and
to
our
anonymous
reviewers
for
their
helpful
critiques
.
Daniel
Ramage
was
funded
in
part
by
an
NDSEG
fellowship
.
This
work
was
also
supported
in
part
by
the
DTO
AQUAINT
Program
,
the
DARPA
GALE
Program
,
and
the
ONR
(
MURI
award
N000140510388
)
.
