We develop latent Dirichlet allocation with WordNet (LDAWN), an unsupervised probabilistic topic model that includes word sense as a hidden variable. We develop a probabilistic posterior inference algorithm for simultaneously disambiguating a corpus and learning the domains in which to consider each word. Using the WordNet hierarchy, we embed the construction of Abney and Light (1999) in the topic model and show that automatically learned domains improve WSD accuracy compared to alternative contexts.
1 Introduction

Word sense disambiguation (WSD) is the task of determining the meaning of an ambiguous word in its context. It is an important problem in natural language processing (NLP) because effective WSD can improve systems for tasks such as information retrieval, machine translation, and summarization.
In this paper, we develop latent Dirichlet allocation with WordNet (LDAWN), a generative probabilistic topic model for WSD where the sense of the word is a hidden random variable that is inferred from data. There are two central advantages to this approach.
First, with LDAWN we automatically learn the context in which a word is disambiguated. Rather than disambiguating at the sentence level or the document level, our model uses the other words that share the same hidden topic across many documents.
Second, LDAWN is a fully-fledged generative model. Generative models are modular and can be easily combined and composed to form more complicated models. (As a canonical example, the ubiquitous hidden Markov model is a series of mixture models chained together.) Thus, developing a generative model for WSD gives other generative NLP algorithms a natural way to take advantage of the hidden senses of words.
In general, topic models are statistical models of text that posit a hidden space of topics in which the corpus is embedded (Blei et al., 2003). Given a corpus, posterior inference in topic models amounts to automatically discovering the underlying themes that permeate the collection. Topic models have recently been applied to information retrieval (Wei and Croft, 2006), text classification (Blei et al., 2003), and dialogue segmentation (Purver et al., 2006).
While topic models capture the polysemous use of words, they do not carry the explicit notion of sense that is necessary for WSD. LDAWN extends the topic modeling framework to include a hidden meaning in the word generation process. In this case, posterior inference discovers both the topics of the corpus and the meanings assigned to each of its words.
After introducing a disambiguation scheme based on probabilistic walks over the WordNet hierarchy (Section 2), we embed the WordNet-Walk in a topic model, where each topic is associated with walks that prefer different neighborhoods of WordNet (Section 2.1). Then, we describe a Gibbs sampling algorithm for approximate posterior inference that learns the senses and topics that best explain a corpus (Section 3). Finally, we evaluate our system on real-world WSD data, discuss the properties of the topics and disambiguation accuracy results, and draw connections to other WSD algorithms from the research literature.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1024-1033, Prague, June 2007. © 2007 Association for Computational Linguistics.
Figure 1: The possible paths to reach the word "colt" in WordNet. Dashed lines represent omitted links. All words in the synset containing "revolver" are shown, but only one word from other synsets is shown. Edge labels are probabilities of transitioning from synset i to synset j. Note how this favors frequent terms, such as "revolver," over ones like "six-shooter."
2 Topic models and WordNet
K — the number of topics
β_{k,s} — transition probabilities to the successors of synset s in topic k
S — a scalar that, when multiplied by a_s, gives the parameter of the Dirichlet prior over transitions
a_s — a normalized vector whose ith entry, when multiplied by S, gives the prior probability for going from s to i
θ_d — multinomial probability vector over the topics that generate document d
z — assignment of a word to a topic
λ — a path assignment through WordNet ending at a word

Table 1: A summary of the notation used in the paper. Bold vectors correspond to collections of variables (i.e., z_u refers to the topic of a single word, but z_{1:D} are the topic assignments of words in documents 1 through D).
The WordNet-Walk is a probabilistic process of word generation that is based on the hyponymy relationship in WordNet (Miller, 1990). WordNet, a lexical resource designed by psychologists and lexicographers to mimic the semantic organization in the human mind, links "synsets" (short for synonym sets) with myriad connections. The specific relation we're interested in, hyponymy, points from general concepts to more specific ones and is sometimes called the "is-a" relationship.
As first described by Abney and Light (1999), we imagine an agent who starts at synset [entity], which points to every noun in WordNet 2.1 by some sequence of hyponymy relations, and then chooses the next node in its random walk from the hyponyms of its current position. The agent repeats this process until it reaches a leaf node, which corresponds to a single word (each of the synset's words is a unique leaf of the synset in our construction). For an example of all the paths that might generate the word "colt," see Figure 1.
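The agent's walk can be sketched in a few lines of Python. The tiny hierarchy and transition probabilities below are made-up illustrations, not actual WordNet data:

```python
import random

# Toy fragment of a hyponymy hierarchy: each internal node maps to
# (child, probability) pairs; names with no entry are leaf words.
walk_probs = {
    "entity": [("artifact", 0.5), ("animal", 0.5)],
    "artifact": [("revolver_synset", 1.0)],
    "animal": [("foal_synset", 1.0)],
    "revolver_synset": [("revolver", 0.8), ("six-shooter", 0.1), ("colt", 0.1)],
    "foal_synset": [("colt", 0.6), ("foal", 0.4)],
}

def wordnet_walk(root="entity"):
    """Walk down from the root, sampling a child at each synset until a
    leaf (a word) is reached; return the word and the path taken."""
    node, path = root, [root]
    while node in walk_probs:
        children, weights = zip(*walk_probs[node])
        node = random.choices(children, weights=weights)[0]
        path.append(node)
    return node, path

word, path = wordnet_walk()
```

Note how the same leaf word ("colt") can be reached along distinct paths, which is exactly why the synset that produced a word must be treated as hidden.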
The WordNet-Walk is parameterized by a set of distributions over children for each synset s in WordNet, β_s.
The WordNet-Walk has two important properties. First, it describes a random process for word generation. It is therefore a distribution over words and can be integrated into any generative model of text, such as topic models. Second, the synset that produces each word is a hidden random variable. Given a word assumed to be generated by a WordNet-Walk, we can use posterior inference to predict which synset produced the word.
These properties allow us to develop LDAWN, which is a fusion of these WordNet-Walks and latent Dirichlet allocation (LDA) (Blei et al., 2003), a probabilistic model of documents that is an improvement over pLSI (Hofmann, 1999).
LDA assumes that there are K "topics," multinomial distributions over words, which describe a collection. Each document exhibits multiple topics, and each word in each document is associated with one of them.
Although the term "topic" evokes a collection of ideas that share a common theme, and although the topics derived by LDA seem to possess semantic coherence, there is no reason to believe this would be true of the most likely multinomial distributions that could have created the corpus given the assumed generative model. That semantically similar words are likely to occur together is a byproduct of how language is actually used.
In LDAWN, we replace the multinomial topic distributions with a WordNet-Walk, as described above. LDAWN assumes a corpus is generated by the following process (for an overview of the notation used in this paper, see Table 1).

1. For each topic k = 1, ..., K:
   (a) For each synset s, randomly choose transition probabilities β_{k,s} ~ Dir(S · a_s).
2. For each document d:
   (a) Select a topic distribution θ_d.
   (b) For each word n in document d:
      i. Select a topic z ~ Mult(θ_d).
      ii. Create a path λ_{d,n} starting with λ_0 as the root node. From each node λ_i:
         A. Choose the next node in the walk, λ_{i+1}, according to the transition probabilities β_{z,λ_i}.
         B. If λ_{i+1} is a leaf node, generate the associated word. Otherwise, repeat.
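Step 1 of this process, drawing a separate transition distribution β_{k,s} ~ Dir(S · a_s) for each topic, can be sketched as follows; the hierarchy fragment, prior vectors, and settings of K and S are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 2     # number of topics
S = 10.0  # scale of the Dirichlet prior

# Prior transition vectors a_s (each sums to one) for two toy synsets.
a = {
    "entity": np.array([0.7, 0.3]),                # toward [artifact, animal]
    "revolver_synset": np.array([0.8, 0.1, 0.1]),  # toward its leaf words
}

# One WordNet-Walk per topic: beta[k][s] ~ Dirichlet(S * a_s).
# Each topic thus gets its own perturbation of the shared prior.
beta = [{s: rng.dirichlet(S * a_s) for s, a_s in a.items()} for _ in range(K)]
```

Because every topic draws its own β_{k,s}, different topics can come to prefer different neighborhoods of the hierarchy while sharing the same prior.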
Every element of this process, including the synsets, is hidden except for the words of the documents. Thus, given a collection of documents, our goal is to perform posterior inference, which is the task of determining the conditional distribution of the hidden variables given the observations.
In the case of LDAWN, the hidden variables are the parameters of the K WordNet-Walks, the topic assignments of each word in the collection, and the synset path of each word. In a sense, posterior inference reverses the process described above.
Specifically, given a document collection w_{1:D}, the full posterior is

  p(β_{1:K}, θ_{1:D}, z_{1:D}, λ_{1:D} | w_{1:D}) ∝ ∏_k p(β_k | S·a) ∏_d p(θ_d) ∏_n p(z_{d,n} | θ_d) p(λ_{d,n} | β, z_{d,n}),   (1)

where the constant of proportionality is the marginal likelihood of the observed data.
Note that by encoding the synset paths as a hidden variable, we have posed the WSD problem as a question of posterior probabilistic inference. Further note that we have developed an unsupervised model; no labeled data is needed to disambiguate a corpus.
Learning the posterior distribution amounts to simultaneously decomposing a corpus into topics and its words into their synsets. The intuition behind LDAWN is that the words in a topic will have similar meanings and thus share paths within WordNet.
For example, WordNet has two senses for the word "colt": one referring to a young male horse and the other to a type of handgun (see Figure 1).
Although we have no a priori way of knowing which of the two paths to favor for a document, we assume that similar concepts will also appear in the document. Documents with unambiguous nouns such as "six-shooter" and "smoothbore" would make paths that pass through the synset [firearm, piece, small-arm] more likely than those going through [animal, animate being, beast, brute, creature, fauna].
In practice, we hope to see a WordNet-Walk that looks like Figure 2, which points to the right sense of "cancer" for a medical context.
LDAWN is a Bayesian framework, as each variable has a prior distribution. In particular, the Dirichlet prior for β_s, specified by a scaling factor S and a normalized vector a_s, fulfills two functions.
First, as the overall strength of S increases, we place a greater emphasis on the prior. This is equivalent to the need for balancing as noted by Abney and Light (1999).
The other function that the Dirichlet prior serves is to enable us to encode any information we have about how we suspect the transitions to child nodes will be distributed. For instance, we might expect that the words associated with a synset will be produced in a way roughly similar to the token probability in a corpus.
For example, even though "meal" might refer to either ground cereals or food eaten at a single sitting, and "repast" exclusively to the latter, the synset [meal, repast, food eaten at a single sitting] still prefers to transition to "meal" over "repast" given the overall corpus counts (see Figure 1, which shows prior transition probabilities for "revolver").
By setting a_{s→i}, the prior probability of transitioning from synset s to node i, proportional to the total number of observed tokens in the children of i, we introduce a probabilistic variation on information content (Resnik, 1995).
As in Resnik's definition, this value for non-word nodes is equal to the sum of all the frequencies of hyponym words. Unlike Resnik, we do not divide frequency among all senses of a word; each sense of a word contributes its full frequency to a.
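This prior computation can be sketched as follows, under an assumed toy hierarchy with made-up corpus counts (not real WordNet or corpus data):

```python
from collections import Counter

# Hypothetical hierarchy: internal nodes map to children; leaves are words.
children = {
    "firearm": ["revolver_synset", "smoothbore_synset"],
    "revolver_synset": ["revolver", "six-shooter", "colt"],
    "smoothbore_synset": ["smoothbore"],
}

# Invented token counts; each sense of a word gets the full frequency.
corpus_counts = Counter({"revolver": 50, "six-shooter": 5, "colt": 20,
                         "smoothbore": 10})

def subtree_count(node):
    """Total frequency of all hyponym words under a node (the
    probabilistic analog of Resnik's information content)."""
    if node not in children:  # leaf: a word
        return corpus_counts[node]
    return sum(subtree_count(c) for c in children[node])

def prior_vector(node):
    """a_s: prior transition probabilities from a synset to its children,
    proportional to the token counts in each child's subtree."""
    counts = [subtree_count(c) for c in children[node]]
    total = sum(counts)
    return [c / total for c in counts]

prior_vector("firearm")  # ≈ [0.882, 0.118]: the revolver subtree dominates
```

With these counts, mass flows toward frequently observed subtrees, which is the behavior illustrated for "revolver" versus "six-shooter" in Figure 1.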
3 Posterior Inference with Gibbs Sampling
As described above, the problem of WSD corresponds to posterior inference: determining the probability distribution of the hidden variables given observed words and then selecting the synsets of the most likely paths as the correct sense. Directly computing this posterior distribution, however, is not tractable because of the difficulty of calculating the normalizing constant in Equation 1.
To approximate the posterior, we use Gibbs sampling, which has proven to be a successful approximate inference technique for LDA (Griffiths and Steyvers, 2004).
In Gibbs sampling, like all Markov chain Monte Carlo methods, we repeatedly sample from a Markov chain whose stationary distribution is the posterior of interest (Robert and Casella, 2004). Even though we don't know the full posterior, the samples can be used to form an empirical estimate of the target distribution.
In LDAWN, the samples contain a configuration of the latent semantic states of the system, revealing the hidden topics and paths that likely led to the observed data. Gibbs sampling reproduces the posterior distribution by repeatedly sampling each hidden variable conditioned on the current state of the other hidden variables and observations.
More precisely, the state is given by a set of assignments where each word is assigned to a path through one of K WordNet-Walk topics: the uth word w_u has a topic assignment z_u and a path assignment λ_u. We use z_{-u} and λ_{-u} to represent the topic and path assignments of all words except for u, respectively.
Sampling a new topic for the word w_u requires us to consider all of the paths that w_u can take in each topic and the topics of the other words in the document that u appears in.
The probability of w_u taking on topic i is proportional to

  p(z_u = i | z_{-u}) Σ_λ p(λ, w_u | z_u = i, λ_{-u}),   (2)

which is the probability of selecting z from θ_d times the probability of a path generating w_u in the ith WordNet-Walk.
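This sampling step can be sketched as follows. The document counts, per-topic word probabilities, and the smoothing constant tau are invented for illustration; the actual hyperparameter on θ_d is not specified in this excerpt:

```python
import random

def sample_topic(doc_topic_counts, word_prob_per_topic, tau=1.0):
    """Sample z_u with weight (n_{d,j}^{-u} + tau) * p(w_u | topic j),
    i.e., document-level topic popularity times how well each topic's
    WordNet-Walk explains the word. tau is an assumed smoothing constant."""
    weights = [(doc_topic_counts[j] + tau) * word_prob_per_topic[j]
               for j in range(len(word_prob_per_topic))]
    return random.choices(range(len(weights)), weights=weights)[0]

# Made-up values: counts of the document's other words per topic, and the
# probability that each topic's walk generates w_u (summed over paths).
z = sample_topic([3, 0, 1], [0.001, 0.02, 0.005])
```

A topic can win either because the document already uses it heavily or because its walk assigns the word high probability; the product trades these off.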
The first term, the topic probability of the uth word, is based on the assignments to the K topics for words other than u in this document; n^{-u}_{d,j} denotes the number of words other than u assigned to topic j in the document d that u appears in.
The second term in Equation 2 is a sum over the probabilities of every path that could have generated the word w_u. In practice, this sum can be computed using a dynamic program for all nodes that have a unique parent (i.e., those that can't be reached by more than one path).
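For nodes processed in an order where parents precede children, the sum over paths reduces to a simple forward dynamic program; the hierarchy and transition probabilities below are toy illustrations:

```python
# Hypothetical transition probabilities beta[i][j] for one topic's walk;
# we sum the probability of every root-to-leaf path ending in a word.
beta = {
    "entity": {"artifact": 0.5, "animal": 0.5},
    "artifact": {"revolver_synset": 1.0},
    "animal": {"foal_synset": 1.0},
    "revolver_synset": {"colt": 0.1, "revolver": 0.9},
    "foal_synset": {"colt": 0.6, "foal": 0.4},
}

def word_probability(word, root="entity"):
    """Dynamic program: reach[i] accumulates the total probability of all
    paths from the root to node i; the word's probability is reach[word]."""
    reach = {root: 1.0}
    # Process internal nodes so that parents always come before children.
    order = ["entity", "artifact", "animal", "revolver_synset", "foal_synset"]
    for i in order:
        for j, p in beta.get(i, {}).items():
            reach[j] = reach.get(j, 0.0) + reach[i] * p
    return reach.get(word, 0.0)

word_probability("colt")  # 0.5*1.0*0.1 + 0.5*1.0*0.6 = 0.35
```

Each node's mass is computed once and shared by all paths through it, avoiding the exponential enumeration of individual paths.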
Although the probability of a path is specific to the topic, as the transition probabilities for a synset are different across topics, we will omit the topic index in the equations that follow.
3.1 Transition Probabilities
Computing the probability of a path requires us to take a product over our estimates of the probability of transitioning from i to j for all nodes i and j in the path λ.
The other path assignments within this topic, however, play an important role in shaping the transition probabilities. From the perspective of a single node i, only paths that pass through that node affect the probability of u also passing through that node.
It's convenient to have an explicit count of all of the paths that transition from i to j in this topic's WordNet-Walk, so we use T^{-u}_{i→j} to represent the number of paths that go from i to j in this topic, other than the path currently assigned to u.
Given the assignment of all other words to paths, calculating the probability of transitioning from i to j with word u requires us to consider the prior a and the observations T^{-u}_{i→j} in our estimate of the expected value of the probability of transitioning from i to j,

  E[p(i → j)] = (T^{-u}_{i→j} + S · a_{i,j}) / (Σ_{j'} T^{-u}_{i→j'} + S).

As mentioned in Section 2.1, we parameterize the prior for synset i as a vector a_i, which sums to one, and a scale parameter S.
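This estimate is the standard posterior mean of a Dirichlet-multinomial; a sketch with made-up counts and prior:

```python
def transition_estimate(T_i, a_i, S, j):
    """Expected probability of transitioning from synset i to child j,
    combining path counts T_i (excluding the current word u) with the
    Dirichlet prior S * a_i."""
    total = sum(T_i.values())
    return (T_i.get(j, 0) + S * a_i[j]) / (total + S)

# Made-up counts and prior over three children of one synset.
T_i = {"c1": 8, "c2": 2}                     # c3 unused by other paths
a_i = {"c1": 0.5, "c2": 0.3, "c3": 0.2}      # normalized prior vector
S = 5.0                                      # prior strength

p = transition_estimate(T_i, a_i, S, "c3")   # (0 + 1.0) / 15 ≈ 0.0667
```

Because a_i sums to one, the denominator only grows by S, so an unused child like c3 still receives prior-proportional mass rather than probability zero.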
The next step, once we've selected a topic, is to select a path within that topic. This requires the computation of the path probabilities as specified in Equation 4 for all of the paths w_u can take in the sampled topic and then sampling from the path probabilities.
The Gibbs sampler is essentially a randomized hill climbing algorithm on the posterior likelihood as a function of the configuration of hidden variables. The numerator of Equation 1 is proportional to that posterior and thus allows us to track the sampler's progress. We assess convergence to a local mode of the posterior by monitoring this quantity.
4 Experiments
In this section, we describe the properties of the topics induced by running the previously described Gibbs sampling method on corpora and how these topics improve WSD accuracy.
Of the two data sets used during the course of our evaluation, the primary dataset was SemCor (Miller et al., 1993), which is a subset of the Brown corpus with many nouns manually labeled with the correct WordNet sense.
The words in this dataset are lemmatized, and multi-word expressions that are present in WordNet are identified. Only the words in SemCor were used in the Gibbs sampling procedure; the synset assignments were only used for assessing the accuracy of the final predictions.
We also used the British National Corpus, which is not lemmatized and which does not have multi-word expressions. The text was first run through a lemmatizer, and then sequences of words that matched a multi-word expression in WordNet were joined together into a single word.
We took nouns that appeared in SemCor twice or in the BNC at least 25 times and used the BNC to compute the information-content analog a for individual nouns (for example, the probabilities in Figure 1 correspond to a).
Like the topics created by models such as LDA, the topics in Table 2 coalesce around reasonable themes.
The word list was compiled by summing over all of the possible leaves that could have generated each of the words and sorting the words by decreasing probability.
In the vast majority of cases, a single synset's high probability is responsible for the words' positions on the list.
Reassuringly, many of the top senses for the words present correspond to the most frequent sense in SemCor.
For example, in Topic 4, the senses for "space" and "function" correspond to the top senses in SemCor, and while the top sense for "set" corresponds to "an abstract collection of numbers or symbols" rather than "a group of the same kind that belong together and are so used," it makes sense given the math-based words in the topic. "Point," however, corresponds to the sense used in the phrase "I got to the point of boiling the water," which is neither the top SemCor sense nor a sense that makes sense given the other words in the topic.
While the topics presented in Table 2 resemble the topics one would obtain through models like LDA (Blei et al., 2003), they are not identical.
Because of the lengthy process of Gibbs sampling, we initially thought that using LDA assignments as an initial state would converge faster than a random initial assignment. While this was the case, it converged to a state that was less probable than the one reached from a random initialization and no better at sense disambiguation (and sometimes worse).
The topics presented in Table 2 represent words that both co-occur in a corpus and co-occur on paths through WordNet. Because topics created through LDA only have the first property, they usually do worse in terms of both total probability and disambiguation accuracy (see Figure 3).
Another interesting property of topics in LDAWN is that, with higher levels of smoothing, words that don't appear in a corpus (or appear rarely) but are in similar parts of WordNet might have relatively high probability in a topic.
For example, "maturity" in Topic 2 in Table 2 is sandwiched between "foot" and "center," both of which occur about five times more often than "maturity." This might improve LDA-based information retrieval schemes (Wei and Croft, 2006).
Figure 2: The possible paths to reach the word "cancer" in WordNet, along with transition probabilities from the medically-themed Topic 2 in Table 2, with the most probable path highlighted. The dashed lines represent multiple links that have been consolidated, and synsets are represented by their offsets within WordNet 2.1. Some words for immediate hypernyms have also been included to give context. In all other topics, the person, animal, or constellation senses were preferred.
[Table 2 word columns, flattened in extraction: president, material, treatment, election, function, administration, official, requirement, polynomial, audience, yesterday, operator, component, production, maturity, communication, direction, petitioner, movement, interest, relationship]

Table 2: The most probable words from six randomly chosen WordNet-Walks from a thirty-two topic model trained on the words in SemCor. These are summed over all of the possible synsets that generate the words. However, the vast majority of the contributions come from a single synset.
Figure 3: Topics seeded with LDA initially have a higher disambiguation accuracy, but are quickly matched by unseeded topics. The probability for the seeded topics starts lower and remains lower.
Because the Dirichlet smoothing factor in part determines the topics, it also affects the disambiguation. Figure 4 shows the modal disambiguation achieved for each of the settings of S ∈ {0.1, 1, 5, 10, 15, 20}.
Each line is one setting of K, and each point on the line is a setting of S. Each data point is a run of the Gibbs sampler for 10,000 iterations.
The disambiguation, taken at the mode, improved with moderate settings of S, which suggests that the data are still sparse for many of the walks, although the improvement vanishes if S dominates with much larger values. This makes sense: each walk has over 100,000 parameters, there are fewer than 100,000 words in SemCor, and each word only serves as evidence for at most 19 parameters (the length of the longest path in WordNet).

Figure 4: Each line represents experiments with a set number of topics and variable amounts of smoothing on the SemCor corpus. The random baseline is at the bottom of the graph, and adding topics improves accuracy. As smoothing increases, the prior (based on token frequency) becomes stronger. Accuracy is the percentage of correctly disambiguated polysemous words in SemCor at the mode.
Generally, a greater number of topics increased the accuracy of the mode, but after around sixteen topics, gains became much smaller.
The effect of S is also related to the number of topics: a value of S that might overwhelm the observed data for a very large number of topics might be the perfect balance for a smaller number of topics.
For comparison, the method of using a WordNet-Walk applied to smaller contexts such as sentences or documents achieves an accuracy of between 26% and 30%, depending on the level of smoothing.
5 Error Analysis
This method works well in cases where the correct sense can be readily determined from the overall topic of the document. Words such as "kid," "may," "shear," "coach," "incident," "fence," "bee," and (previously used as an example) "colt" were all perfectly disambiguated by this method.
Figure 2 shows the WordNet-Walk corresponding to a medical topic that correctly disambiguates "cancer."
Problems arose, however, with highly frequent words, such as "man" and "time," that have many senses and can occur in many types of documents.
For example, "man" can be associated with many possible meanings: island, game equipment, servant, husband, a specific mammal, etc.
Although we know that the "adult male" sense should be preferred, the alternative meanings will also be likely if they can be assigned to a topic that shares common paths in WordNet; the documents, however, contain many other places, jobs, and animals that are reasonable explanations (to LDAWN) of how "man" was generated.
Unfortunately, "man" is such a ubiquitous term that topics, which are derived from the frequency of words within an entire document, are ultimately uninformative about its usage.
While mistakes on these highly frequent terms significantly hurt our accuracy, errors associated with less frequent terms reveal that WordNet's structure is not easily transformed into a probabilistic graph.
For instance, there are two senses of the word "quarterback," a player in American football: one is the position itself, and the other is a person playing that position.
While one would expect co-occurrence in sentences such as "quarterback is an easy position, so our quarterback is happy," the paths to both terms share only the root node, thus making it highly unlikely that a topic would cover both senses.
Because of WordNet's breadth, rare senses also impact disambiguation.
For example, the metonymical use of "door" to represent a whole building, as in the phrase "girl next door," is under the same parent as sixty other synsets containing "bridge," "balcony," "body," "arch," "floor," and "corner."
Surrounded by such common terms, which are also likely to co-occur with the more conventional meanings of "door," this very rare sense becomes the preferred disambiguation of "door."
6 Related Work
Abney and Light's initial probabilistic WSD approach (1999) was further developed into a Bayesian network model by Ciaramita and Johnson (2000), who likewise used the appearance of monosemous terms close to ambiguous ones to "explain away" the usage of ambiguous terms in selectional restrictions. We have adapted these approaches and put them into the context of a topic model.
Recently, other approaches have created ad hoc connections between synsets in WordNet and then considered walks through the newly created graph.
Given the difficulties of using existing connections in WordNet, Mihalcea (2005) proposed creating links between adjacent synsets that might comprise a sentence, initially setting weights equal to the Lesk overlap between the pairs, and then using the PageRank algorithm to determine the stationary distribution over synsets.
Yarowsky was one of the first to contend that "there is one sense for discourse" (1992). This has led to approaches like that of Magnini et al. (2001) that attempt to find the category of a text, select the most appropriate synset, and then assign the selected sense using domain annotations attached to WordNet.
LDAWN is different in that the categories are not a priori concepts that must be painstakingly annotated within WordNet, and it requires no augmentation of WordNet. This technique could indeed be used with any hierarchy.
Our concepts are the ones that best partition the space of documents and do the best job of describing the distinctions of diction that separate documents from different domains.
6.2 Similarity Measures
Our approach gives a probabilistic method of using information content (Resnik, 1995) as a starting point that can be adjusted to cluster the words in a given topic together; this is similar to the Jiang-Conrath similarity measure (1997), which has been used in many applications in addition to disambiguation. Patwardhan (2003) offers a broad evaluation of similarity measures for WSD.
Our technique for combining the cues of topics and distance in WordNet is adjusted in a way similar in spirit to Buitelaar and Sacaleanu (2001), but we consider the appearance of a single term to be evidence not just for that sense and its immediate neighbors in the hyponymy tree but for all of the sense's children and ancestors.
Like McCarthy (2004), our unsupervised system acquires a single predominant sense for a domain based on a synthesis of information derived from a textual corpus (topics) and a WordNet-derived similarity (a probabilistic information content measure).
By adding syntactic information from a thesaurus derived from syntactic features (taken from Lin's automatically generated thesaurus (1998)), McCarthy achieved 48% accuracy in a similar evaluation on SemCor; LDAWN is thus substantially less effective in disambiguation compared to state-of-the-art methods.
This suggests, however, that other methods might be improved by adding topics and that our method might be improved by using more information than word counts.
7 Conclusion and Future Work
The LDAWN model presented here makes two contributions to research in automatic word sense disambiguation. First, we demonstrate a method for automatically partitioning a document into topics that includes explicit semantic information.
Second, we show that, at least for one simple model of WSD, embedding a document in probabilistic latent structure, i.e., a "topic," can improve WSD.
There are two avenues of research with LDAWN that we will explore. First, the statistical nature of this approach allows LDAWN to be used as a component in larger models for other language tasks.
Other probabilistic models of language could insert the ability to query synsets or paths of WordNet. Similarly, any topic-based information retrieval scheme could employ topics that include semantically relevant (but perhaps unobserved) terms.
Incorporating this model in a larger syntactically-aware model, which could benefit from the local context as well as the document-level context, is an important component of future research.
Second, the results presented here show a marked improvement in accuracy as more topics are added to the baseline model, although the final result is not comparable to state-of-the-art techniques.
As most errors were attributable to the hyponymy structure of WordNet, incorporating the novel use of topic modeling presented here with a more mature unsupervised WSD algorithm to replace the underlying WordNet-Walk could lead to advances in state-of-the-art unsupervised WSD accuracy.
