This paper presents a novel approach for exploiting global context in the task of word sense disambiguation (WSD). This is done by using topic features constructed with the latent Dirichlet allocation (LDA) algorithm on unlabeled data. The features are incorporated into a modified naive Bayes network alongside other features, such as the part-of-speech of neighboring words, single words in the surrounding context, local collocations, and syntactic patterns. In both the English all-words task and the English lexical sample task, the method achieved significant improvement over the simple naive Bayes classifier and higher accuracy than the best official scores on Senseval-3 for both tasks.
1 Introduction
Natural language tends to be ambiguous. A word often has more than one meaning, depending on the context. Word sense disambiguation (WSD) is a natural language processing (NLP) task in which the correct meaning (sense) of a word in a given context is to be determined.
The supervised corpus-based approach has been the most successful in WSD to date. In such an approach, a corpus in which ambiguous words have been annotated with their correct senses is first collected. Knowledge sources, or features, are extracted from the context of each annotated word to form the training data. A learning algorithm, such as the support vector machine (SVM) or naive Bayes, is then applied to the training data to learn a model. Finally, in testing, the learnt model is applied to the test data to assign the correct sense to each ambiguous word.
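As a hedged illustration of this pipeline, the following sketch trains a toy naive Bayes sense classifier with Laplace smoothing; the sense-annotated instances, context features and sense labels are invented for illustration and are not from the paper's data.

```python
import math
from collections import Counter, defaultdict

# Toy sense-annotated corpus for the ambiguous word "bank":
# each instance is (context features, annotated sense).
train = [
    (["river", "water", "fish"], "bank/shore"),
    (["money", "loan", "account"], "bank/finance"),
    (["deposit", "money", "teller"], "bank/finance"),
    (["mud", "river", "erosion"], "bank/shore"),
]

# Estimate p(s) and p(f | s) with add-one (Laplace) smoothing.
sense_count = Counter(s for _, s in train)
feat_count = defaultdict(Counter)
vocab = set()
for feats, s in train:
    for f in feats:
        feat_count[s][f] += 1
        vocab.add(f)

def classify(feats):
    # Pick the sense maximizing log p(s) + sum_i log p(f_i | s).
    best, best_lp = None, -math.inf
    for s, sc in sense_count.items():
        lp = math.log(sc / len(train))
        total = sum(feat_count[s].values())
        for f in feats:
            lp += math.log((feat_count[s][f] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = s, lp
    return best

print(classify(["loan", "money"]))   # financial context
print(classify(["river", "mud"]))    # riverside context
```

The same structure (estimate parameters on annotated data, then score each candidate sense of a test instance) underlies the more elaborate network in Section 4.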
The features used in these systems usually include local features, such as the part-of-speech (POS) of neighboring words, local collocations, and syntactic patterns, as well as global features such as single words in the surrounding context (bag-of-words) (Lee and Ng, 2002). However, due to the data scarcity problem, these features are usually very sparse in the training data.
There are, on average, 11 and 28 training cases per sense in the Senseval-2 and Senseval-3 lexical sample tasks respectively, and 6.5 training cases per sense in the SemCor corpus. This problem is especially prominent for the bag-of-words features: hundreds of bag-of-words features are usually extracted for each training instance, and each feature could be drawn from any English word. A direct consequence is that the global context information, which the bag-of-words features are supposed to capture, may be poorly represented.
Our approach addresses this problem by clustering features to relieve the scarcity problem, specifically for the bag-of-words features. In the process, we construct topic features, trained using the latent Dirichlet allocation (LDA) algorithm. We train the topic model (Blei et al., 2003) on unlabeled data, clustering the words occurring in the corpus into a predefined number of topics. We then use the resulting topic model to tag the bag-of-words in the labeled corpus with topic distributions.
We incorporate the distributions, called the topic features, using a simple Bayesian network modified from the naive Bayes model, alongside other features, and train the model on the labeled corpus.

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1015-1023, Prague, June 2007. © 2007 Association for Computational Linguistics
The approach gives good performance on both the lexical sample and all-words tasks on Senseval data. This paper makes two main contributions. First, we show that a feature that efficiently captures global context information using the LDA algorithm can significantly improve WSD accuracy. Second, we obtain this feature from unlabeled data, which spares us from any manual labeling work. We also showcase the potential strength of Bayesian networks in the WSD task, obtaining performance that rivals state-of-the-art methods.
2 Related Work
Many WSD systems try to tackle the data scarcity problem. Unsupervised learning was introduced primarily to deal with the problem, but with limited success (Snyder and Palmer, 2004). In another approach, the learning algorithm borrows training instances from other senses, effectively increasing the training data size. In (Kohomban and Lee, 2005), the classifier is trained on senses grouped according to WordNet top-level synsets for verbs and nouns, thus effectively pooling training cases across senses within the same synset. Similarly, (Ando, 2006) exploits data from related tasks, using all labeled examples irrespective of target words for learning each sense with the Alternating Structure Optimization (ASO) algorithm (Ando and Zhang, 2005a; Ando and Zhang, 2005b).
Parallel texts were proposed in (Resnik and Yarowsky, 1997) as potential training data, and (Chan and Ng, 2005) showed that using automatically gathered parallel texts for nouns can significantly increase WSD accuracy, when tested on the Senseval-2 English all-words task. Our approach is somewhat similar to that of using generic language features such as POS tags: the words are tagged with semantic topics that may be trained from other corpora.
3 Feature Construction
We first present the latent Dirichlet allocation algorithm and its inference procedures, adapted from the original paper (Blei et al., 2003).

3.1 Latent Dirichlet Allocation
LDA is a probabilistic model for collections of discrete data and has been used in document modeling and text classification. It can be represented as a three-level hierarchical Bayesian model, shown graphically in Figure 1. Given a corpus consisting of M documents, LDA models each document using a mixture over K topics, which are in turn characterized as distributions over words.
Figure 1: Graphical Model for LDA
In the generative process of LDA, for each document d we first draw the mixing proportion over topics θ_d from a Dirichlet prior with parameters α. Next, for each of the N_d words w_dn in document d, a topic z_dn is first drawn from a multinomial distribution with parameters θ_d. Finally, w_dn is drawn from the topic-specific distribution over words.
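The generative process just described can be sketched directly; the corpus size, the number of topics, α and β below are toy assumptions for illustration, not the paper's settings.

```python
import random

random.seed(0)
K, V = 3, 8           # number of topics, vocabulary size
alpha = [0.5] * K     # Dirichlet prior over per-document topic proportions

# beta[j][i] = p(w = i | z = j): each row is a distribution over words
# (uniform here just to keep the sketch simple).
beta = [[1.0 / V] * V for _ in range(K)]

def sample_dirichlet(a):
    # Draw theta ~ Dirichlet(a) via normalized Gamma draws.
    g = [random.gammavariate(ai, 1.0) for ai in a]
    s = sum(g)
    return [x / s for x in g]

def sample_discrete(probs):
    # Draw an index from a discrete distribution.
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_document(n_words):
    theta = sample_dirichlet(alpha)      # per-document mixing proportion
    doc = []
    for _ in range(n_words):
        z = sample_discrete(theta)       # topic for this word slot
        w = sample_discrete(beta[z])     # word from the topic's distribution
        doc.append(w)
    return doc

corpus = [generate_document(10) for _ in range(5)]   # M = 5 toy documents
print(corpus[0])
```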
The probability of a word token w taking on value i, given that topic z = j was chosen, is parameterized by a matrix β with β_ij = p(w = i | z = j). Integrating out the θ_d's and the z_dn's, the probability p(D | α, β) of the corpus is thus:

p(D | α, β) = ∏_{d=1}^{M} ∫ p(θ_d | α) ( ∏_{n=1}^{N_d} Σ_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β) ) dθ_d
Unfortunately, it is intractable to directly solve the posterior distribution of the hidden variables given a document, namely p(θ, z | w, α, β).
However, (Blei et al., 2003) has shown that by introducing a set of variational parameters, γ and φ, a tight lower bound on the log likelihood can be found using the following optimization procedure:

(γ*, φ*) = argmin_{γ,φ} KL( q(θ, z | γ, φ) || p(θ, z | w, α, β) )
γ is the Dirichlet parameter for θ, and the multinomial parameters (φ_1, …, φ_N) are the free variational parameters. Note that γ is document-specific, unlike the corpus-specific α. Graphically, the variational distribution is represented in Figure 2.
The optimizing values of γ and φ can be found by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior.

Figure 2: Graphical Model for Variational Inference
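A minimal sketch of the mean-field updates from Blei et al. (2003) for a single document: φ_nj ∝ β_{j,w_n} exp(Ψ(γ_j)) and γ_j = α_j + Σ_n φ_nj. The toy β, α and document are assumptions, and the digamma function Ψ is approximated numerically to keep the sketch dependency-free (in practice one would use scipy.special.psi).

```python
import math

def digamma(x, h=1e-5):
    # Numerical digamma via a central difference of log-gamma;
    # adequate for a sketch, not for production inference.
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def variational_inference(doc, alpha, beta, iters=50):
    """Mean-field updates for one document.
    doc: list of word ids; alpha: K Dirichlet priors;
    beta[j][i] = p(w = i | z = j)."""
    K, N = len(alpha), len(doc)
    gamma = [alpha[j] + N / K for j in range(K)]        # standard init
    phi = [[1.0 / K] * K for _ in range(N)]
    for _ in range(iters):
        for n, w in enumerate(doc):
            # phi_nj proportional to beta_{j,w} * exp(digamma(gamma_j))
            raw = [beta[j][w] * math.exp(digamma(gamma[j])) for j in range(K)]
            s = sum(raw)
            phi[n] = [r / s for r in raw]
        gamma = [alpha[j] + sum(phi[n][j] for n in range(N)) for j in range(K)]
    return gamma, phi

# Toy example: 2 topics over vocabulary {0, 1}; topic 0 prefers word 0.
beta = [[0.9, 0.1], [0.1, 0.9]]
gamma, phi = variational_inference([0, 0, 0, 1], [0.5, 0.5], beta)
print(gamma)   # gamma[0] > gamma[1]: the document leans toward topic 0
```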
3.2 Baseline Features

For both the lexical sample and all-words tasks, we use the following standard baseline features for comparison.
POS Tags
For each training or testing word w, we include the POS tags of P words prior to as well as after w, within the same sentence boundary. We also include the POS tag of w itself. If there are fewer than P words prior to or after w in the same sentence, we denote the corresponding features as NIL.
Local Collocations
A collocation C_{i,j} refers to the ordered sequence of tokens (words or punctuation) surrounding w. The starting and ending positions of the sequence are denoted i and j respectively, where a negative value refers to a token position prior to w. We adopt the same 11 collocation features as (Lee and Ng, 2002), namely C_{-1,-1}, C_{1,1}, C_{-2,-2}, …, and C_{1,3}.
Bag-of-Words
For each training or testing word w, we collect G words prior to as well as after w, within the same document. These features are position-insensitive. The words we extract are converted back to their morphological root forms.
Syntactic Relations
We adopt the same syntactic relations as (Lee and Ng, 2002). For easy reference, we summarize the features in Table 1.
Table 1: Syntactic relations features (e.g., the parent headword h, the POS of h, and the relative position of h to w).
The exact values of P and G for each task are set according to cross-validation results.
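A hedged sketch of the baseline feature extraction described above: a POS window of size P, a subset of the local collocations, and a position-insensitive bag-of-words window of size G. The tagged sentence, window sizes and collocation subset are invented for illustration; the paper's full collocation set follows Lee and Ng (2002).

```python
def extract_features(tokens, pos_tags, idx, P=3, G=3):
    """Extract baseline features for the target word tokens[idx]."""
    n = len(tokens)
    feats = {}
    # POS tags of P words before/after the target (NIL if out of range).
    for off in range(-P, P + 1):
        j = idx + off
        feats[f"pos_{off}"] = pos_tags[j] if 0 <= j < n else "NIL"
    # Local collocations C_{i,j}: ordered token sequences around the target
    # (only a subset of the 11 features is shown here).
    for (i, j) in [(-1, -1), (1, 1), (-2, -2), (1, 3)]:
        seq = [tokens[idx + k] if 0 <= idx + k < n else "NIL"
               for k in range(i, j + 1) if k != 0]
        feats[f"C_{i}_{j}"] = "_".join(seq)
    # Bag-of-words: G words on each side, position-insensitive.
    bag = [tokens[j] for j in range(max(0, idx - G), min(n, idx + G + 1))
           if j != idx]
    feats["bag"] = sorted(bag)
    return feats

toks = ["he", "sat", "by", "the", "bank", "of", "the", "river"]
tags = ["PRP", "VBD", "IN", "DT", "NN", "IN", "DT", "NN"]
print(extract_features(toks, tags, 4))
```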
3.3 Topic Features

We first select an unlabeled corpus, such as 20 Newsgroups, and extract the individual words from it (excluding stopwords). We choose the number of topics, K, for the unlabeled corpus and apply the LDA algorithm to obtain the β parameters, where β represents the probability of a word w_i given a topic z_j, p(w_i | z_j) = β_ij.
The model essentially clusters the words that occur in the unlabeled corpus into K topics. The conditional probability p(w_i | z_j) = β_ij is later used to tag the words in unseen test examples with the probability of each topic.
For some variants of the classifiers that we construct, we also use the γ parameter, which is document-specific. For these classifiers, we may need to run the inference algorithm on the labeled corpus and possibly on the test documents.
The γ parameter provides an approximation to the probability of selecting topic i in the document:

p(z_i | γ) ≈ γ_i / Σ_k γ_k    [1]
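Equation [1] is a simple normalization of the γ vector; a small worked example with toy γ values:

```python
def topic_prior(gamma):
    # p(z_i | gamma) ≈ gamma_i / sum_k gamma_k   (equation [1])
    total = sum(gamma)
    return [g / total for g in gamma]

# Toy document-specific gamma from LDA inference (invented values).
print(topic_prior([2.0, 1.0, 1.0]))   # [0.5, 0.25, 0.25]
```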
4 Classifier Construction
We construct a variant of the naive Bayes network, shown in Figure 3. Here, w refers to the word and s refers to the sense of the word. In training, s is observed, while in testing it is not.
The features f_1 to f_n are the baseline features described in Section 3.2 (including bag-of-words), while z refers to the latent topic that we set for clustering the unlabeled corpus.
The bag-of-words b are extracted from the neighbours of w, and there are L of them. Note that L can be different from G, the number of bag-of-words in the baseline features. Both are determined by validation results.
Figure 3: Graphical Model with LDA feature
The log-likelihood of an instance under this network is

log p(s, w, f_1, …, f_n, b_1, …, b_L) = log p(w) + log p(s | w) + Σ_{i=1}^{n} log p(f_i | s) + Σ_{j=1}^{L} log Σ_z p(z | s) p(b_j | z)

The log p(w) term is constant and thus can be ignored. The first portion is normal naive Bayes, while the second portion represents the additional LDA plate.
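Under this decomposition, classification scores each candidate sense with the naive Bayes part plus the LDA plate. A minimal sketch, where every probability table (senses s1/s2, topics t1/t2, the feature values and the bag-of-words) is a toy assumption:

```python
import math

def score(sense, feats, bags, p_s_w, p_f_s, p_z_s, p_b_z):
    """log p(s|w) + sum_i log p(f_i|s) + sum_j log sum_z p(z|s) p(b_j|z)."""
    lp = math.log(p_s_w[sense])
    for f in feats:
        lp += math.log(p_f_s[sense][f])
    for b in bags:
        lp += math.log(sum(p_z_s[sense][z] * p_b_z[z][b]
                           for z in p_z_s[sense]))
    return lp

# Toy tables for one ambiguous word with senses s1/s2 and topics t1/t2.
p_s_w = {"s1": 0.5, "s2": 0.5}
p_f_s = {"s1": {"NN": 0.7, "VB": 0.3}, "s2": {"NN": 0.3, "VB": 0.7}}
p_z_s = {"s1": {"t1": 0.9, "t2": 0.1}, "s2": {"t1": 0.1, "t2": 0.9}}
p_b_z = {"t1": {"money": 0.8, "river": 0.2},
         "t2": {"money": 0.2, "river": 0.8}}

scores = {s: score(s, ["NN"], ["money"], p_s_w, p_f_s, p_z_s, p_b_z)
          for s in ["s1", "s2"]}
best = max(scores, key=scores.get)
print(best)   # s1: its topic distribution favors t1, which generates "money"
```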
We decouple the training process into three separate stages. We first extract baseline features from the task training data and estimate, using normal naive Bayes, p(s | w) and p(f | s) for all w, s and f. The parameters associated with p(b | z) are estimated using LDA on unlabeled data. Finally, we estimate the parameters associated with p(z | s).
We experimented with three different ways of doing the estimation and of using the resulting model, and chose the one that performed best empirically.
4.1.1 Expectation Maximization Approach

For p(z | s), a reasonable estimation method is maximum likelihood estimation. This can be done using the expectation maximization (EM) algorithm.
In classification, we simply choose the s* that maximizes the log-likelihood of the test instance:

s* = argmax_s [ log p(s | w) + Σ_i log p(f_i | s) + Σ_{j=1}^{L} log Σ_z p(z | s) p(b_j | z) ]

In this approach, γ is never used, which means the LDA inference procedure is not used on any labeled data at all.
4.1.2 Soft Tagging Approach

Classification in this approach is done using the full Bayesian network, just as in the EM approach. However, we estimate p(z | s) differently.
Essentially, we perform LDA inference on the training corpus to obtain γ for each document. We then use the γ and β to obtain p(z | b) for each word using

p(z_i | b) = p(b | z_i) p(z_i | γ) / Σ_k p(b | z_k) p(z_k | γ)

where equation [1] is used for the estimation of p(z_i | γ).
This effectively transforms b into a topical distribution, which we call a soft tag, where each soft tag is a probability distribution t_1, …, t_K over the topics. We then use this topical distribution to estimate p(z | s).
Let s^i be the observed sense of instance i and t^{ij}_1, …, t^{ij}_K be the soft tag of the j-th bag-of-word feature of instance i. We estimate p(z | s) as

p(z = k | s) ∝ Σ_{i : s^i = s} Σ_j t^{ij}_k    [2]

normalized over the K topics, with smoothing.
This approach requires us to do LDA inference on the corpus formed by the labeled training data, but not on the testing data. This is because we need γ to obtain the transformed topical distribution in order to learn p(z | s) during training. In testing, we only apply the learnt parameters to the model.
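A hedged sketch of the soft-tagging computation: each bag-of-word b is transformed into p(z | b) from β and the γ-based prior of equation [1], and the soft tags are then pooled per sense to estimate p(z | s) as in equation [2]. The vocabulary, β, γ and the smoothing constant δ are toy assumptions (the paper uses δ = 2 for equation [2], per Section 5).

```python
def soft_tag(word, gamma, beta):
    """p(z_i | b) ∝ p(b | z_i) p(z_i | gamma), with p(z_i|gamma)=gamma_i/sum(gamma)."""
    total = sum(gamma)
    raw = [beta[j][word] * (gamma[j] / total) for j in range(len(gamma))]
    s = sum(raw)
    return [r / s for r in raw]

def estimate_p_z_s(instances, delta=2.0):
    """instances: list of (sense, [soft tags]); returns smoothed p(z | s)."""
    K = len(instances[0][1][0])
    acc = {}
    for sense, tags in instances:
        row = acc.setdefault(sense, [0.0] * K)
        for t in tags:
            for k in range(K):
                row[k] += t[k]
    return {s: [(row[k] + delta) / (sum(row) + K * delta) for k in range(K)]
            for s, row in acc.items()}

# Toy setup: 2 topics over a 2-word vocabulary {0: "money", 1: "river"}.
beta = [[0.8, 0.2], [0.2, 0.8]]
gamma = [3.0, 1.0]                 # the document leans toward topic 0
t = soft_tag(0, gamma, beta)       # soft tag for the word "money"
p_z_s = estimate_p_z_s([("finance", [t])])
print(t, p_z_s["finance"])
```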
4.1.3 Hard Tagging Approach

The hard tagging approach no longer assumes that z is latent. After p(z | b) is obtained using the same procedure as in Section 4.1.2, the topic z_i with the highest p(z_i | b) among all K topics is picked to represent z.
In this way, b is transformed into its single most "prominent" topic. This topic label is used in the same way as the baseline features, for both training and testing, in a simple naive Bayes model. This approach requires us to perform the transformation on both the training and the testing data, since z becomes an observed variable.
LDA inference is done on two corpora, one formed by the training data and the other by the testing data, in order to get the respective values of γ.
4.2 Support Vector Machine Approach

In the SVM (Vapnik, 1995) approach, we first form a training and a testing file using all standard features for each sense, following (Lee and Ng, 2002) (one classifier per sense).
To incorporate the LDA feature, we use the same approach as in Section 4.1.2 to transform b into soft tags, p(z | b). As SVM deals only with observed features, we need to transform b both in the training data and in the testing data.
Compared to (Lee and Ng, 2002), the only difference is that for each training and testing case, we have L × K additional LDA features, since there are L bag-of-words and each has a topic distribution represented by K values.
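The SVM variant thus simply appends the L soft tags, K real values each, to the usual feature vector. A small sketch; the concrete feature layout and the toy values are assumptions:

```python
def svm_feature_vector(baseline_feats, soft_tags):
    """Concatenate baseline features with L soft tags of K values each,
    giving len(baseline_feats) + L*K features in total."""
    vec = list(baseline_feats)
    for tag in soft_tags:          # L bag-of-words, each a K-dim distribution
        vec.extend(tag)
    return vec

baseline = [1.0, 0.0, 1.0]                     # toy binary baseline features
tags = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]      # L = 2 soft tags, K = 3
vec = svm_feature_vector(baseline, tags)
print(len(vec))   # 3 + 2*3 = 9
```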
5 Experimental Setup

We describe here the experimental setup for the English lexical sample task and the English all-words task.
We use the MXPOST tagger (Adwait, 1996) for POS tagging, the Charniak parser (Charniak, 2000) for extracting syntactic relations, SVMlight for the SVM classifier, and David Blei's version of LDA for LDA training and inference. All default parameters are used unless mentioned otherwise.
For all standard baseline features, we use Laplace smoothing, but for the soft tag (equation [2]) we use a smoothing parameter value of 2.
We use the Senseval-2 lexical sample task for preliminary investigation of different algorithms, datasets and other parameters. As that dataset is used extensively for this purpose, only the Senseval-3 lexical sample task is used for evaluation.
Selecting the Bayesian Network
The best achievable results using the three different Bayesian network approaches, validated on the Senseval-2 test data, are shown in Table 2.
The parameters used are P = 3 and G = 3.
Table 2: Results on the Senseval-2 English lexical sample task using the different Bayesian network approaches (EM, Hard Tagging, Soft Tagging).
From the results, it appears that neither the EM approach nor the Hard Tagging approach yields results as good as the Soft Tagging approach.
The EM approach ignores the LDA inference result γ, which we use to obtain our topic prior. This information is document-specific and can be regarded as global context information.
The Hard Tagging approach also uses less information, as the original topic distribution is represented only by the topic with the highest probability of occurring. Therefore, both methods suffer information loss and are disadvantaged against the Soft Tagging approach.
We use the Soft Tagging approach for the Senseval-3 lexical sample and all-words tasks.
Unlabeled Corpus Selection
The unlabeled corpora we choose for training LDA include 20 Newsgroups, Reuters, SemCor, the Senseval-2 lexical sample data and the Senseval-3 lexical sample data.
Although the last three are labeled corpora, we only need the words from them, so they can be regarded as unlabeled too. For the Senseval-2 and Senseval-3 data, we define the whole passage of each training and testing instance as one document.
The relative effect of using different corpora, and combinations of them, is shown in Table 3, validated on the Senseval-2 test data using the Soft Tagging approach.
Table 3: Effect of using different corpora (including 20 Newsgroups) for LDA training; |w| represents the corpus size in terms of the number of words in the corpus.
The 20 Newsgroups corpus yields the best result when used individually. It has a relatively large corpus size, at 1.7 million words in total, and a well-balanced topic distribution among its documents, ranging across politics, finance, science, computing, etc. The Reuters corpus, on the other hand, focuses heavily on finance-related articles and has a rather skewed topic distribution. This probably contributed to its inferior result.
However, we found that the best result comes from combining all the corpora, with K = 60 and L = 40.
Results for the Optimized Configuration
As the baseline for the Bayesian network approaches, we use naive Bayes with all baseline features. For the baseline SVM approach, we choose P = 3 and include all the words occurring in the training and testing passages as bag-of-words features.
The F-measure results we achieve on the Senseval-2 test data are shown in Table 4. Our four systems are listed as the top four entries in the table. Soft Tag refers to the soft tagging Bayesian network approach.
Note that we used the Senseval-2 test data for optimizing the configuration (as is done in the ASO result). Hence, the result should not be taken as reliable. Nevertheless, it is worth noting that the improvement of the Bayesian network approach over its baseline is very significant (+5.5%).
On the other hand, SVM with topic features shows limited improvement over its baseline (+0.8%).
Table 4: Results (best configuration) compared to previous best systems on the Senseval-2 English lexical sample task, including SVM-Topic, Classifier Combination (Florian, 2002), and the Senseval-2 best system.
In the all-words task, no official training data is provided with Senseval. We follow the common practice of using the SemCor corpus as our training data. However, we did not use the SVM approach in this task, as there are too few training instances per sense for SVM to achieve reasonably good accuracy.
As there are more training instances in SemCor, 230,000 in total, we obtain the optimal configuration using 10-fold cross-validation on the SemCor training data. With the optimal configuration, we test our system on both the Senseval-2 and Senseval-3 official test data. For the baseline features, we set P = 3 and G = 1.
We choose an LDA training corpus comprising the 20 Newsgroups and SemCor data, with the number of topics K = 40 and the number of LDA bag-of-words L = 14.
6 Results

We now present the results on both the English lexical sample task and the all-words task.
With the optimal configurations from Senseval-2, we tested the systems on the Senseval-3 data. Table 5 shows our F-measure results compared to some of the best reported systems.
Although SVM with topic features shows limited success, with only a 0.6% improvement, the Bayesian network approach has again demonstrated a good improvement of 3.8% over its baseline, and it is better than the previously reported best systems except ASO (Ando, 2006).
Table 5: Results compared to previous best systems on the Senseval-3 English lexical sample task.
We use a χ²-test to verify the significance of these results; the outcome is reported in Table 8. The results are significant at the 90% confidence level, except for the Senseval-3 all-words task.
Table 8: p-values of the χ²-test significance levels of the results (Senseval-2 and Senseval-3; all-words and lexical sample).
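As a hedged illustration of this kind of significance test, the sketch below computes a χ² statistic and p-value for a 2×2 contingency table of correct/incorrect counts from two classifiers. The counts are invented for illustration, and the closed-form p-value uses the fact that a χ² variable with one degree of freedom is the square of a standard normal.

```python
import math

def chi2_2x2(a, b, c, d):
    """Chi-square statistic and p-value (df = 1) for a 2x2 table
    [[a, b], [c, d]], e.g. correct/incorrect counts of two systems."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of chi2 with 1 df: P(X > x) = erfc(sqrt(x/2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Toy counts: system A gets 70/100 correct, system B gets 55/100.
stat, p = chi2_2x2(70, 30, 55, 45)
print(round(stat, 3), round(p, 4))
```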
The micro-averaged F-measure results for our systems, as well as previous best systems, on the Senseval-2 and Senseval-3 all-words tasks are shown in Table 6 and Table 7 respectively. The Bayesian network with soft tagging achieved a 2.6% improvement over its baseline on Senseval-2 and a 1.7% improvement on Senseval-3.
The results also rival some previous best systems, except for SMUaw (Mihalcea, 2002), which used additional labeled data.
Table 6: Results (including Bayes (Soft Tag) and the NB baseline) compared to previous best systems on the Senseval-2 English all-words task.
Table 7: Results compared to previous best systems (the Senseval-3 best system and the Senseval-3 2nd best system, SenseLearner) on the Senseval-3 English all-words task.
6.3 Significance of Results

We perform the χ²-test, pairing the Bayesian network with its naive Bayes baseline (NB baseline), to verify the significance of the improvements.
The results on the lexical sample task show that SVM benefits less from the topic feature than the Bayesian approach does. One possible reason is that the SVM baseline is able to use all bag-of-words from the surrounding context, while the naive Bayes baseline can only use very few without decreasing its accuracy, due to the sparse representation.
In this sense, the SVM baseline already captures some of the topical information, leaving smaller room for improvement. In fact, if we exclude the bag-of-words features from the SVM baseline and add in the topic features, we achieve almost the same accuracy as with both features included, as shown in Table 9.
This further shows that the topic feature is a better representation of the global context than the bag-of-words feature.
Table 9: Results on the Senseval-3 English lexical sample task (SVM baseline vs. SVM-topic).
6.5 Results on Different Parts-of-Speech
We analyse the results obtained on the Senseval-3 English lexical sample task (using the Senseval-2 optimal configuration) according to the test instance's part-of-speech (noun, verb or adjective), compared to the naive Bayes baseline. Table 10 shows the relative improvement for each part-of-speech.
The second column shows the number of testing instances belonging to the particular part-of-speech. The third and fourth columns show the accuracy achieved by the naive Bayes baseline and the Bayesian network, respectively.

Figure 4: Accuracy with varying L and K on the all-words task
We use the Senseval-2 and Senseval-3 all-words data as our validation set to fine-tune the parameters; for the lexical sample task, we use the provided training data as the validation set.
We achieved 88.7%, 81.6% and 57.6% for the coarse-grained lexical sample task, the coarse-grained all-words task and the fine-grained all-words task respectively. These results ranked first, second and fourth in the three tasks respectively.
7 Conclusion and Future Work
In this paper, we showed that by applying the LDA algorithm to the bag-of-words features, one can utilise more topical information and boost the classifier's accuracy on both the English lexical sample and all-words tasks. Only unlabeled data is needed for this improvement. It would be interesting to see how the feature can help WSD in other languages, as well as other natural language processing tasks such as named-entity recognition.
Adjectives show no improvement, while verbs show a moderate +2.2% improvement. Nouns clearly benefit from topical information much more than the other two parts-of-speech, obtaining a +5.7% increase over the baseline.
Table 10: Improvement (over the NB baseline) with different POS on the Senseval-3 lexical sample task.
We tested on the Senseval-2 all-words task using different values of L and K; Figure 4 shows the result.
We participated in the SemEval-1 English coarse-grained all-words task (task 7), the English fine-grained all-words task (task 17, subtask 3) and the English coarse-grained lexical sample task (task 17, subtask 1), using the method described in this paper. For the all-words tasks, we use the Senseval-2 and Senseval-3 all-words data as our validation set to fine-tune the parameters.