It
has
been
widely
observed
that
different
NLP
applications
require
different
sense
granularities
in
order
to
best
exploit
word
sense
distinctions
,
and
that
for
many
applications
WordNet
senses
are
too
fine-grained
.
In
contrast
to
previously
proposed
automatic
methods
for
sense
clustering
,
we
formulate
sense
merging
as
a
supervised
learning
problem
,
exploiting
human-labeled
sense
clusterings
as
training
data
.
We
train
a
discriminative
classifier
over
a
wide
variety
of
features
derived
from
WordNet
structure
,
corpus-based
evidence
,
and
evidence
from
other
lexical
resources
.
Our
learned
similarity
measure
outperforms
previously
proposed
automatic
methods
for
sense
clustering
on
the
task
of
predicting
human
sense
merging
judgments
,
yielding
an
absolute
F-score
improvement
of
4.1
%
on
nouns
,
13.6
%
on
verbs
,
and
4.0
%
on
adjectives
.
Finally
,
we
propose
a
model
for
clustering
sense
taxonomies
using
the
outputs
of
our
classifier
,
and
we
make
available
several
automatically
sense-clustered
WordNets
of various
sense
granularities
.
1
Introduction
Defining
a
discrete
inventory
of
senses
for
a
word
is
extremely
difficult
(
Kilgarriff
,
1997
;
Hanks
,
2000
;
Palmer
et
al.
,
2005
)
.
Perhaps
the
greatest
obstacle
is
the
dynamic
nature
of
sense
definition
:
the
correct
granularity
for
word
senses
depends
on
the
application
.
For
language
learners
,
a
fine-grained
set
of
word
senses
may
help
in
learning
subtle
distinctions
,
while
coarsely-defined
senses
are
probably
more
useful
in
NLP
tasks
like
information
retrieval
(
Gonzalo
et
al.
,
1998
)
,
query
expansion
(Moldovan and Mihalcea, 1999; Palmer et al., 2005)
.
Lexical
resources
such
as
WordNet
(
Fellbaum
,
1998
)
use
extremely
fine-grained
notions
of
word
sense
,
which
carefully
capture
even
minor
distinctions
between
different
possible
word
senses
(
e.g.
,
the
8
noun
senses
of
bass
shown
in
Figure
1
)
.
Producing
sense-clustered
inventories
of
arbitrary
sense
granularity
is
thus
crucial
for
tasks
which
depend
on
lexical
resources
like
WordNet
,
and
is
also
important
for
the
task
of
automatically
constructing
new
WordNet-like
taxonomies
.
A
solution
to
this
problem
must
also
deal
with
the
constraints
of
the
WordNet
taxonomy
itself
;
for
example
when
clustering
two
senses
,
we
need
to
consider
the
transitive
effects
of
merging
synsets
.
The
state
of
the
art
in
sense
clustering
is
insufficient
to
meet
these
needs
.
Current
sense
clustering
algorithms
are
generally
unsupervised
,
each
relying
on
a
different
set
of
useful
features
or
hand-built
rules
.
But
hand-written
rules
have
little
flexibility
to
produce
clusterings
of
different
granularities
,
and
previously
proposed
methods
offer
little
in
the
direction
of
intelligently
combining
and
weighting
the
many
proposed
features
.
In
response
to
these
challenges
,
we
propose
a
new
algorithm
for
clustering
large-scale
sense
hierarchies
like
WordNet
.
Our
algorithm
is
based
on
a
supervised
classifier
that
learns
to
make
graduated
judgments
corresponding
to
the
estimated
probability
that
each
particular
sense
pair
should
be
merged
.
This
classifier
is
trained
on
gold
standard
sense
clustering
judgments
using
a
diverse
feature
space
.
We
are
able
to
use
the
outputs
of
our
classifier
to
produce
a
ranked
list
of
sense
merge
judgments
by
merge
probability
,
and
from
this
create
sense-clustered
inventories
of
arbitrary
sense
granularity.1
1We have made sense-clustered WordNets using the algorithms discussed in this paper available for download at http://ai.stanford.edu/~rion/swn.

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1005-1014, Prague, June 2007. © 2007 Association for Computational Linguistics

1: the lowest part of the musical range; 2: the lowest part in polyphonic music
3: an adult male singer with the lowest voice; 6: the lowest adult male singing voice
4: the lean flesh of a saltwater fish of the family Serranidae; 5: any of various North American freshwater fish with lean flesh; 8: nontechnical name for any of numerous ... fishes
(INSTRUMENT) 7: ... the lowest range of a family of musical instruments

Figure 1: Sense clusters for the noun bass; the eight WordNet senses as clustered into four groups in the Senseval-2 coarse-grained evaluation data

In Section 2 we discuss past work in sense clustering,
and
the
gold
standard
datasets
that
we
use
in
our
work
.
In
Section
3
we
introduce
our
battery
of
features
;
in
Section
4
we
show
how
to
extend
our
sense-merging
model
to
cluster
full
taxonomies
like
WordNet
.
In
Section
5
we
evaluate
our
classifier
against
thirteen
previously
proposed
methods
.
2
Background
A
wide
number
of
manual
and
automatic
techniques
have
been
proposed
for
clustering
sense
inventories
and
mapping
between
sense
inventories
of
different
granularities
.
Much
work
has
gone
into
methods
for
measuring
synset
similarity
;
early
work
in
this
direction
includes
(
Dolan
,
1994
)
,
which
attempted
to
discover
sense
similarities
between
dictionary
senses
.
A
variety
of
synset
similarity
measures
based
on
properties
of
WordNet
itself
have
been
proposed
;
nine
such
measures
are
discussed
in
(
Pedersen
et
al.
,
2004
)
,
including
gloss-based
heuristics
(
Lesk
,
1986
;
Banerjee
and
Pedersen
,
2003
)
,
information-content
based
measures
(
Resnik
,
1995
;
Lin
,
1998
;
Jiang
and
Conrath
,
1997
)
,
and
others
.
Other
approaches
have
used
specific
cues
from
WordNet
structure
to
inform
the
construction
of
semantic
rules
;
for
example
,
(
Peters
et
al.
,
1998
)
suggest
clustering
two
senses
based
on
a
wide
variety
of
structural
cues
from
WordNet
,
including
if
they
are
twins
(
if
two
synsets
share
more
than
one
word
in
their
synonym
list
)
or
if
they
represent
an
example
of
autohyponymy
(
if
one
sense
is
the
direct
descendant
of
the
other
)
.
(
Mihalcea
and
Moldovan
,
2001
)
implements
six
semantic
rules
,
using
twin
and
autohyponym
features
,
in
addition
to
other
WordNet-structure-based
rules
such
as
whether
two
synsets
share
a pertainym or antonym, or
are
clustered
together
in
the
same
verb
group
.
A
large
body
of
work
has
attempted
to
capture
corpus-based
estimates
of
word
similarity
(
Pereira
et
al.
,
1993
;
Lin
,
1998
)
;
however
,
the
lack
of
large
sense-tagged
corpora
prevents
most
such
techniques
from
being
used
effectively
to
compare
different
senses
of
the
same
word
.
Some
corpus-based
attempts
that
are
capable
of
estimating
similarity
between
word
senses
include
the
topic
signatures
method
;
here
,
(
Agirre
and
Lopez
,
2003
)
collect
contexts
for
a
polysemous
word
based
either
on
sense-tagged
corpora
or
by
using
a
weighted
agglomeration
of
contexts
of
a
polysemous
word
's
monosemous
relatives
(
i.e.
,
single-sense
synsets
related
by
hypernym
,
hyponym
,
or
other
relations
)
from
some
large
untagged
corpus
.
Other
corpus-based
techniques
developed
specifically
for
sense
clustering
include
(
McCarthy
,
2006
)
,
which
uses
a
combination
of
word-to-word
distributional
similarity
combined
with
the
JCN
WordNet-based
similarity
measure
,
and
work
by
(
Chugur
et
al.
,
2002
)
in
finding
co-occurrences
of
senses
within
documents
in
sense-tagged
corpora
.
Other
attempts
have
exploited
disagreements
between
WSD
systems
(
Agirre
and
Lopez
,
2003
)
or
between
human
labelers
(
Chklovski
and
Mihalcea
,
2003
)
to
create
synset
similarity
measures
;
while
promising
,
these
techniques
are
severely
limited
by
the
performance
of
the
WSD
systems
or
the
amount
of
available
labeled
data
.
Some
approaches
for
clustering
have
made
use
of
regular
patterns
of
polysemy
among
words
.
(
Peters
et
al.
,
1998
)
uses
the
Cousin
relation
defined
in
WordNet
1.5
to
cluster
hyponyms
of categorically
related
noun
synsets
,
e.g.
,
"
container
/
quantity
"
(
e.g.
,
for
clustering
senses
of
"
cup
"
or
"
barrel
"
)
or
"
organization
/
construction
"
(
e.g.
,
for
the
building
and
institution
senses
of
"
hospital
"
or
"
school
"
)
;
other
approaches
based
on
systematic
polysemy
include
the
hand-constructed
CORELEX
database
(
Buitelaar
,
1998
)
,
and
automatic
attempts
to
extract
patterns
of
systematic
polysemy
based
on
minimal
description
length
principles
(
Tomuro
,
2001
)
.
Another
family
of
approaches
has
been
to
use
either
manually-annotated
or
automatically-constructed
mappings
to
coarser-grained
sense
inventories
;
an
attempt
at
providing
coarse-grained
sense
distinctions
for
the
Senseval-1
exercise
included
a
mapping
between
WordNet
and
the
Hector
lexicon
(
Palmer
et
al.
,
2005
)
.
Other
attempts
in
this
vein
include
mappings
between
WordNet
and
PropBank
(
Palmer
et
al.
,
2004
)
and
mappings
to
Levin
classes
(
Levin
,
1993
;
Palmer
et
al.
,
2005
)
.
(
Navigli
,
2006
)
presents
an
automatic
approach
for
mapping
between
sense
inventories
;
here
similarities
in
gloss
definition
and
structured
relations
between
the
two
sense
inventories
are
exploited
in
order
to
map
between
WordNet
senses
and
distinctions
made
within
the
coarser-grained
Oxford
English
Dictionary
.
Other
work
has
attempted
to
exploit
translational
equivalences
of
WordNet
senses
in
other
languages
,
for
example
using
foreign
language
WordNet
interlingual
indexes
(
Gonzalo
et
al.
,
1998
;
Chugur
et
al.
,
2002
)
.
2.1
Gold
standard
sense
clustering
data
Our
approach
for
learning
how
to
merge
senses
relies
upon
the
availability
of
labeled
judgments
of
sense
relatedness
.
In
this
work
we
focus
on
two
datasets
of
hand-labeled
sense
groupings
for
WordNet
:
first
,
a
dataset
of
sense
groupings
over
nouns
,
verbs
,
and
adjectives
provided
as
part
of
the
Senseval-2
English
lexical
sample
WSD
task
(
Kilgarriff
,
2001
)
,
and
second
,
a
corpus-driven
mapping
of
nouns
and
verbs
in
WordNet
2.1
to
the
Omega
Ontology
(
Philpot
et
al.
,
2005
)
,
produced
as
part
of
the
OntoNotes
project
(
Hovy
et
al.
,
2006
)
.
A
wide
variety
of
semantic
and
syntactic
criteria
were
used
to
produce
the
Senseval-2
groupings
(
Palmer
et
al.
,
2004
;
Palmer
et
al.
,
2005
)
;
this
data
covers
all
senses
of
411
nouns
,
519
verbs
,
and
257
adjectives
,
and
has
been
used
as
gold
standard
sense
clustering
data
in
previous
work
(
Agirre
and
Lopez
,
2003
;
McCarthy
,
2006
)
2
.
The
number
of
judgments
within
this
data
(
after
mapping
to
WordNet
2.1
)
is
displayed
in
Table
1
.
Due
to
a
lack
of
interannotator
agreement
data
for
this
dataset
,
(
McCarthy
,
2006
)
performed
an
annotation
study
using
three
labelers
on
a
20-noun
subset
of
the
Senseval-2
groupings
;
the
three
labelers
were
given
the
task
of
deciding
whether
the
351
potentially-related
sense
pairs
were
"
Related
"
,
"
Unrelated
"
,
or
"
Don
't
Know
"
.
3 In this task the pairwise interannotator F-scores were (0.4874, 0.5454, 0.7926), for an average F-score of 0.6084.

Table 1: Gold standard datasets for sense merging; only sense pairs that share a word in common are included; proportion refers to the fraction of synsets sharing a word that have been merged. (Columns: Total Pairs, Merged Pairs, Proportion; rows for the Senseval-2 and OntoNotes datasets.)

Table 2: Agreement data for gold standard datasets. (ON-True/ON-False counts for nouns and verbs.)

The OntoNotes dataset4 covers a smaller set of nouns and verbs, but it has been created with a more rigorous corpus-based iterative annotation process. For each of the nouns and verbs in question, a 50-sentence sample of instances is annotated using a preliminary set of sense distinctions; if the word sense interannotator agreement for the sample is less than 90%, then the sense distinctions are revised and the sample is re-annotated, and so forth, until an interannotator agreement of at least 90% is reached.

We construct a combined gold standard set from these Senseval-2 and OntoNotes groupings, removing disagreements. The overlap and agreement/disagreement data between the two groupings is given in Table 2; here, for example, the column with ON-True and S-F indicates the count of senses that OntoNotes judged as positive examples of sense merging, but that Senseval-2 data did not merge. We also calculate the F-score achieved by considering only one of the datasets as a gold standard, and computing precision and recall for the other. Since the two datasets were created independently, with different annotation guidelines, we cannot consider this as a valid estimate of interannotator agreement; nonetheless the F-score for the two datasets on the overlapping set of sense judgments (50.6% for nouns and 50.7% for verbs) is roughly in the same range as those observed in (McCarthy, 2006).

2In order to facilitate future work in this area, we have made cleaned versions of these groupings available at http://ai.stanford.edu/~rion/swn along with a "diff" with the original files.

3McCarthy's gold standard data is available at ftp://ftp.informatics.susx.ac.uk/pub/users/dianam/relateGS/.

4The OntoNotes groupings will be available through the LDC at http://www.ldc.upenn.edu.
3
Learning
to
merge
word
senses
3.1
WordNet-based
features
Here
we
describe
the
feature
space
we
construct
for
classifying
whether
or
not
a
pair
of
synsets
should
be
merged
;
first
,
we
employ
a
wide
variety
of
linguistic
features
based
on
information
derived
from
WordNet
.
We
use
eight
similarity
measures
implemented
within
the
WordNet::Similarity package5
,
described
in
(
Pedersen
et
al.
,
2004
)
;
these
include
three
measures
derived
from
the
paths
between
the
synsets
in
WordNet
:
hso
(
Hirst
and
St-Onge
,
1998
)
,
lch
(
Leacock
and
Chodorow
,
1998
)
,
and
wup
(
Wu
and
Palmer
,
1994
)
;
three
measures
based
on
information
content
:
res
(
Resnik
,
1995
)
,
lin
(
Lin
,
1998
)
,
and
JCN
(
Jiang
and
Conrath
,
1997
)
;
the
gloss-based
Extended
Lesk
Measure
lesk
,
(
Banerjee
and
Pedersen
,
2003
)
,
and
finally
the
gloss
vector
similarity
measure
vector
(
Patwardhan
,
2003
)
.
We
implement
the
twin
feature
(
Peters
et
al.
,
1998
)
,
which
counts
the
number
of
shared
synonyms
between
the
two
synsets
.
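The twin feature itself reduces to a set intersection; a minimal sketch, modeling synsets as plain sets of lemma strings (an illustrative stand-in for the WordNet API, not its actual interface):

```python
# Sketch of the "twin" feature (Peters et al., 1998): the number of
# synonyms (lemmas) shared by two synsets. Synsets are modeled here as
# plain sets of lemma strings rather than WordNet API objects.

def twin_count(synset_a, synset_b):
    """Count lemmas shared by two synsets."""
    return len(set(synset_a) & set(synset_b))

def is_twin_pair(synset_a, synset_b, k=2):
    """Peters et al. treat a pair as twins when it shares more than one word."""
    return twin_count(synset_a, synset_b) >= k

bass_1 = {"bass", "bass_voice", "basso"}
bass_2 = {"bass", "basso"}
print(twin_count(bass_1, bass_2))    # 2
print(is_twin_pair(bass_1, bass_2))  # True
```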
Additionally
we
produce
pair-wise
features
indicating
whether
two
senses
share
an
antonym
,
pertainym
,
or
derivationally-related
forms
(
deriv
)
.
We
also
create
the
verb-specific
features
of
whether
two
verb
synsets
are
linked
in
a
VerbGroup
(
indicating
semantic
similarity
)
or
share
a
VerbFrame
,
indicating
syntactic
similarity
.
Also
,
we
encode
a
generalized
notion
of
siblinghood
in
the
MN
features
,
recording
the
distance
of
the
synset
pair
's
nearest
least
common
subsumer
(
i.e.
,
closest
shared
hypernym
)
from
the
two
synsets
,
and
,
separately
,
the
maximum
of
those
distances
(
in
the
MaxMN
feature).
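The MN and MaxMN features can be sketched with a breadth-first walk up a hypernym map; the toy taxonomy and synset names below are illustrative, not WordNet's actual structure:

```python
from collections import deque

# Sketch of the MN / MaxMN features: distances from each synset of a pair
# to their nearest least common subsumer (closest shared hypernym).
# `hypernyms` maps each synset to its direct hypernyms (a toy taxonomy).

def ancestor_depths(synset, hypernyms):
    """BFS upward: map each ancestor (incl. the synset) to its min distance."""
    depths = {synset: 0}
    queue = deque([synset])
    while queue:
        node = queue.popleft()
        for parent in hypernyms.get(node, []):
            if parent not in depths:
                depths[parent] = depths[node] + 1
                queue.append(parent)
    return depths

def mn_features(s1, s2, hypernyms):
    """Return (dist1, dist2, max) for the closest shared hypernym, or None."""
    d1, d2 = ancestor_depths(s1, hypernyms), ancestor_depths(s2, hypernyms)
    shared = set(d1) & set(d2)
    if not shared:
        return None
    lcs = min(shared, key=lambda a: d1[a] + d2[a])
    return d1[lcs], d2[lcs], max(d1[lcs], d2[lcs])

toy = {"bass_voice": ["singing_voice"], "singing_voice": ["voice"],
       "bass_part": ["part"], "part": ["voice"]}
print(mn_features("bass_voice", "bass_part", toy))  # (2, 2, 2)
```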
Previous
attempts
at
categorizing
systematic
polysemy
patterns
within
WordNet
have
resulted
in
the
Cousin
feature6
;
we
create
binary
features
which
5We
choose
not
to
use
the
path
measure
due
to
its
negligible
difference
from
the
lch
measure
.
6This
data
is
included
in
the
WordNet
1.6
distribution
as
the
"
cousin.tops
"
file
.
indicate
whether
a
synset
pair
belongs
to
hypernym
ancestries
indicated
by
one
or
more
of
these
Cousin
features
,
and
the
specific
cousin
pair
(
s
)
involved
.
Finally
we
create
sense-specific
features
,
including
SenseCount
,
the
total
number
of
senses
associated
with
the
shared
word
between
the
two
synsets
with
the
highest
number
of
senses
,
and
SenseNum
,
the
specific
pairing
of
senses
for
the
shared
word
with
the
highest
number
of
senses
(
which
might
allow
us
to
learn
whether
the
most
frequent
sense
of
a
word
has
a
higher
chance
of
having
similar
derivative
senses
with
lower
frequency
)
.
3.2
Features
derived
from
corpora
and
other
lexical
resources
In
addition
to
WordNet-based
features
,
we
use
a
number
of
features
derived
from
corpora
and
other
lexical
resources
.
We
use
the
publicly
available
topic
signature
data7
described
in
(
Agirre
and
Lopez
,
2004
)
,
yielding
representative
contexts
for
all
nominal
synsets
from
WordNet
1.6
.
These
topic
signatures
were
obtained
by
weighting
the
contexts
of
monosemous
relatives
of
each
noun
synset
(
i.e.
,
single-sense
synsets
related
by
hypernym
,
hyponym
,
or
other
relations
)
;
the
text
for
these
contexts
were
extracted
from
snippets
using
the
Google
search
engine
.
We
then
create
a
sense
similarity
feature
by
taking
a
thresholded
cosine
similarity
between
pairs
of
topic
signatures
for
these
noun
synsets
.
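A minimal sketch of this feature, with topic signatures modeled as sparse word-weight dictionaries; the threshold value and signature contents below are illustrative assumptions (the text does not specify them):

```python
import math

# Sketch of the topic-signature similarity feature: a thresholded cosine
# between two sparse {word: weight} signatures. The 0.1 threshold is an
# illustrative choice, not the value used in the paper.

def cosine(sig_a, sig_b):
    """Cosine similarity between two sparse word-weight signatures."""
    dot = sum(w * sig_b.get(t, 0.0) for t, w in sig_a.items())
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def topsig_feature(sig_a, sig_b, threshold=0.1):
    """Thresholded cosine: similarities below the cutoff are zeroed out."""
    sim = cosine(sig_a, sig_b)
    return sim if sim >= threshold else 0.0

fish = {"water": 2.0, "fishing": 1.0}
instrument = {"music": 2.0, "string": 1.0, "water": 0.1}
print(round(topsig_feature(fish, instrument), 3))  # 0.0
```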
Additionally
,
we
use
the
WordNet
domain
dataset
described
in
(
Magnini
and
Cavaglia
,
2000
;
Bentivogli
et
al.
,
2004
)
.
This
dataset
contains
one
or
more
labels
indicating membership in one of
164
hierarchically
organized
"
domains
"
or
"
subject
fields
"
for
each
noun
,
verb
,
and
adjective
synset
in
WordNet
;
we
derive
a
set
of
binary
features
from
this
data
,
with
a
single
feature
indicating
whether
or
not
two
synsets
share
a
domain
,
and
one
indicator
feature
per
pair
of
domains
indicating
respective
membership
of the
sense
pair
within
those
domains
.
Finally
,
we
use
as
a
feature
the
mappings
produced
in
(
Navigli
,
2006
)
of
WordNet
senses
to
Oxford
English
Dictionary
senses
.
This
OED
dataset
was
used
as
the
coarse-grained
sense
inventory
in
the
Coarse-grained
English
all-words
task
of
SemEval-20078; we specify a single binary feature for each pair of synsets from this data; this feature is true if the words are clustered in the OED mapping, and false otherwise.

7The topic signature data is available for download at http://ixa.si.ehu.es/Ixa/resources/sensecorpus.
3.3
Classifier
,
training
,
and
feature
selection
For
each
part
of
speech
,
we
split
the
merged
gold
standard
data
into
a
part-of-speech-specific
training
set
(
70
%
)
and
a
held-out
test
set
(
30
%
)
.
For
every
synset
pair
we
use
the
binary
"
merged
"
or
"
not-merged
"
labels
to
train
a
support
vector
machine
classifier9
(
Joachims
,
2002
)
for
each
POS-specific
training
set
.
We
perform
feature
selection
and
regularization
parameter
optimization
using
10-fold
cross-validation
.
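The data-partitioning setup can be sketched as follows; the paper's actual classifier is SVMperf, so the helpers below show only the 70/30 split and the 10-fold index generation, into which any classifier could be plugged:

```python
import random

# Sketch of the training setup: a 70/30 split of labeled synset pairs and
# 10-fold cross-validation folds for feature selection and regularization
# parameter tuning. The seed and data here are illustrative.

def train_test_split(pairs, test_frac=0.3, seed=0):
    """Shuffle and split labeled pairs into train (70%) and held-out (30%)."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def kfold(items, k=10):
    """Yield (train, heldout) lists for k-fold cross-validation."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        heldout = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, heldout

data = list(range(20))
train, test = train_test_split(data)
print(len(train), len(test))                 # 14 6
print(sum(len(h) for _, h in kfold(data)))   # 20
```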
4
Clustering
Senses
in
WordNet
The
previous
section
describes
a
classifier
which
predicts
whether
two
synsets
should
be
merged
;
we
would
like
to
use
the
pairwise
judgments
of
this
classifier
to
cluster
the
senses
within
a
sense
hierarchy
.
In
this
section
we
present
the
challenge
implicit
in
applying
sense
merging
to
full
taxonomies
,
and
present
our
model
for
clustering
within
a
taxonomy
.
4.1
Challenges
of
clustering
a
sense
taxonomy
The
task
of
clustering
a
sense
taxonomy
presents
certain
challenges
not
present
in
the
problem
of clustering
the
senses
of
a
word
;
in
order
to
create
a
consistent
clustering
of
a
sense
hierarchy
an
algorithm
must
consider
the
transitive
effects
of
merging
synsets
.
This
problem
is
compounded
in
sense
taxonomies
like
WordNet
,
where
each
synset
may
have
additional
structured
relations
,
e.g.
,
hypernym
(
ISA
)
or
holonym
(
is-part-of
)
links
.
In
order
to
consistently
merge
two
noun
senses
with
different
hypernym
ancestries
within
WordNet
,
for
example
,
an
algorithm
must
decide
whether
to
have
the
new
sense
inherit
both
hypernym
ancestries
,
or
whether
to
inherit
only
one
,
and
if
so
it
must
decide
which
ancestry
is
more
relevant
for
the
merged
sense
.
Without
strict
checking
,
human
labelers
will
likely
find
it
difficult
to
label
a
sense
inventory
with
transitively-consistent judgments.

8http://lcl.di.uniroma1.it/coarse-grained-aw/index.html

9We use the SVMperf package, freely available for noncommercial use from http://svmlight.joachims.org; we use the default settings in v2.00, except for the regularization parameter (set in 10-fold cross-validation).

Figure 2: Inconsistent sense clusters for the verbs require and need from Senseval-2 judgments (glosses shown in the figure: "consider obligatory; request and expect", "make someone do something", "require as useful, just, or proper", "have need of", "have or feel a need for"; the clustering shown is based on "require").
As
an
example
,
consider
the
Senseval-2
clusterings
of
the
verbs
require
and
need
,
as
shown
in
Figure
2
.
In
WN
2.1
require
has
four
verb
senses
,
of
which
the
first
has
synonyms
(
necessitate
,
ask
,
postulate
,
need
,
take
,
involve
,
call
for
,
demand
)
,
and
gloss
"
require
as
useful
,
just
,
or
proper
"
;
and
the
fourth
has
synonyms
(
want
,
need
)
,
and
gloss
"
have
need
of
.
"
Within
the
word
require
,
the
Senseval-2
dataset
clusters
senses
1
and
4
,
leaving
the
rest
unclustered
.
In
order
to
make
a
consistent
clustering
with
respect
to
the
sense
inventory
,
however
,
we
must
enforce
the
transitive
closure
by
merging
the
synset
corresponding
to
the
first
sense
(
necessitate
,
ask
,
need
etc.
)
,
with
the
senses
of
want
and
need
in
the
fourth
sense
.
In
particular
,
these
two
senses
correspond
to
WordNet
2.1
senses
need#v#1
and
need#v#2
,
respectively
,
which
are
not
clustered
according
to
the
Senseval-2
word-specific
labeling
for
need
-
need#v#1
is
listed
as
a
singleton
(
i.e.
,
unclustered
)
sense
,
though
need#v#2
is
clustered
with
need#v#3
,
"
have
or
feel
a
need
for
.
"
While
one
might
hope
that
such
disagreements
between
sense
clusterings
are
rare
,
we
found
178
such
transitive
closure
disagreements
in
the
Senseval-2
data
.
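Such disagreements can be detected mechanically by closing the pairwise merge judgments transitively (for instance with union-find) and checking each explicitly unmerged pair against the resulting components; the sense labels below are shorthand for the WordNet 2.1 senses discussed above, not real WordNet identifiers:

```python
# Sketch: detecting transitive-closure disagreements in a set of pairwise
# sense-merge judgments, using union-find. A disagreement occurs when two
# synsets are explicitly left unclustered under one word, yet fall in the
# same cluster once all merge judgments are closed transitively.

class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def closure_disagreements(merged_pairs, unmerged_pairs):
    """Return unmerged pairs that the transitive closure forces together."""
    uf = UnionFind()
    for a, b in merged_pairs:
        uf.union(a, b)
    return [(a, b) for a, b in unmerged_pairs if uf.find(a) == uf.find(b)]

# The require/need example from the text: require#1 contains need#v#1,
# require#1 is merged with require#4, and require#4 contains need#v#2,
# but need#v#1 and need#v#2 are left unclustered under "need".
merged = [("require#1", "require#4"), ("require#1", "need#v#1"),
          ("require#4", "need#v#2")]
unmerged = [("need#v#1", "need#v#2")]
print(closure_disagreements(merged, unmerged))
```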
The
OntoNotes
data
is
much
cleaner
in
this
respect
,
most
likely
due
to
the
stricter
annotation
standard
(
Hovy
et
al.
,
2006
)
;
we
found
only
one
transitive
closure
disagreement
in
the
OntoNotes
data
,
specifically
WordNet
2.1
synsets
(head#v#2, lead#v#7: "be in charge of") and (head#v#3, lead#v#4: "travel in front of")
are
clustered
under
head
but
not
under
lead
.
4.2
Sense
clustering
within
a
taxonomy
As
a
solution
to
the
previously
mentioned
challenges
,
in
order
to
produce
taxonomies
of
different
sense
granularities
with
consistent
sense
distinctions
we
propose
to
apply
agglomerative
clustering
over
all
synsets
in
WordNet
2.1
.
While
one
might
consider
recalculating
synset
similarity
features
after
each
synset
merge
operation
,
depending
on
the
feature
set
this
could
be
prohibitively
expensive
;
for
our
purposes
we
use
average-link
agglomerative
clustering
,
in
effect
approximating
the
pairwise
similarity
score
between
a
given
synset
and
a
merged
sense
as
the
average
of
the
similarity
scores
between
the
given
synset
and
the
clustered
sense
's
component
synsets
.
Further
,
for
the
purpose
of
sense
clustering
we
assume
a
zero
sense
similarity
score
between
synsets
with
no
intersecting
words
.
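A minimal sketch of this average-link procedure over classifier outputs; the similarity values and the 0.5 threshold below are illustrative, and missing pairs default to the zero similarity assumed in the text:

```python
from itertools import combinations

# Sketch of average-link agglomerative clustering over pairwise merge
# probabilities: each synset starts as its own cluster, and the two most
# similar clusters are merged until no pair exceeds the threshold.
# `sim` maps frozenset pairs to merge probabilities; absent pairs
# (synsets sharing no word) count as 0.0.

def avg_link(cluster_a, cluster_b, sim):
    """Average pairwise similarity between the members of two clusters."""
    scores = [sim.get(frozenset((x, y)), 0.0)
              for x in cluster_a for y in cluster_b]
    return sum(scores) / len(scores)

def agglomerate(synsets, sim, threshold=0.5):
    clusters = [frozenset([s]) for s in synsets]
    while True:
        best = max(combinations(clusters, 2),
                   key=lambda p: avg_link(p[0], p[1], sim), default=None)
        if best is None or avg_link(best[0], best[1], sim) < threshold:
            return clusters
        a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]

sim = {frozenset(("s1", "s2")): 0.9, frozenset(("s2", "s3")): 0.6,
       frozenset(("s1", "s3")): 0.2}
print(sorted(sorted(c) for c in agglomerate(["s1", "s2", "s3"], sim)))
```

Note how the threshold realizes the "arbitrary sense granularity" of the text: lowering it yields coarser clusterings, raising it finer ones.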
Without
exploiting
additional
hypernym
or
coordinate-term
evidence
,
our
algorithm
does
not
distinguish
between
judgments
about
which
hypernym
ancestry
or
other
structured
relationships
to
keep
or
remove
upon
merging
two
synsets
.
In
lieu
of
additional
evidence
,
for
our
experiments
we
choose
to
retain
only
the
hypernym
ancestry
of
the
sense
with
the
highest
frequency
in
SemCor
,
breaking
frequency
ties
by
choosing
the
first-listed
sense
in
WordNet
.
We
add
every
other
relationship
(
meronyms
,
entailments
,
etc.
)
to
the
new
merged
sense
(
except
in
the
rare
case
where
adding
a
relation
would
cause
a
cycle
in
acyclic
relations
like
hypernymy
or
holonymy
,
in
which
case
we
omit
it
)
.
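The cycle check when copying relations onto a merged sense can be sketched as a reachability test before each edge insertion; the graph below is a toy relation map, not WordNet data:

```python
# Sketch of relation handling when two synsets are merged: the merged sense
# keeps one hypernym ancestry (chosen elsewhere, by SemCor frequency), and
# inherits every other relation from both synsets unless adding an edge
# would create a cycle in an acyclic relation such as hypernymy or holonymy.

def reachable(graph, start, goal):
    """DFS: is `goal` reachable from `start` following relation edges?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False

def add_edge_if_acyclic(graph, src, dst):
    """Add src->dst unless dst already reaches src (which would close a cycle)."""
    if reachable(graph, dst, src):
        return False  # omit the edge, as in the rare cycle case in the text
    graph.setdefault(src, []).append(dst)
    return True

hyper = {"a": ["b"], "b": ["c"]}
print(add_edge_if_acyclic(hyper, "c", "a"))  # False: would create a cycle
print(add_edge_if_acyclic(hyper, "a", "c"))  # True
```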
Using
this
clustering
method
we
have
produced
several
sense-clustered
WordNets
of
varying
sense
granularity
,
which
we
evaluate
in
Section
5.3
.
5
Evaluation
We
evaluate
our
classifier
in
a
comparison
with
thirteen
previously
proposed
similarity
measures
and
automatic
methods
for
sense
clustering
.
We
conduct
a
feature
ablation
study
to
explore
the
relevance
of
the
different
features
in
our
system
.
Finally
,
we
evaluate
the
sense-clustered
taxonomies
we
create
on
the
problem
of
providing
improved
coarse-grained
sense
distinctions
for
WSD
evaluation
.
5.1
Evaluation
of
automatic
sense
merging
We
evaluate
our
classifier
on
two
held-out
test
sets
;
first
,
a
30
%
sample
of
the
sense
judgments
from
the
merged
gold
standard
dataset
consisting
of
both
the
Senseval-2
and
OntoNotes
sense
judgments
;
and
,
second
,
a
test
set
consisting
of
only
the
OntoNotes
subset
of
our
first
held-out
test
set
.
For
comparison
we
implement
thirteen
of
the
methods
discussed
in
Section
2
.
First
,
we
evaluate
each
of
the
eight
WordNet
:
:
Similarity
measures
individually
.
Next
,
we
implement
cosine
similarity
of
topic
signatures
(
TopSig
)
built
from
monosemous
relatives
(
Agirre
and
Lopez
,
2003
)
,
which
provides
a
real-valued
similarity
score
for
noun
synset
pairs
.
Additionally
,
we
implement
the
two
methods
proposed
in
(
Peters
et
al.
,
1998
)
,
namely
using
metonymy
clusters
(
MetClust
)
and
generalization
clusters
(
GenClust
)
based
on
the
Cousin
relationship
in
WordNet
.
While
(
Peters
et
al.
,
1998
)
only
considers
four
cousin
pairs
,
we
re-implement
their
method
for
general
purpose
sense
clustering
by
using
all
226
cousin
pairs
defined
in
WordNet
1.6
,
mapped
to
WordNet
2.1
synsets
.
These
methods
each
provide
a
single
clustering
of
noun
synsets
.
Next
,
we
implement
the
set
of
semantic
rules
described
in
(
Mihalcea
and
Moldovan
,
2001
)
(
MiMo
)
;
this
algorithm
for
merging
senses
is
based
on
6
semantic
rules
,
in
effect
using
a
subset
of
the
Twin
,
MaxMN
,
Pertainym
,
Antonym
,
and
Verb-Group
features
;
in
our
implementation
we
set
the
parameter
for
when
to
cluster
based
on
number
of
twins
to
K
=
2
;
this
results
in
a
single
clustering
for
each
of
nouns
,
verbs
,
and
adjectives
.
Finally
,
we
compare
against
the
mapping
from
WordNet
to
the
Oxford
English
Dictionary
constructed
in
(
Navigli
,
2006
)
,
equivalent
to
clustering
based
solely
on
the
OED
feature
.
Considering
merging
senses
as
a
binary
classification
task
,
Table
3
gives
the
F-score
performance
of
our
classifier
vs.
the
thirteen
other
classifiers
and
an
uninformed
"
merge
all
synsets
"
baseline
on
our
held-out
gold
standard
test
set
.
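Treating sense merging as binary classification, the evaluation metric reduces to precision, recall, and their harmonic mean over sense pairs; a minimal sketch (the pair labels are illustrative):

```python
# Sketch of the F-score evaluation: precision and recall of predicted
# "merged" sense pairs against the gold-standard merged pairs.

def f_score(predicted, gold):
    """predicted, gold: sets of sense pairs judged 'merged'."""
    tp = len(predicted & gold)
    if not predicted or not gold or tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("a1", "a2"), ("b1", "b2"), ("c1", "c2")}
pred = {("a1", "a2"), ("b1", "b2"), ("d1", "d2")}
print(round(f_score(pred, gold), 3))  # 0.667
```

Sweeping a probability threshold over the classifier's graduated outputs and recomputing precision and recall at each cutoff yields curves like those in Figures 3 and 4.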
This
table
shows
that
our
SVM
classifier
outperforms
all
implemented
methods
on
the
basis
of
F-score
on
both
datasets.

Table 3: F-score sense merging evaluation on hand-labeled test sets for all parts of speech (methods shown include an uninformed Baseline, GenClust, and MetClust, on the Senseval-2+OntoNotes and OntoNotes test sets).
In
Figure
3
we
give
a
precision
/
recall
plot
for
noun
sense
merge
judgments
for
the
Senseval-2
+
OntoNotes
dataset
.
For
the sake
of
simplicity
we
plot
only
the
two
best
measures
(
RES
and
WUP
)
of
the
eight
WordNet-based
similarity
measures
;
we
see
that
our
classifier
,
RES
,
and
WUP each have higher precision at all
levels
of
recall
compared
to
the
other
tested
measures
.
Of
the
methods
we
compare
against
,
only
the
WordNet-based
similarity
measures
,
(
Mihalcea
and
Moldovan
,
2001
)
,
and
(
Navigli
,
2006
)
provide
a
method
for
predicting
verb
similarities
;
our
learned
measure
widely
outperforms
these
methods
,
achieving
a
13.6
%
F-score
improvement
over
the
Lesk
similarity
measure
.
In
Figure
4
we
give
a
precision
/
recall
plot
for
verb
sense
merge
judgments
,
plotting
the
performance
of
the
three
best
WordNet-based
similarity
measures
;
here
we
see
that
our
classifier
has
significantly
higher
precision
than
all
other
tested
measures
at
nearly
every
level
of
recall
.
Only
the
measures
provided
by
Lesk
,
HSO
,
VEC
,
(
Mihalcea
and
Moldovan
,
2001
)
,
and
(
Nav-igli
,
2006
)
provide
a
method
for
predicting
adjective
similarities
;
of
these
,
only
Lesk
and
Vec
outperform
the
uninformed
baseline
on
adjectives
,
while
our
learned
measure
achieves
a
4.0
%
improvement
over
the
Lesk
measure
on
adjectives
.
5.2
Feature
analysis
Next
we
analyze
our
feature
space
.
Figure 3: Precision/Recall plot for noun sense merge judgments as evaluated on our held-out test set.

Table 4 gives the ablation analysis for all features used in our system; here
the
quantity
listed
in
the
table
is
the
F-score
loss
obtained
by
removing
that
single
feature
from
our
feature
space
,
and
retraining
and
retesting
our
classifiers
,
keeping
everything
else
the
same
.
Here
negative
scores
correspond
to
an
improvement
in
classifier
performance
with
the
removal
of
the
feature
.
For
noun
classification
,
the
three
features
that
yield
the
highest
gain
in
testset
F-score
are
the
topic
signature
,
OED
,
and
derivational
link
features
,
yielding
a
4.0
%
,
3.6
%
,
and
3.5
%
gain
,
respectively
.
For
verb
classification
,
we
find
that
three
features
yield
more
than
a
5
%
F-score
gain
;
by
far
the
largest
single-feature
performance
gain
for
verb
classification
found
in
our
ablation
study
was
the
Deriv
feature
,
i.e.
,
the
count
of
shared
derivational
links
between
the
two
synsets
;
this
single
feature
improves
our
maximum
F-score
by
9.8
%
on
the
testset
.
This
is
a
particularly
interesting
discovery
,
as
none
of
the
referenced
automatic
techniques
for
sense
clustering
presently
make
use
of
this
very
useful
feature
.
We
also
achieve
large
gains
with
the
Lin
and
Lesk
similarity
features
,
with
F-score
improvement
of
7.4
%
and
5.4
%
gain
respectively
.
For
adjective
classification
again
the
Deriv
feature
proved
very
helpful
,
with
a
3.5
%
gain
on
the
testset
.
Interestingly
,
only
the
Deriv
feature
and
the
SenseCnt
features
helped
across
all
parts
of
speech
;
Figure 4: Precision/Recall plot for verb sense merge judgments (plotted: SVM Classifier, Lesk Measure, Hirst & St-Onge, Wu & Palmer, OED Mapping, Semantic Rules).

in many cases a feature which proved to be
very
helpful
for
one
part
of
speech
actually
hurt
performance
on
another
part
of
speech
(
e.g.
,
LIN
on
nouns
and
OED
on
adjectives
)
.
5.3
Evaluation
of
sense-clustered
Wordnets
Our
goal
in
clustering
a
sense
taxonomy
is
to
produce
fully
sense-clustered
WordNets
,
and
to
be
able
to
produce
coarse-grained
Wordnets
at
many
different
levels
of
resolution
.
In
order
to
evaluate
the
entire
sense-clustered
taxonomy
,
we
have
employed
an
evaluation
method
inspired
by
Word
Sense
Disambiguation
(
this
is
similar
to
an
evaluation
used
in
Navigli
,
2006
,
however
we
do
not
remove
monosemous
clusters
)
.
Given
past
system
responses
in
the
Senseval-3
English
all-words
task
,
we
can
evaluate
past
systems
on
the
same
corpus
,
but
using
the
coarse-grained
sense
hierarchy
provided
by
our
sense-clustered
taxonomy
.
We
may
then
compare
the
scores
of
each
system
on
the
coarse-grained
task
against
their
scores
given
a
random
clustering
at
the
same
resolution
.
Our
expectation
is
that
,
if
our
sense
clustering
is
much
better
than
a
random
sense
clustering
(
and
,
of
course
,
that
the
WSD
algorithms
perform
better
than
random
guessing
)
,
we
will
see
a
marked
improvement
in
the
performance
of
WSD
algorithms
using
our
coarse-grained
sense
hierarchy
.
Table 4: Feature ablation study; F-score difference obtained by removal of the single feature (differences are given for nouns, verbs, and adjectives; features listed include SenseNum, SenseCnt, and Pertainym).
2004
)
.
A
guess
by
a
system
is
given
full
credit
if
it
was
either
the
correct
answer
or
if
it
was
in
the
same
cluster
as
the
correct
answer
.
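The scoring rule can be sketched directly; the cluster map and sense labels below are illustrative:

```python
# Sketch of the coarse-grained WSD scoring rule: a system guess receives
# full credit if it is the correct sense or falls in the same cluster as
# the correct sense.

def coarse_score(guesses, answers, cluster_of):
    """Fraction of instances scored correct under the clustering."""
    credit = sum(1 for g, a in zip(guesses, answers)
                 if g == a or (g in cluster_of
                               and cluster_of.get(g) == cluster_of.get(a)))
    return credit / len(answers)

cluster_of = {"bass#1": "music", "bass#2": "music", "bass#4": "fish"}
guesses = ["bass#2", "bass#4"]
answers = ["bass#1", "bass#1"]
print(coarse_score(guesses, answers, cluster_of))  # 0.5
```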
Clearly
any
amount
of
clustering
will
only
increase
WSD
performance
.
Therefore
,
to
account
for
this
natural
improvement
and
consider
only
the
effect
of
our
particular
clustering
,
we
also
calculate
the
expected
score
for
a
random
clustering
of
the
same
granularity
,
as
follows
:
Let
C
represent
the
set
of
clusters
over
the
possible
N
synsets
containing
a
given
word
;
we
then
calculate
the
expectation
that
an
incorrectly-chosen
sense
and
the
actual
correct
sense
would
be
clustered
together
in
the
random
clustering.
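One plausible formalization of this expectation (an assumption for illustration, not necessarily the paper's exact formula) depends only on the cluster sizes: if the correct sense is uniform over the N senses of a word and the incorrect guess uniform over the remaining N-1, the chance they share a cluster is the sum over clusters c of (|c|/N)((|c|-1)/(N-1)):

```python
# Sketch of the expected credit a random clustering gives an incorrect
# guess, as a function of cluster sizes alone. This formalization is an
# assumption for illustration; the paper's exact formula is not shown here.

def expected_random_credit(cluster_sizes):
    n = sum(cluster_sizes)
    if n < 2:
        return 0.0
    return sum((c / n) * ((c - 1) / (n - 1)) for c in cluster_sizes)

# Example: four clusters over eight senses, sizes 2, 2, 3, and 1
# (as in the bass grouping of Figure 1):
print(round(expected_random_credit([2, 2, 3, 1]), 4))  # 0.1786
```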
Our
sense
clustering
algorithm
provides
little
improvement
over
random
clustering
when
too
few
or
too
many
clusters
are
chosen
;
however
,
with
an
appropriate
threshold
for
average-link
clustering
we
find
a
maximum
of
3.55
%
F-score
improvement
in
WSD
over
random
clustering
(
averaged
over
the
decisions
of
the
top
3
WSD
algorithms
)
.
Table
5
shows
the
improvement
of
the
three
top
WSD
algorithms
given
a
sense
clustering
created
by
our
algorithm
vs.
a
random
clustering
at
the
same
granularity
.
Figure 5: WSD Improvement with coarse-grained sense hierarchies (group-average agglomerative clustering; x-axis: sense merge iterations).

Table 5: Improvement in Senseval-3 WSD performance using our average-link agglomerative clustering vs. random clustering at the same granularity (systems: Gambl, SenseLearner, KOC Univ.; column: Avg-link Impr.).
6
Conclusion
We
have
presented
a
classifier
for
automatic
sense
merging
that
significantly
outperforms
previously
proposed
automatic
methods
.
In
addition
to
its
novel
use
of
supervised
learning
and
the
integration
of
many
previously
proposed
features
,
it
is
interesting
that
one
of
our
new
features
,
the
Deriv
count
of
shared
derivational
links
between
two
synsets
,
proved
an
extraordinarily
useful
new
cue
for
sense-merging
,
particularly
for
verbs
.
We
also
show
how
to
integrate
this
sense-merging
algorithm
into
a
model
for
sense
clustering
full
sense
taxonomies
like
WordNet
,
incorporating
taxonomic
constraints
such
as
the
transitive
effects
of
merging
synsets
.
Using
this
model
,
we
have
produced
several
WordNet
taxonomies
of
various
sense
granularities
;
we
hope
these
new
lexical
resources
will
be
useful
for
NLP
applications
that
require
a
coarser-grained
sense
hierarchy
than
that
already
found
in
WordNet
.
Acknowledgments
Thanks
to
Marie-Catherine
de
Marneffe
,
Mona
Diab
,
Christiane
Fellbaum
,
Thad
Hughes
,
and
Benjamin
Packer
for
useful
discussions
.
Rion
Snow
is
supported
by
an
NSF
Fellowship
.
This
work
was
supported
in
part
by
the
Disruptive
Technology
Office
(
DTO
)'s
Advanced
Question
Answering
for
Intelligence
(
AQUAINT
)
Phase
III
Program
.
