In this paper we explore the use of selectional preferences for detecting non-compositional verb-object combinations. To characterise the arguments in a given grammatical relationship we experiment with three models of selectional preference. Two use WordNet and one uses the entries from a distributional thesaurus as classes for representation. In previous work on selectional preference acquisition, the classes used for representation are selected according to the coverage of argument tokens rather than the coverage of argument types. In our distributional thesaurus models and one of the methods using WordNet, we select classes for representing the preferences by virtue of the number of argument types that they cover, and then only tokens under these classes which are representative of the argument head data are used to estimate the probability distribution for the selectional preference model. We demonstrate a highly significant correlation between measures which use these 'type-based' selectional preferences and compositionality judgements from a data set used in previous research. The type-based models perform better than the models which use tokens for selecting the classes. Furthermore, the models which use the automatically acquired thesaurus entries produced the best results. The correlation for the thesaurus models is stronger than any of the individual features used in previous research on the same dataset.
1 Introduction

A considerable body of recent work addresses the compositionality of phrases (Baldwin et al., 2003; McCarthy et al., 2003; Bannard, 2005; Venkatapathy and Joshi, 2005).
Typically the phrases are putative multiwords, and non-compositionality is viewed as an important feature of many such "words with spaces" (Sag et al., 2002).
For applications such as paraphrasing, information extraction and translation, it is essential to take the words of non-compositional phrases together as a unit, because the meaning of such a phrase cannot be obtained straightforwardly from its constituent words.
In this work we investigate methods of determining the semantic compositionality of verb-object1 combinations on a continuum, following previous research in this direction (McCarthy et al., 2003; Venkatapathy and Joshi, 2005). Much previous research has used a combination of statistics and distributional approaches whereby distributional similarity is used to compare the constituents of the multiword with the multiword itself. In this paper, we will investigate the use of selectional preferences of verbs. We will use the preferences to find atypical verb-object combinations, as we anticipate that such combinations are more likely to be non-compositional.

1We use object to refer to direct objects.
Selectional preferences of predicates have been modelled using the man-made thesaurus WordNet (Fellbaum, 1998); see for example (Resnik, 1993; Li and Abe, 1998; Abney and Light, 1999; Clark and Weir, 2002).
There are also distributional approaches which use co-occurrence data to cluster distributionally similar words together. The cluster output can then be used as classes for selectional preferences (Pereira et al., 1993), or one can directly use frequency information from distributionally similar words for smoothing (Grishman and Sterling, 1994).
We used three different types of probabilistic models, which vary in the classes selected for representation over which the probability distribution of the argument heads2 is estimated. Two use WordNet and the other uses the entries in a thesaurus of distributionally similar words acquired automatically following (Lin, 1998).
The first method is due to Li and Abe (1998). The classes over which the probability distribution is calculated are selected according to the minimum description length principle (MDL), which uses the argument head tokens for finding the best classes for representation. This method has previously been tried for modelling compositionality of verb-particle constructions (Bannard, 2002).
The other two methods (we refer to them as 'type-based') also calculate a probability distribution using argument head tokens, but they select the classes over which the distribution is calculated using the number of argument head types (of a verb in a corpus) in a given class, rather than the number of argument head tokens, in contrast to previous WordNet models (Resnik, 1993; Li and Abe, 1998; Clark and Weir, 2002).
For example, if the object slot of the verb park contains the argument heads {car, car, car, car, van, jeep}, then the type-based models use the word type "car" only once when determining the classes over which the probability distribution is to be estimated. Classes are selected which maximise the number of types that they cover, rather than the number of tokens.
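To make the type/token distinction concrete, the following minimal sketch (Python; the data and the type_coverage helper are illustrative, not from the original implementation) contrasts the two views of the same argument head data:

    from collections import Counter

    heads = ["car", "car", "car", "car", "van", "jeep"]  # objects of "park", as above

    token_counts = Counter(heads)   # token view: {"car": 4, "van": 1, "jeep": 1}
    types = set(heads)              # type view: {"car", "van", "jeep"}, each counted once

    def type_coverage(class_members, types):
        # a type-based model credits a candidate class once per covered type
        return len(types.intersection(class_members))

    print(type_coverage({"car", "van", "jeep", "bus"}, types))  # -> 3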
This is done to avoid the selectional preferences being heavily influenced by noise from highly frequent arguments which may be polysemous, where some or all of their meanings may not be semantically related to the 'prototypical' arguments of the verb. For example, car has a gondola sense in WordNet.

2Argument heads are the nouns occurring in the object slot of the target verb.
The third method uses entries in a distributional thesaurus rather than classes from WordNet. The entries used as classes for representation are selected by virtue of the number of argument types they encompass. As with the WordNet models, the tokens are used to estimate a probability distribution over these entries.
In the next section, we discuss related work on identifying compositionality. In section 3, we describe the methods we are using for acquiring our models of selectional preference. In section 4, we test our models on a dataset used in previous research. We compare the three types of models individually and also investigate the best performing model when used in combination with other features used in previous research. We conclude in section 5.
2
Related
Work
Most
previous
work
using
distributional
approaches
to
compositionality
either
contrasts
distributional
information
of
candidate
phrases
with
constituent
words
(
Schone
and
Jurafsky
,
2001
;
Bannard
et
al.
,
or
uses
distributionally
similar
words
to
detect
nonproductive
phrases
(
Lin
,
1999
)
.
Lin (1999) used his method (Lin, 1998) for automatic thesaurus construction. He identified candidate phrases involving several open-class words output from his parser and filtered these by the log-likelihood statistic.
Lin proposed that if there is a phrase obtained by substitution of either the head or the modifier in the phrase with a 'nearest neighbour' from the thesaurus, then the mutual information of this and the original phrase must be significantly different for the original phrase to be considered non-compositional. He evaluated the output manually.
As well as distributional similarity, researchers have used a variety of statistics as indicators of non-compositionality (Blaheta and Johnson, 2001; Krenn and Evert, 2001).
Fazly and Stevenson (2006) use statistical measures of syntactic behaviour to gauge whether a verb and noun combination is likely to be an idiom. Although they are not specifically detecting compositionality, there is a strong correlation between syntactic rigidity and semantic idiosyncrasy.
Venkatapathy and Joshi (2005) combine different statistical and distributional methods using support vector machines (SVMs) for identifying non-compositional verb-object combinations. They explored seven features as measures of compositionality:
1. frequency of the verb-object pair

2. pointwise mutual information (Church and Hanks, 1990)

3. least mutual information difference with similar collocations, based on (Lin, 1999) and using Lin's thesaurus (Lin, 1998) for obtaining the similar collocations

4. the distributed frequency of an object, which takes an average of the frequency of occurrence with an object over all verbs occurring with the object above a threshold

5. the distributed frequency of an object, using the verb, which considers the similarity between the target verb and the verbs occurring with the target object above the specified threshold

6. an LSA (latent semantic analysis) approach considering the similarity of the verb-object pair with the constituent verb

7. the same LSA approach, but considering the similarity of the verb-object pair with the verbal form of the object (to capture support verb constructions, e.g. give a smile).
We say more about this dataset and Venkatapathy and Joshi's results in section 4, since we use the dataset for our experiments.
In this paper, we investigate the use of selectional preferences to detect compositionality. Bannard (2002) did some pioneering work to try to establish a link between the compositionality of verb-particle constructions and the selectional preferences of the multiword and its constituent verb.
His results were hampered by models based on (Li and Abe, 1998), which involved rather uninformative models at the roots of WordNet. There are several reasons for this. The classes for the model are selected using MDL by compromising between a simple model with few classes and one which explains the data well. The models are particularly affected by the quantity of data available (Wagner, 2002). Also, noise from frequent but idiosyncratic or polysemous arguments weakens the signal.
There is scope for experimenting with other approaches, such as (Clark and Weir, 2002); however, we feel a type-based approach is worthwhile to avoid the noise introduced from frequent but polysemous arguments and the bias from highly frequent arguments which might be part of a multiword, rather than a prototypical argument of the predicate in question, for example eat hat. In contrast to Bannard, our experiments are with verb-object combinations rather than verb-particle constructions.
We compare Li and Abe models with WordNet models which use the number of argument types to obtain the classes for representation of the selectional preferences. In addition to experiments with these WordNet models, we propose models using entries in distributional thesauruses for representing preferences.
3 Three Methods for Acquiring Selectional Preferences

All models were acquired from verb-object data extracted using the RASP parser (Briscoe and Carroll, 2002) from the 90 million words of written English in the BNC (Leech, 1992). We extracted verb and common noun tuples where the noun is the argument head of the object relation. The parser was also used to extract the grammatical relation data used for acquisition of the thesaurus described below in section 3.3.
3.1 TCMs

This approach is a reimplementation of Li and Abe (1998). Each selectional preference model (referred to as a tree cut model, or TCM) comprises a set of disjunctive noun classes selected from all the possibilities in the WordNet hyponym hierarchy3 using MDL (Rissanen, 1978). The TCM covers all the noun senses in the WordNet hierarchy and is associated with a probability distribution over these noun senses, reflecting the argument head data occurring in the given grammatical relationship with the specified verb.

3We use WordNet version 2.1 for the work in this paper.
MDL finds the classes in the TCM by considering the cost, measured in bits, of describing both the model and the argument head data encoded in the model. A compromise is made by having as simple a model as possible, using classes further up the hierarchy, whilst also providing a good model for the set of argument head tokens (TK). The classes are selected by recursing from the top of the WordNet hierarchy, comparing the cost (or description length) of using the mother class to the cost of using the hyponym daughter classes. In any path, the mother is preferred unless using the daughters would reduce the cost. If using the daughters for the model is less costly than the mother, then the recursion continues to compare the cost of the hyponyms beneath.
The cost (or description length) for a set of classes is calculated as the sum of the model description length (MDL) and the data description length (DDL). The model description length is k/2 · log |TK|, where k is the number of WordNet classes currently being considered for the TCM, minus one. The MDL method uses the size of TK on the assumption that a larger dataset warrants a more detailed model.
The cost of describing the argument head data is calculated using the log of the probability estimate from the classes currently being considered for the model. The probability estimate for a class being considered for the model is calculated using the cumulative frequency of all the hyponym nouns under that class that occur in TK, divided by the number of noun senses that these nouns have, to account for their polysemy. This cumulative frequency is also divided by the total number of noun hyponyms under that class in WordNet, to obtain a smoothed estimate for all nouns under the class. The probability of the class is obtained by dividing this frequency estimate by the total frequency of the argument heads. The algorithm is described fully by Li and Abe (1998).4
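The following sketch (Python) shows one way the two costs might be computed; the class representation (cls["nouns"], cls["num_hyponyms"]) and the other inputs are hypothetical stand-ins for WordNet lookups, and this reflects our reading of the description above rather than Li and Abe's implementation:

    import math

    def class_prob(cls, head_freqs, senses):
        # cumulative frequency of attested hyponym nouns, discounted by polysemy,
        # divided by the total frequency of the argument heads
        freq = sum(head_freqs[n] / senses[n] for n in cls["nouns"] if n in head_freqs)
        return freq / sum(head_freqs.values())

    def description_length(cut, head_freqs, senses):
        # model cost: (k/2) * log |TK|, with k the number of classes minus one
        num_tokens = sum(head_freqs.values())
        model_dl = ((len(cut) - 1) / 2) * math.log2(num_tokens)
        # data cost: negative log probability of the argument head tokens,
        # each class probability smoothed uniformly over its WordNet hyponyms
        data_dl = 0.0
        for cls in cut:
            p_noun = class_prob(cls, head_freqs, senses) / cls["num_hyponyms"]
            covered = sum(head_freqs[n] for n in cls["nouns"] if n in head_freqs)
            if covered and p_noun > 0:
                data_dl -= covered * math.log2(p_noun)
        return model_dl + data_dl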
4See (Li and Abe, 1998) for a full explanation.

Figure 1: Portion of the TCM for the objects of park.
A small portion of the TCM for the object slot of park is shown in figure 1. WordNet classes are displayed in boxes with a label which best reflects the meaning of the class. The probability estimates are shown for the classes on the TCM. Examples of the argument head data are displayed below the WordNet classes, with dotted lines indicating membership of a hyponym class beneath these classes. We cannot show the full TCM due to lack of space, but we show some of the higher probability classes which cover some typical nouns that occur as objects of park.
Note that the probability under the classes abstract-entity, way and location arises because of a systematic parsing error whereby adverbials, such as distance in park illegally some distance from the railway station, are identified by the parser as objects. Systematic noise from the parser has an impact on all the selectional preference models described in this paper.
3.2 WNPROTOs

We propose a method of acquiring selectional preferences which, instead of covering all the noun senses in WordNet, gives a probability distribution over just a portion of prototypical classes; we refer to these models as WNPROTOs. A WNPROTO consists of classes within the noun hierarchy which have the highest proportion of word types occurring in the argument head data, rather than using the number of tokens, or frequency, as is used for the TCMs. This allows less frequent, but potentially informative, arguments to have some bearing on the models acquired, reducing the impact of highly frequent but polysemous arguments. We then used the frequency data to populate these selected classes.
The classes (C) in the WNPROTO are selected from those which include at least a threshold of 2 argument head types5 occurring in the training data. Each argument head in the training data is disambiguated according to whichever of the WordNet classes it occurs at or under has the highest 'type ratio'. Let TY be the set of argument head types in the object slot of the verb for which we are acquiring the preference model.
The type ratio for a class (c) is the ratio of noun types (ty ∈ TY) occurring in the training data that are also listed at or beneath that class in WordNet to the total number of noun types listed at or beneath that particular class in WordNet. The argument types attested in the training data are divided by the number of WordNet classes that the noun belongs to (classes(ty)), to account for polysemy in the training data.
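As a minimal sketch (Python; wordnet_types_under and classes_of are hypothetical stand-ins for WordNet lookups, not part of the original implementation):

    def type_ratio(c, TY, wordnet_types_under, classes_of):
        # attested argument head types at or beneath class c, each discounted
        # by its polysemy, relative to all noun types beneath c in WordNet
        under = wordnet_types_under(c)
        attested = sum(1.0 / len(classes_of(ty)) for ty in TY if ty in under)
        return attested / len(under)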
If more than one class has the same type ratio, then the argument is not used for calculating the probability of the preference model. In this way, only arguments that can be disambiguated are used for calculating the probability distribution.
The advantage of using the type ratio to determine the classes used to represent the model and to disambiguate the arguments is that it prevents high frequency verb-noun combinations from masking the information from prototypical but low frequency arguments. We wish to use classes which are as representative of the argument head types as possible, to help detect when an argument head is not related to these classes and is therefore more likely to be non-compositional.
For example, the class motor_vehicle is selected for the WNPROTO model of the object slot of park even though there are 5 meanings of car in WordNet, including elevator_car and gondola. There are 174 occurrences of car, which overwhelms the frequency of the other objects (e.g. van 11, vehicle 8), but by looking for classes with a high proportion of types (rather than word tokens), car is disambiguated appropriately and the class motor_vehicle is selected for representation.
5We have experimented with a threshold of 3 and obtained similar results.

Figure 2: Part of the WNPROTO for the object slot of park.
The relative frequency of each class is obtained from the set of disambiguated argument head tokens and used to provide the probability distribution over this set of classes. Note that in a WNPROTO, classes can be subsumed by others in the hyponym hierarchy. The probability assigned to a class is applicable to any descendants in the hyponym hierarchy, except those within any hyponym classes within the WNPROTO. The algorithm for selecting C and calculating the probability distribution is shown as Algorithm 1. Note that we use brackets for comments.
In figure 2 we show a small portion of the WNPROTO for park. Again, WordNet classes are displayed in boxes with a label which best reflects the meaning of the class. The probability estimates are shown in the boxes for all the classes included in the WNPROTO. The classes in the WNPROTO model are shown with dashed lines. Examples of the argument head data are displayed below the WordNet classes, with dotted lines indicating membership of a hyponym class beneath these classes. We cannot show the full WNPROTO due to lack of space, but we show some of the classes with higher probability which cover some typical nouns that occur as objects of park.
Algorithm 1 WNPROTO algorithm
  ...
  remove c from C {classes with less than two disambiguated nouns are removed}
  end if
end for
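The following Python sketch reconstructs the WNPROTO procedure from the prose description above; the flat set-of-nouns class representation and the input structures are hypothetical simplifications of the WordNet hierarchy, and this is our reading rather than the original implementation:

    def build_wnproto(head_freqs, class_nouns, n_wn_types, n_senses, min_types=2):
        # class_nouns[c]: set of nouns at or beneath class c; n_wn_types[c]: how
        # many noun types sit beneath c in all of WordNet; n_senses[ty]: number
        # of WordNet classes the noun ty belongs to
        TY = set(head_freqs)

        def type_ratio(c):
            attested = sum(1.0 / n_senses[ty]
                           for ty in class_nouns[c].intersection(TY))
            return attested / n_wn_types[c]

        # candidate classes must include at least `min_types` argument head types
        C = {c for c in class_nouns
             if len(class_nouns[c].intersection(TY)) >= min_types}

        # disambiguate each type to its single best class; ties leave it unused
        assigned = {}
        for ty in TY:
            cands = sorted((c for c in C if ty in class_nouns[c]),
                           key=type_ratio, reverse=True)
            if cands and (len(cands) == 1 or
                          type_ratio(cands[0]) > type_ratio(cands[1])):
                assigned[ty] = cands[0]

        # drop classes with fewer than two disambiguated nouns (as in Algorithm 1),
        # then populate the surviving classes with token frequency
        members = {c: [t for t, cc in assigned.items() if cc == c] for c in C}
        members = {c: ts for c, ts in members.items() if len(ts) >= 2}
        total = sum(head_freqs[t] for ts in members.values() for t in ts)
        return {c: sum(head_freqs[t] for t in ts) / total
                for c, ts in members.items()}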
Algorithm 2 DSPROTO algorithm
fD = 0 {frequency of disambiguated items}
TY = argument head types {nouns occurring as objects of verb, with associated frequencies}
order C1 by num-types-in-thesaurus(cty, TY) {classes ordered by coverage of argument head types}
for all cty in ordered C1 do
  ...
  add ty to Dcty {types disambiguated to this class only if not disambiguated by a class used already}
  end if
end for
p(cty) = ... {calculating class probabilities}
3.3 DSPROTOs

We use a thesaurus acquired using the method proposed by Lin (1998). For input we used the grammatical relation data from automatic parses of the BNC. For each noun we considered the co-occurring verbs in the object and subject relations, the modifying nouns in noun-noun relations and the modifying adjectives in adjective-noun relations.
Each thesaurus entry consists of the target noun and the 50 nouns most similar to the target, according to Lin's measure of distributional similarity. The argument head noun types (TY) are used to find the entries in the thesaurus used as the 'classes' (C) of the selectional preference for a given verb.
As with WNPROTOs, we only cover argument types which form coherent groups with other argument types, since we wish i) to remove noise and ii) to be able to identify argument types which are not related to the other types and therefore may be non-compositional.
As our starting point, we only consider an argument type as a class for C if its entry in the thesaurus covers at least a threshold of 2 types.6 To select C we use a best-first search. This method processes each argument type in TY in order of the number of the other argument types from TY that it has in its thesaurus entry of 50 similar nouns.
An argument head is selected as a class for C (cty ∈ C)7 if it covers at least 2 of the argument heads that are not in the thesaurus entries of any of the other classes already selected for C. Each argument head is disambiguated by whichever class in C under which it is listed in the thesaurus has the largest number of the TY in its thesaurus entry. When the algorithm finishes processing the ordered argument heads to select C, all argument head types are disambiguated by C, apart from those which, after disambiguation, occur in isolation in a class without other argument types.
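A corresponding sketch of the DSPROTO procedure (again Python, and again a hypothetical simplification reflecting our reading of the prose rather than the original implementation):

    def build_dsproto(head_freqs, thesaurus, min_new=2):
        # thesaurus[n]: the 50 nouns most similar to n (Lin, 1998);
        # a class is simply a thesaurus entry
        TY = set(head_freqs)

        def covered(ty):
            # attested argument types in ty's entry (including ty itself)
            return set(thesaurus.get(ty, ())).union({ty}).intersection(TY)

        # best-first: visit types by how many attested types their entry covers
        C, used = [], set()
        for ty in sorted(TY, key=lambda t: len(covered(t)), reverse=True):
            if len(covered(ty) - used) >= min_new:  # adds at least 2 new heads
                C.append(ty)
                used.update(covered(ty))

        # disambiguate each head to the class covering the most argument types
        assigned = {h: max((c for c in C if h in covered(c)),
                           key=lambda c: len(covered(c)), default=None)
                    for h in TY}

        # discard heads left isolated in a class, then estimate the distribution
        members = {c: [h for h, cc in assigned.items() if cc == c] for c in C}
        members = {c: hs for c, hs in members.items() if len(hs) > 1}
        total = sum(head_freqs[h] for hs in members.values() for h in hs)
        return {c: sum(head_freqs[h] for h in hs) / total
                for c, hs in members.items()}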
Finally, a probability distribution over C is estimated using the frequency (tokens) of the argument types that occur in the thesaurus entries for any cty ∈ C. If an argument type occurs in the entry of more than one cty, then it is assigned to whichever of these has the largest number of disambiguated argument head types, and its token frequency is attributed to that class. We show the algorithm as Algorithm 2.

6As with the WNPROTOs, we experimented with a value of 3 for this threshold and obtained similar results.

7We use cty for the classes of the DSPROTO. These classes are simply groups of nouns which occur under the entry of a noun (ty) in the thesaurus.

disambiguated objects (freq):
car (174) van (11) vehicle (8) ...
street (5) distance (4) mile (1) ...
corner (4) lane (3) door (1)
backside (2) bum (1) butt (1) ...

Figure 3: First four classes of the DSPROTO model for park.
The algorithms for WNPROTO (Algorithm 1) and DSPROTO (Algorithm 2) differ because of the nature of the inventories of candidate classes (WordNet and the distributional thesaurus).
There are a great many candidate classes in WordNet. The WNPROTO algorithm selects the classes from all those that the argument heads belong to, directly and indirectly, by looping over all argument types to find the class that disambiguates each by having the largest type ratio, calculated using the undisambiguated argument heads.
The DSPROTO only selects classes from the fixed set of argument types. The algorithm loops over the argument types with at least two argument heads in the thesaurus entry, ordered by the number of undisambiguated argument heads in the thesaurus entry. This is a best-first search to minimise the number of argument heads used in C but maximise the coverage of argument types.
In figure 3, we show part of a DSPROTO model for the object of park.8 Note again that the class mile arises because of a systematic parsing error whereby adverbials, such as distance in park illegally some distance from the railway station, are identified by the parser as objects.
4 Experiments

Venkatapathy and Joshi (2005) produced a dataset of verb-object pairs with human judgements of compositionality. They obtained values of rs between 0.111 and 0.300 by individually applying the 7 features described above in section 2. The best correlation was given by feature 7 and the second best by feature 3. They combined all 7 features using SVMs, splitting their data into test and training data, and achieved an rs of 0.448, which demonstrates significantly better correlation with the human gold standard than any of the features in isolation.

8We cannot show the full model due to lack of space.
We evaluated our selectional preference models using the verb-object pairs produced by Venkatapathy and Joshi (2005).9 This dataset has 765 verb-object collocations which have been given a rating between 1 and 6 by two annotators (both fluent speakers of English).
Kendall's tau (Siegel and Castellan, 1988) was used to measure agreement; a score of 0.61 was obtained, which was highly significant. The ranks of the two annotators gave a Spearman's rank-correlation coefficient (rs) of 0.71.
The verb-object pairs included some adjectives (e.g. happy, difficult, popular), pronouns and complements, e.g. become director.
We used the subset of 638 verb-object pairs that involved common nouns in the object relationship, since our preference models focused on the object relation for common nouns.
For each verb-object pair, we used the preference models acquired from the RASP parses of the BNC to obtain the probability of the class that the object occurs under.
Where the object noun is a member of several classes (classes(noun) ∈ C) in the model, the class with the largest probability is used. Note, though, that for WNPROTOs we have the added constraint that a hyponym class from C is selected in preference to a hypernym in C. Compositionality of an object noun and verb is computed as the probability, under the verb's preference model, of the class selected for the noun in this way.
We use the probability of the class, rather than an estimate of the probability of the object, because we want to determine how likely it is that any word belonging to this class might occur with the given verb, rather than the probability of the specific noun, which may be infrequent, yet typical, of the objects that occur with this verb.
For example, convertible may be an infrequent object of park, but it is quite likely given its membership of the class motor_vehicle. We do not want to assume anything about the frequency of non-compositional verb-object combinations, just that they are unlikely to be members of classes which represent prototypical objects.
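A sketch of the resulting score (Python; classes_of_noun is a hypothetical lookup of the model classes containing the noun, and treating a noun outside the model as probability zero is our assumption):

    def compositionality_score(verb_model, noun, classes_of_noun):
        # verb_model: class -> probability, acquired as above; for WNPROTOs a
        # hyponym class in C would be preferred over a hypernym, a constraint
        # omitted from this sketch
        probs = [verb_model[c] for c in classes_of_noun(noun) if c in verb_model]
        return max(probs, default=0.0)  # low scores suggest non-compositionality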
9This verb-object dataset is available from http://www.cis.upenn.edu/~sriramv/mywork.html.
Table 1: Correlation scores for 638 verb-object pairs (rows: selectional preferences; features from V and J; frequency (f1); combination with SVM).
We will contrast these models with a baseline frequency feature used by Venkatapathy and Joshi.
We use our selectional preference models to provide the probability that a candidate is representative of the typical objects of the verb. That is, if the object might typically occur in such a relationship, then this should lessen the chance that this verb-object combination is non-compositional.
We used the probability of the classes from our 3 selectional preference models to rank the pairs and then used Spearman's rank-correlation coefficient (rs) to compare these ranks with the ranks from the gold standard.
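The comparison amounts to the following sketch (Python, using SciPy purely for illustration; this is not necessarily the software used for the original experiments):

    from scipy.stats import spearmanr

    def evaluate(model_scores, gold_ratings):
        # model_scores: compositionality score per verb-object pair;
        # gold_ratings: the human judgements (1-6) for the same pairs
        rs, p = spearmanr(model_scores, gold_ratings)
        return abs(rs), p  # absolute correlation reported, following V and J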
Our results for the three types of preference models are shown in the first section of table 1.10
All the correlation values are significant, but we note that using the type-based selectional preference models achieves a far greater correlation than using the TCMs. The DSPROTO models achieve the best results, which is very encouraging given that they only require raw data and an automatic parser to obtain the grammatical relations.
10We show absolute values of correlation, following (Venkatapathy and Joshi, 2005).
11The other 3 features performed less well on this dataset, so we do not report the details here. This seems to be because they worked particularly well with the adjective and pronoun data in the full dataset.
These feature values were obtained using the same BNC dataset used by Venkatapathy and Joshi, which was obtained using Bikel's parser (Bikel, 2004).
We obtained correlation values for these features, as shown in table 1 under V and J.11 These features are: feature 1, frequency; feature 2, pointwise mutual information; feature 3, based on (Lin, 1999); and feature 7, the LSA feature which considers the similarity of the verb-object pair with the verbal form of the object.
Pointwise mutual information did surprisingly well on this 84% subset of the data; however, the DSPROTO preferences still outperformed this feature.
We combined the DSPROTO and V and J features with an SVM ranking function and used 10-fold cross-validation, as Venkatapathy and Joshi did. We contrast the result with the V and J features without the preference models. The results in the bottom section of table 1 demonstrate that the preference models can be combined with other features to produce optimal results.
5 Conclusions and Directions for Future Work

We have demonstrated that the selectional preferences of a verbal predicate can be used to indicate whether a specific combination with an object is non-compositional. We have shown that selectional preference models which represent prototypical arguments and focus on argument types (rather than tokens) do well at the task.
Models produced from distributional thesauruses are the most promising, which is encouraging as the technique could be applied to a language without a man-made thesaurus.
We find that the probability estimates from our models show a highly significant correlation and are very promising for detecting non-compositional verb-object pairs, in comparison to individual features used previously.
Further comparison of WNPROTOs and DSPROTOs to other WordNet models is warranted, to contrast the effect of our proposal for disambiguation using word types with iterative approaches, particularly that of Clark and Weir (2002). A benefit of the DSPROTOs is that they do not require a hand-crafted inventory.
It would also be worthwhile comparing the use of raw data directly, both from the BNC and from Google's Web 1T corpus (Brants and Franz, 2006), since web counts have been shown to outperform the Clark and Weir models on a pseudo-disambiguation task (Keller and Lapata, 2003).
We believe that preferences should not be used in isolation. Whilst a low preference for a noun may be indicative of peculiar semantics, this may not always be the case, for example chew the fat.
Certainly it would be worth combining the preferences with other measures, such as syntactic fixedness (Fazly and Stevenson, 2006).
We also believe it is worth targeting features to specific types of constructions; for example, light verb constructions undoubtedly warrant special treatment (Stevenson et al., 2004).
The selectional preference models we have proposed here might also be applied to other tasks. We hope to use these models in tasks such as diathesis alternation detection (McCarthy, 2000; Tsang and Stevenson, 2004) and to contrast them with the WordNet models previously used for this purpose.
6 Acknowledgements

We acknowledge support from the Royal Society UK for a Dorothy Hodgkin Fellowship to the first author. We thank the anonymous reviewers for their constructive comments on this work.
