In this paper, we consider the computational modelling of human plausibility judgements for verb-relation-argument triples, a task equivalent to the computation of selectional preferences. Such models have applications both in psycholinguistics and in computational linguistics. By extending a recent model, we obtain a completely corpus-driven model for this task which achieves significant correlations with human judgements. It rivals or exceeds deeper, resource-driven models while exhibiting higher coverage. Moreover, we show that our model can be combined with deeper models to obtain better predictions than from either model alone.
1 Introduction

One fundamental and intuitive finding in experimental psycholinguistics is that humans judge the plausibility of a verb-argument pair vastly differently depending on the semantic relation in the pair. Table 1 lists example human judgements which McRae et al. (1998) elicited by asking about the plausibility of, e.g., a hunter shooting (relation agent) or being shot (relation patient). McRae et al. found that "hunter" is judged to be a very plausible agent of "shoot" and an implausible patient, while the reverse is true for "deer". In linguistics, this phenomenon is explained by selectional preferences on verbs' argument positions; we use plausibility and fit with selectional preferences interchangeably.
[Table 1: Relation / Plausibility]
In this paper, we consider computational models that predict human plausibility ratings, that is, the fit between a selectional preference and an argument, for such (verb, relation, argument) triples, (v, r, a) for short. Being able to model this type of data is relevant in a number of ways.
From the point of view of psycholinguistics, selectional preferences have an important effect in human sentence processing (e.g., McRae et al. (1998), Trueswell et al. (1994)), and models of selectional preferences are therefore necessary to inform models of this process (Pado et al., 2006). In computational linguistics, a multitude of tasks is sensitive to selectional preferences, such as the resolution of ambiguous attachments (Hindle and Rooth, 1993), word sense disambiguation (McCarthy and Carroll, 2003), semantic role labelling (Gildea and Jurafsky, 2002), or testing the applicability of inference rules (Pantel et al., 2007).
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 400-409, Prague, June 2007. © 2007 Association for Computational Linguistics
Existing resource-based models generally share two problems: (a) limited coverage; and (b) the resource (at least partially) predetermines the generalisations that they can make.
In this paper, we investigate whether it is possible to predict the plausibility of (v, r, a) triples in a completely corpus-driven way. We build on a recent selectional preference model (Erk, 2007) that bases its generalisations on word similarity in a vector space. While that model relies on corpora with semantic role annotation, we show that it is possible to predict plausibility ratings solely on the basis of a parsed corpus, by using shallow cues and a suitable vector space specification.

For evaluation, we use two balanced data sets of human plausibility judgements, i.e., datasets where each verb is paired both with a good agent and a good patient, and where both nouns are presented in either semantic relation (as in Table 1). Using balanced test data is a particularly difficult task, since it forces the models to account reliably both for the influence of the semantic relation (agent/patient) and of the argument head ("hunter"/"deer").

We obtain three main results: (a) our model is able to match the superior performance of the model proposed by Pado et al. (2006), while retaining the high coverage of the model proposed by Resnik (1996); (b) using parsing as a preprocessing step improves the model's performance significantly; and (c) a combination of our model with the Pado model exceeds both individual models in accuracy.

Plan of the paper. In Section 2, we give an overview of existing selectional preferences and vector space models. Section 3 introduces our model and discusses its parameters. Sections 4 and 5 present our experimental setup and results. Section 6 concludes.
2 Related Work

Modelling Selectional Preferences with Grammatical Functions. The idea of inducing selectional preferences from corpora was introduced by Resnik (1996). He approximated the semantic verb-argument relations in (v, r, a) triples by grammatical functions, which are readily available for large training corpora. His basic two-step procedure was followed by all later approaches: (1) extract argument headwords for a given predicate and relation from a corpus; (2) generalise to other, similar words using the WordNet noun hierarchy.
Other models also relying on the WordNet resource include Abe and Li (1996) and Clark and Weir (2002). We present Resnik's model in some detail, since we will use it for comparison below.
Resnik first computes the overall selectional preference strength for each verb-relation pair, i.e., the degree of "constrainedness" of each relation. This quantity is estimated as the difference (in terms of the Kullback-Leibler divergence D) between the distribution over WordNet argument classes given the relation, p(c|r), and the distribution of argument classes given the current verb-relation combination, p(c|v, r). The intuition is that a verb-relation pair that only allows for a limited range of argument heads will have a probability distribution over argument classes that strongly diverges from the prior distribution.
Next, the selectional association of the triple, A(v, r, c), is computed as the ratio of the selectional preference strength for this particular class, divided by the overall selectional preference strength of the verb-relation pair. This is shown in Equation 1.
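The quantities just described can be written out as follows; this is a standard reconstruction of Resnik's (1996) definitions from the surrounding description, not a reproduction of the typeset Equation 1:

```latex
% Selectional preference strength of a verb-relation pair (KL divergence):
S(v, r) = D\left( p(c \mid v, r) \,\middle\|\, p(c \mid r) \right)
        = \sum_{c} p(c \mid v, r) \, \log \frac{p(c \mid v, r)}{p(c \mid r)}

% Selectional association of a class c with the pair (Equation 1):
A(v, r, c) = \frac{ p(c \mid v, r) \, \log \dfrac{p(c \mid v, r)}{p(c \mid r)} }{ S(v, r) }
```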
Finally, the selectional preference between a verb, a relation, and an argument head is taken to be the selectional association of the verb and relation with the most strongly associated WordNet ancestor class of the argument.

WordNet-based approaches however face two problems. One is a coverage problem due to the limited size of the resource (see the task-based evaluation in Gildea and Jurafsky (2002)). The other is that the shape of the WordNet hierarchy determines the generalisations that the models make. These are not always intuitive. For example, Resnik (1996) observes that (answer, obj, tragedy) receives a high preference because "tragedy" in WordNet is a type of written communication, which is a preferred argument class of "answer".

Rooth et al. (1999) present a fundamentally different approach to selectional preference induction which uses soft clustering to form classes for generalisation and does not take recourse to any hand-crafted resource. We will argue in Section 6 that our model allows more control over the generalisations made.
Modelling Selectional Preferences with Thematic Roles. Pado et al. (2006) present a deeper model for the plausibility of (v, r, a) triples that approximates the relations with thematic roles. It estimates the selectional preferences of a verb-role pair with a generative probability model that equates the plausibility of a (v, r, a) triple with the joint probability of seeing the thematic role with the verb-argument pair. In addition, the model also considers the verb's sense s and the grammatical function gf of the argument; however, since the model is generative, it can make predictions even when not all variables are instantiated. The final model is shown in Equation 2.
The induction of this model from the FrameNet corpus of semantically annotated training data (Fillmore et al., 2003) encounters a serious sparse data problem, which is approached by the application of word-class-based and Good-Turing re-estimation smoothing. The resulting model's plausibility predictions are significantly correlated with human judgements, but because of the use of verb-specific thematic roles, the model's coverage is still restricted by the verb coverage of the training corpus.

Vector Space Models. Another class of models that has found wide application in lexical semantics is the family of vector space models. In a vector space model, each target word is represented as a vector, typically constructed from co-occurrence counts with context words in a large corpus (the so-called basis elements). The underlying assumption is that words with similar meanings occur in similar contexts, and will be assigned similar vectors. Thus, the distance between the vectors of two target words, as given by some distance measure (e.g., Cosine or Jaccard), is a measure of their semantic similarity.

Vector space models are simple to construct, and the semantic similarity they provide has found a wide range of applications. Examples in NLP include information retrieval (Salton et al., 1975), automatic thesaurus extraction (Grefenstette, 1994), and predominant sense identification (McCarthy et al., 2004). In cognitive science, they have been used to account for the influence of context on human lexical processing (McDonald and Brew, 2004), and to model lexical priming (Lowe and McDonald, 2000).

A drawback of vector space models is the difficulty of interpreting what some degree of "generic semantic similarity" between two target words means in linguistic terms. In particular, this similarity is not sensitive to selectional preferences over specific semantic relations, and thus cannot model the plausibility data we are interested in. The next section demonstrates how the integration of ideas from selectional preference induction makes this distinction possible.

3 The Vector Similarity Model: Corpus-Based Modelling of Plausibility
Our model builds on the architecture of Erk (2007). It combines the idea underlying the selectional preference models from Section 2, namely to predict plausibility by generalising over head words, with vector space similarity.

The fundamental idea of our model is to model the plausibility of the triple (v, r, a) by comparing the argument head a to other headwords a' which we have already seen in a corpus for the same verb-relation pair (v, r), and which we therefore assume to be plausible. We write Seen_r(v) for the set of seen headwords. Our intuition is that if a is similar to the words in Seen_r(v), then the triple (v, r, a) is plausible; conversely, if it is very dissimilar, then the triple is implausible.
Concretely, we judge the plausibility of the triple by averaging over the similarity of the vector for a to all vectors for the seen headwords in Seen_r(v):

\[ \mathit{Plausibility}(v, r, a) = \frac{\sum_{a' \in \mathit{Seen}_r(v)} w(a') \cdot \mathit{sim}(a, a')}{\sum_{a' \in \mathit{Seen}_r(v)} w(a')} \]

where w is a weight factor specific to each a'. w can be used to implement different weighting schemes that encode prior knowledge, e.g., about the reliability of different words in Seen_r(v). In this paper, we only consider a very simple weighting factor, namely the frequency of the seen headwords, w(a') = f(a'). This encodes the assumption that similarity to frequent head words is more important than similarity to infrequent ones.
Figure 1: A vector space for estimating the plausibilities of (shoot, agent, hunter) and (shoot, patient, hunter).
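A minimal sketch of this scoring scheme may make the computation concrete; the function names, toy vectors, and frequencies below are our own illustration, not the paper's actual data:

```python
# Sketch of the vector-similarity plausibility model described above.
# Vectors are sparse dicts over (context word, relation) basis elements;
# all counts here are invented toy numbers.
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def plausibility(a, seen, vectors, freq):
    """Frequency-weighted average similarity of argument `a` to the
    seen headwords of a (verb, relation) pair."""
    total = sum(freq[w] for w in seen)
    return sum(freq[w] * cosine(vectors[a], vectors[w]) for w in seen) / total

vectors = {
    "hunter":  {("shoot", "subj"): 4, ("rifle", "obj"): 2},
    "poacher": {("shoot", "subj"): 3, ("rifle", "obj"): 1},
    "deer":    {("shoot", "obj"): 5, ("graze", "subj"): 2},
}
freq = Counter({"poacher": 10, "deer": 20})

agents = ["poacher"]    # toy Seen_agent(shoot)
patients = ["deer"]     # toy Seen_patient(shoot)
agent_score = plausibility("hunter", agents, vectors, freq)
patient_score = plausibility("hunter", patients, vectors, freq)
assert agent_score > patient_score
```

With these toy counts, "hunter" comes out far more similar to the seen agents of "shoot" than to its seen patients, mirroring the situation sketched in Figure 1.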
This model can be seen as a straightforward implementation of the selectional preference induction process of generalising from seen headwords to other, similar words. By using vector space representations to judge the similarity of words, we obtain a completely corpus-driven model that does not require any additional resources and is very flexible.

A complementary view on this model is as a generalisation of traditional vector space models that computes similarity not between two vectors, but between a vector and a set of other vectors. By using the vectors for seen headwords of a given relation as this set, the similarity we compute is specific to this relation.

Example. Figure 1 shows an example vector space. Consider v = "shoot", r = agent, and a = "hunter". In order to judge whether a hunter is a plausible agent of "shoot", the vector space representation of "hunter" is compared to all representations of known agents of "shoot", namely "poacher" and "director". Due to the nearness of the vector for "hunter" to these two vectors, "hunter" will be judged a fairly good agent of "shoot". Compare this with the result for the role patient: "hunter" is further away from "lion" and "deer", and will therefore be found to be a rather bad patient of "shoot". However, "hunter" is still more plausible as a patient of "shoot" than, e.g., "director".
3.2 Instantiating the Model: Unparsed vs. Parsed Corpora

The two major tasks which need to be addressed to obtain an instance of this model are (a) determining the sets of seen head words Seen_r(v), and (b) the construction of a vector space. Erk (2007) extracted the set of seen head words from corpora with semantic role annotation, and used only a single vector space representation. In this paper, we eliminate the reliance on special annotation by considering shallow approximations of the semantic relations in question. In addition, we discuss in detail which properties of the vector space are crucial for the prediction of plausibility ratings, a much more fine-grained task than the pseudo-word disambiguation task presented in Erk (2007), which is more closely related to semantic role labelling. The goal of our exposition is thus to develop a model that can use more training data, and represent the corpus information optimally in order to obtain superior coverage.

In fact, tasks (a) and (b) can be solved on the basis of unparsed corpora, but we would expect the results to be rather noisy. Fortunately, the state of the art in broad-coverage (Lin, 1993) and unsupervised (Klein and Manning, 2004) dependency parsing allows us to treat dependency parsing merely as a preprocessing step. We therefore describe two instantiations of our model: one based on an unprocessed corpus, and one based on a dependency-parsed corpus. By comparing the models, we can gauge whether syntactic preprocessing improves model performance. In the following, we describe the strategies the two models adopt for (a) and (b).
Identifying seen head words for relations. Recall that the set Seen_r(v) is supposed to contain known head words a that are observed in the corpus as triples (v, r, a). In a parsed corpus, we can approximate the relation agent by the dependency relation of subject provided by the parser, and the relation patient by the dependency relation of object. In an unparsed corpus, these grammatical relations are unavailable, and the only straightforward evidence we can use is word order. In this case, we assume that words directly adjacent to the left of a predicate are subjects, and therefore agents, whereas words directly to its right are objects, and thus patients.
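As a rough illustration, the word-order heuristic can be sketched as follows (our own toy code; the actual unparsed model described in Section 4 additionally considers a small window rather than only the directly adjacent word):

```python
# Word-order heuristic for unparsed text: the word directly left of the
# verb is taken as subject/agent, the word directly right as
# object/patient. Function name is our own.
def extract_seen(tokens, verb):
    agents, patients = set(), set()
    for i, tok in enumerate(tokens):
        if tok == verb:
            if i > 0:
                agents.add(tokens[i - 1])
            if i + 1 < len(tokens):
                patients.add(tokens[i + 1])
    return agents, patients

tokens = "the hunter shot the deer".split()
agents, patients = extract_seen(tokens, "shot")
assert agents == {"hunter"} and patients == {"the"}
# Note the noise: "the" is wrongly extracted as a patient, which is
# exactly the kind of error that syntactic preprocessing avoids.
```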
Vector space topology. The success of our method depends directly on the topology of the vector space. More specifically, two words should only be assigned similar vectors if they are in fact of similar plausibility. If this is not the case, there is no guarantee that a word a that is similar to the words in Seen_r(v) forms a plausible triple (v, r, a) itself (cf. Figure 1).

Figure 2: Two vector spaces, using as basis elements either context words (above) or words paired with grammatical functions (below).

The topology, in turn, is related to the choice of basis elements. Traditional vector space models use context words as basis elements of the space. The top table in Figure 2 illustrates our intuition that such spaces are problematic: "deer" and "hunter" receive identical vectors, even though they show complementary plausibility ratings (cf. Table 1). The reason is that "deer" and "hunter" often co-occur quite closely to one another (e.g., in the vicinity of "shoot"), and thus show a very similar profile in terms of context words. In preliminary experiments, we found that vector spaces with context words as basis elements are in fact unable to distinguish such word pairs reliably.

In contrast, the bottom table in Figure 2 indicates that this problem can be alleviated by using context words combined with the grammatical relation to the target word as basis elements. Target words now receive different representations, depending on the grammatical function in which they occur with context words. In consequence, resulting spaces can distinguish, for example, between "hunter" and "deer". We adopt word-function pairs as basis elements for the vector spaces in all our models. In a dependency-parsed corpus, the basis elements can be directly read off the syntactic structure. In an unparsed corpus, we again fall back on word order, appending to each context word its relative position to the target word.
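The contrast between the two basis-element choices can be made concrete with a toy example (counts invented for illustration, in the spirit of Figure 2):

```python
# With plain context words as basis elements, "hunter" and "deer" get
# identical vectors; pairing each context word with its grammatical
# relation separates them. All counts are invented.
word_basis = {
    "hunter": {"shoot": 5},
    "deer":   {"shoot": 5},
}
typed_basis = {
    "hunter": {("shoot", "subj"): 5},
    "deer":   {("shoot", "obj"): 5},
}
assert word_basis["hunter"] == word_basis["deer"]    # indistinguishable
assert typed_basis["hunter"] != typed_basis["deer"]  # distinguishable
```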
4 Experimental Setup

Experimental Materials. In order to make our evaluation comparable to the earlier modelling study by Pado et al. (2006), we present evaluations on the two plausibility judgement datasets used there.1 The first dataset consists of 100 data points from McRae et al. (1998). Our example in Table 1, which is taken from this dataset, demonstrates its balanced structure: 25 verbs are paired with two arguments and two relations each, such that each argument is highly plausible in one relation, but implausible in the other. The resulting distribution of ratings is thus highly bimodal. Models can only reliably predict the human ratings in this data set if they can capture the difference between verb argument slots as well as between individual fillers.
The second, larger dataset is less strictly balanced, since its triples are constructed on the basis of corpus co-occurrences (Pado et al., 2006). 18 verbs are combined with the three most frequent subjects and objects from both the Penn Treebank and the FrameNet corpus. Each verb-argument pair was rated both as an agent and as a patient, which leads to a total of 24 rated triples per verb. The dataset contains ratings for a total of 414 triples, due to overlap between corpora. The resulting judgements show a more even distribution of ratings than the McRae data.
Vector Similarity Models. Following our exposition in the last section, we construct two instantiations of our vector similarity model, one using unparsed and one parsed data. Both are trained on the complete British National Corpus (Burnard, 1995; BNC) with more than six million sentences.
The unparsed model (Unparsed) uses the BNC without any pre-processing. We first construct the set of known headwords, Seen_r(v), as follows: All words up to 2 words to the left of instances of v are assumed to be subjects, and thus agents; vice versa for patients to the right. Then, we construct semantic space representations for the experimental arguments and known headwords, adopting optimal parameter settings from the literature (Pado and Lapata, 2007). This means a context window of 5 words to either side and 2,000 basis elements (dimensions), which are formed by the most frequent 1,000 words in the BNC, combined with each of the relations agent and patient. All counts are log-likelihood transformed (Lowe, 2001).

1 We are grateful to Ken McRae for his dataset.
To construct the parsed model (Parsed), we dependency-parsed the BNC with Minipar (Lin, 1993). We first obtain the seen headwords Seen_r(v) by using all subjects and objects of v as agents and patients, respectively. We then construct a vector space for the experimental arguments and known headwords.2 We use 2,000 dimensions again, but adopt the most frequent (head, grammatical function) pairs in the BNC as basis elements. The context window is formed by subject and object dependencies. All counts are log-likelihood transformed.

We experiment with two distance measures to compute vector similarity, namely the Jaccard Coefficient and Cosine Distance, both of which have been shown to yield good performance in NLP tasks (Lee, 1999; McDonald and Lowe, 1998).
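To make the two measures concrete, here is a small self-contained sketch on sparse count vectors; the generalised (weighted) form of the Jaccard coefficient shown is the one commonly used in distributional semantics, and the vectors are invented:

```python
# Cosine vs. generalised Jaccard on sparse count vectors (dicts).
# Toy vectors over word-function basis elements.
import math

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def jaccard(u, v):
    keys = u.keys() | v.keys()
    num = sum(min(u.get(k, 0), v.get(k, 0)) for k in keys)
    den = sum(max(u.get(k, 0), v.get(k, 0)) for k in keys)
    return num / den if den else 0.0

u = {"shoot-subj": 4, "rifle-obj": 2}
v = {"shoot-subj": 2, "rifle-obj": 1}
# Same direction: cosine is 1, but Jaccard is sensitive to magnitude.
assert abs(cosine(u, v) - 1.0) < 1e-9
assert jaccard(u, v) == 0.5  # (2 + 1) / (4 + 2)
```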
Evaluation Procedure. We evaluate our models by correlating the predicted plausibility values with the human judgements, which range between 1 and 7. Since the human judgement data is not normally distributed, we use Spearman's ρ, a non-parametric rank-order test. We determine the statistical significance of differences in correlation strength using the method described in Raghunathan (2003). This method can deal with missing values and thus allows us to compare models with different coverage.
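For illustration, on untied data Spearman's ρ reduces to the familiar rank-difference formula; the following sketch (our own code, with made-up ratings) shows the computation:

```python
# Spearman's rho via the rank-difference formula (assumes no ties).
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

human = [6.5, 1.2, 5.8, 2.0, 6.9]       # elicited judgements (1-7 scale)
model = [0.80, 0.10, 0.60, 0.30, 0.95]  # made-up model plausibility scores
assert spearman_rho(human, model) == 1.0  # identical rankings
```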
It is difficult to specify a straightforward baseline for our correlation-based evaluation. In contrast to classification tasks, where models choose one out of a fixed number of classes, our model predicts continuous data. This task is more difficult to approximate, e.g., using frequency information. With respect to upper bounds, we hold that automatic models of plausibility cannot be expected to surpass the typical agreement on the plausibility judgement task between human participants. Thus, we assume an upper bound of ρ ≈ 0.7.
Comparison against Other Models. We compare our performance to two models from the literature discussed in Section 2. The first model (Pado) is the thematic role-based model by Pado et al. (2006) trained on the FrameNet (Fillmore et al., 2003) release 1.2 example sentences, a subset of the BNC annotated with semantic roles.

2 This space was computed using the DependencyVectors software described in Pado and Lapata (2007). This software can be downloaded from http://www.coli.uni-saarland.de/~pado/dv.html.
This corpus contains about 57,000 sentences, which corresponds to roughly 1% of the BNC data. The second model (Resnik) is the WordNet-based selectional preference model by Resnik (1996), trained on the dependency-parsed BNC (see above).

5 Experimental Evaluation

The McRae Dataset. Table 2 summarises our results on the McRae dataset. The upper part shows the results for our two vector similarity models (Parsed/Unparsed), combined with the two distance measures (Cosine/Jaccard). The lower part shows the two resource-based models we use for comparison. We find that all vector similarity models exhibit high coverage (above 90%), and one model (Parsed Cosine) can predict human judgements with a significant correlation. The instantiation of the model has a significant impact on the performance: The Parsed models clearly outperform the Unparsed models. The effect of the distance measure is less clear-cut, since the Unparsed models perform better with Jaccard, while the Parsed models prefer Cosine.

The deep semantic plausibility model (Pado) makes predictions only for slightly more than half of the data. This low coverage is a direct result of the small overlap in verbs between the McRae dataset and the FrameNet corpus. However, on the data points it covers, it achieves a significant correlation to human judgements. The correlation coefficient is numerically much higher than that of the Parsed Cosine model, but due to the large coverage difference, the two models are not statistically distinguishable.
[Table 2: Coverage / Spearman's ρ by model]
Resnik's WordNet-based model shows a coverage that is comparable to the vector similarity models, but does not achieve a significant correlation to the human judgements.

The Pado Dataset. Table 3 summarises the results for the Pado dataset. Since all verbs in this dataset are covered in FrameNet, the deep Pado model shows a coverage comparable to all other models, at >95%. The main difference to the McRae dataset lies in the models' performance. We find that all models, including the Unparsed vector models and Resnik, manage to achieve significant correlations with the human judgements. Within the vector similarity models, the same trends hold as for the McRae dataset: Parsed outperforms Unparsed, and the best combination is Parsed Cosine.

The models fall into two clearly separated groups: The Pado and Parsed Cosine models achieve a highly significant correlation, and are statistically indistinguishable. They significantly outperform the second group (p < 0.001), formed by all other models. Within this second group, Resnik is numerically the best model and shows a significant correlation with human data; nevertheless, the difference to the first group is evident from its substantially lower correlation coefficient.

The construction of the Pado dataset allows a further analysis. As mentioned in Section 4, the dataset consists of verb-argument pairs drawn from two different corpora. Therefore, each verb is combined both with some arguments that are seen in FrameNet, and some that are not. Our hypothesis is that the FrameNet-trained Pado model performs considerably better on the 216 "FN-Seen" data points (verb-argument pairs observed in FrameNet in at least one relation) than on the 198 "FN-Unseen" data points (verb-argument pairs unseen in both relations).

Table 4 shows the results of this analysis for the best-performing models. We observe a pattern corresponding to our expectations: The performance of the Pado model is clearly worse for FN-Unseen than for FN-Seen, while the Resnik and Parsed Cosine models perform more evenly across both datasets. While the Pado model is significantly better on the FN-Seen dataset, it is numerically outperformed by the Parsed Cosine model for the FN-Unseen data points. We conclude that the deep model is more accurate within the coverage of its resources, but loses its advantage when it has to resort to smoothing.
Model combination. Our last analysis indicates that the models have complementary strengths: the thematic role-based Pado model is the best plausibility predictor on the data points it has seen, while the Parsed Cosine model overall predicts human data only numerically worse, and with better coverage. We therefore propose combining the predictions of the two models to exploit their respective strengths.
For the moment, we only consider a naive backoff scheme: For each data point, we use the prediction of the Pado model if the data point is "FN-Seen" (cf. the last paragraph), and the prediction of the Parsed Cosine model otherwise. Note that this criterion does not consider the predictions of the models themselves, only properties of the underlying training set.

The actual combination requires a normalisation of the respective predictions, since one of the models (Pado) is probabilistic, while the other one (Parsed Cosine) is similarity-based, and their predictions are not directly comparable. We perform a simple normalisation by z-transforming the complete predictions of each model.3
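A sketch of this scheme (our own code; the predictions and FN-Seen flags are invented for illustration):

```python
# Z-transform each model's predictions, then back off from the Pado
# model to Parsed Cosine on FN-unseen data points.
import statistics

def ztransform(xs):
    mu = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]

pado   = [0.9, 0.2, 0.8, 0.1]          # probabilistic predictions (toy)
parsed = [0.7, 0.3, 0.6, 0.2]          # similarity-based predictions (toy)
fn_seen = [True, True, False, False]   # verb-argument pair seen in FrameNet?

zp, zc = ztransform(pado), ztransform(parsed)
combined = [p if seen else c for p, c, seen in zip(zp, zc, fn_seen)]
assert combined[0] == zp[0] and combined[2] == zc[2]
```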
The combination of the scaled predictions in fact results in an improved correlation with the human data. The correlation coefficient of ρ = 0.552 numerically exceeds either base model, and the coverage of 98% corresponds to the coverage of the more robust Parsed Cosine model. We take this result as evidence that even a simple combination technique can lead to improved predictions. Unfortunately, our naive backoff scheme does not directly carry over to the McRae dataset, where only 2 out of 100 data points are "FN-Seen", and the Pado model would thus hardly contribute.
3 The z transformation scales a dataset to a mean of 0 and a standard deviation of 1.

[Table 4: results on FN-Seen and FN-Unseen data points]
Discussion. We have verified experimentally that our vector similarity model is able to match the performance of a deep plausibility model, exceeding it in coverage, and to outperform a WordNet-based selectional preference model. We conclude that a completely corpus-driven approach constitutes a viable alternative to resource-based models.

One insight from our experiments is that vector similarity models constructed from dependency-parsed corpora perform significantly better than unparsed models. This indicates that dependency relations like subject and object are reliable syntactic correlates of semantic relations like agent and patient, but that their approximation in terms of word order introduces considerable noise. The Parsed models are best combined with Cosine Distance. We surmise that Cosine, which tends to consider low-frequency words more than Jaccard, is more susceptible to the additional noise in unparsed corpora.

Furthermore, the choice of basis elements for the vector space is vital: Plausibilities could only be predicted successfully with word-relation pairs as basis elements. This is in contrast to recent results on predominant sense acquisition, the task of identifying the most frequent sense for a given word in an unsupervised manner (McCarthy et al., 2004).
On that task, Pado and Lapata (2007) found that vector spaces with words as basis elements are in fact competitive with models using word-relation pairs. This divergence underlines an interesting difference between the two tasks. Evidently, predominant sense identification, as a WSD-related task, can succeed on the basis of topical information, which is represented well in word-based spaces. In contrast, plausibility judgments can only be predicted by a space based on word-relation pairs, which can represent the finer-grained distinctions arising from different relations between verb and noun.
A second important finding is that the relative performance of the different models is the same on the McRae and Pado datasets. The Pado model performs best, followed by our Parsed Cosine vector similarity model, followed by the Unparsed and Resnik models. The McRae dataset, however, is much more difficult to account for than the Pado data, independent of the model. This effect was already noted by Pado et al. (2006), who attributed it to the very limited overlap between the McRae dataset and FrameNet. While this explanation can account for the difference for the Pado model, we observe the same pattern across all models. This suggests that a more general frequency effect is at work here: The median frequency of the hand-selected McRae nouns is 1,356 in the BNC, as opposed to 8,184 for the corpus-derived Pado nouns. The resulting sparseness affects all model families, since all ultimately rely on co-occurrences.

The performance difference between the two datasets is particularly large for the WordNet-based selectional preference model (Resnik). A further analysis of the model's predictions shows that the model has difficulty in distinguishing between verb-relation-argument triples that differ only in the argument, such as (shoot, agent, hunter) and (shoot, agent, deer). Recall that it is crucial for the prediction of the McRae data to make this distinction, since the arguments for each relation are chosen to differ widely in plausibility. The reason for the Resnik model's difficulty is that arguments are mapped onto WordNet synsets, and whenever two arguments are mapped onto closely related synsets, their plausibility ratings are similar. This problem is graver for the McRae test set, where all arguments are animates, and thus more similar in terms of WordNet, than for the Pado set, which also contains a portion of inanimate arguments with animate counterparts. This analysis highlights again the fundamental problem of resource-based models, where design decisions of the underlying resource may limit, or even mislead, the models' generalisations.
Finally, we have shown in a first experiment that the syntax-based vector similarity model can be combined with the role-based model to obtain a combined model that outperforms both. In this combined model, the shallow model's better coverage supplements the accurate predictions of the deep model.
6 Conclusions

In this paper, we have considered the computational modelling of human plausibility judgements for verb-relation-argument triples, a task equivalent to the computation of selectional preferences.
We have extended a recent proposal (Erk, 2007) which combines ideas from selectional preference induction and vector space models. Our model can be constructed from a large corpus with partial syntactic information (specifically, subject and object relations), from which it builds an optimally informative vector space. We have demonstrated that the successful evaluation of the model in Erk (2007) on the coarse-grained pseudo-word disambiguation task carries over to the prediction of human plausibility judgments, which requires relatively fine-grained, relation-based distinctions. Our model is competitive with existing "deep" models while exhibiting a higher coverage. We have also shown that our vector similarity model can be combined with a "deep" model so that the combined model outperforms both base models. A thorough investigation of strategies for prediction combination and scaling remains future work.

The strategy of our model to derive generalisations directly from corpus data, without recourse to resources, is similar to another family of corpus-driven selectional preference models, namely EM-based clustering models (Rooth et al., 1999). However, we believe that our model has a number of advantages.
(1) It is conceptually simple and implements the intuition behind selectional preference models, "generalise from known headwords to unknown ones", in a particularly direct manner, through the comparison of new headwords to known ones according to a given definition of similarity.
(2) The separation of the similarity computation and the acquisition of seen headwords gives the experimenter fine-grained control over the types and sources of information which inform the construction of the model. (3) The instantiation of the similarity computation with a vector space makes it possible to integrate additional linguistic information beyond verb-argument co-occurrences into the model, building on a large body of work in vector space construction. In sum, our modular model provides a higher degree of control than one-step models like the EM-based proposal.

An important avenue of further research is the ability of the vector plausibility model to model finer-grained distinctions between semantic relations beyond the agent/patient dichotomy, as thematic role-based models are able to. Excluding the direct use of role-annotated corpora like FrameNet for coverage reasons, the most promising strategy is to extend our present scheme of approximating semantic relations by grammatical realisations. How much noise this approximation introduces when finer role sets are used is an open research question.
Acknowledgments. The work presented in this paper was supported by the DFG (grants Pi-154/9-2 and IRTG "Language Technology and Cognitive Systems").