A
parsing
system
returning
analyses
in
the
form
of
sets
of
grammatical
relations
can
obtain
high
precision
if
it
hypothesises
a
particular
relation
only
when
it
is
certain
that
the
relation
is
correct
.
We
operationalise
this
technique
—
in
a
statistical
parser
using
a
manually-developed
wide-coverage
grammar
of
English
—
by
only
returning
relations
that
form
part
of
all
analyses
licensed
by
the
grammar
.
We
observe
an
increase
in
precision
from
75
%
to
over
90
%
(
at
the
cost
of
a
reduction
in
recall
)
on
a
test
corpus
of
naturally-occurring
text
.
1
Introduction
1
Head-dependent
relationships
(
possibly
labelled
with
a
relation
type
)
have
been
advocated
as
a
useful
level
of
representation
for
grammatical
structure
in
a
number
of
different
large-scale
language-processing
tasks
.
For
instance
,
in
recent
work
on
statistical
treebank
grammar
parsing
(
e.g.
Collins
,
1999
)
high
levels
of
accuracy
have
been
reached
using
lexicalised
probabilistic
models
over
head-dependent
tuples
.
Bouma
,
van
Noord
and
Malouf
(
2001
)
create
dependency
treebanks
semi-automatically
in
order
to
induce
dependency-based
statistical
models
for
parse
selection
.
Lin
(
1998
)
,
Srinivas
(
2000
)
and
others
have
evaluated
the
accuracy
of
both
phrase
structure-based
and
dependency
parsers
by
matching
head-dependent
relations
against
'
gold
standard
'
relations
,
rather
than
matching
(
labelled
)
phrase
structure
bracketings
.
Research
on
unsupervised
acquisition
of
lexical
information
from
corpora
,
such
as
argument
structure
of
predicates
(
Briscoe
and
Carroll
,
1997
;
McCarthy
,
1
A
previous
version
of
this
paper
was
presented
at
IWPT'01
;
this
version
contains
new
experiments
and
results
.
tuples
also
constitute
a
convenient
intermediate
representation
in
applications
such
as
information
extraction
(
Palmer
et
al.
,
1993
;
Yeh
,
2000
)
,
and
document
retrieval
on
the
Web
(
Grefenstette
,
1997
)
.
A
variety
of
different
approaches
have
been
taken
for
robust
extraction
of
relation
/
head
/
dependent
tuples
,
or
grammatical
relations
,
from
unrestricted
text
.
Dependency
parsing
is
a
natural
technique
to
use
,
and
there
has
been
some
work
in
that
area
on
robust
analysis
and
disambiguation
(
e.g.
Lafferty
,
Sleator
and
Temperley
,
1992
;
Srinivas
,
2000
)
.
Finite-state
approaches
(
e.g.
Karlsson
et
al.
,
1995
;
Ait-Mokhtar
and
Chanod
,
1997
;
Grefenstette
,
1998
)
have
used
hand-coded
transducers
to
recognise
linear
configurations
of
words
and
part
of
speech
labels
associated
with
,
for
example
,
subject
/
object-verb
relationships
.
An
intermediate
step
may
be
to
mark
nominal
,
verbal
etc.
'
chunks
'
in
the
text
and
to
identify
the
head
word
of
each
of
the
chunks
.
Statistical
finite-state
approaches
have
also
been
used
:
Brants
,
Skut
and
Krenn
(
1997
)
train
a
cascade
of
Hidden
Markov
Models
to
tag
words
with
their
grammatical
functions
.
Approaches
based
on
memory-based
learning
have
also
used
chunking
as
a
first
stage
,
before
assigning
grammatical
relation
labels
to
heads
of
chunks
(
Argamon
,
Dagan
and
Krymolowski
,
1998
;
Buchholz
,
Veenstra
and
Daelemans
,
1999
)
.
Blaheta
and
Charniak
(
2000
)
assume
a
richer
input
representation
consisting
of
labelled
trees
produced
by
a
treebank
grammar
parser
,
and
use
the
treebank
again
to
train
a
further
procedure
that
assigns
grammatical
function
tags
to
syntactic
constituents
in
the
trees
.
Alternatively
,
a
handwritten
grammar
can
be
used
that
produces
'
shallow
'
and
perhaps
partial
phrase
structure
analyses
from
which
grammatical
relations
are
extracted
(
e.g.
Carroll
,
Minnen
and
Briscoe
,
1998
;
Lin
,
1998
)
.
Recently
,
Schmid
and
Rooth
(
2001
)
have
described
an
algorithm
for
computing
expected
governor
labels
for
terminal
words
in
labelled
headed
parse
trees
produced
by
a
probabilistic
context-free
grammar
.
A
governor
label
(
implicitly
)
encodes
a
grammatical
relation
type
(
such
as
subject
or
object
)
and
a
governing
lexical
head
.
The
labels
are
expected
in
the
sense
that
each
is
weighted
by
the
sum
of
the
probabilities
of
the
trees
giving
rise
to
it
,
and
are
computed
efficiently
by
processing
the
entire
parse
forest
rather
than
individual
trees
.
The
set
of
terminal
/
relation
/
governing-head
tuples
will
not
typically
constitute
a
globally
coherent
analysis
,
but
may
be
useful
for
interfacing
to
applications
that
primarily
accumulate
fragments
of
grammatical
information
from
text
(
such
as
for
instance
information
extraction
,
or
systems
that
acquire
lexical
data
from
corpora
)
.
The
approach
is
not
so
suitable
for
applications
that
need
to
interpret
complete
and
consistent
sentence
structures
(
such
as
the
analysis
phase
of
transfer-based
machine
translation
)
.
Schmid
and
Rooth
have
implemented
the
algorithm
for
parsing
with
a
lexicalised
probabilistic
context-free
grammar
of
English
and
applied
it
in
an
open
domain
question
answering
system
,
but
they
do
not
give
any
practical
results
or
an
evaluation
.
In
this
paper
we
empirically
investigate
Schmid
and
Rooth
's
proposals
,
using
a
wide-coverage
parsing
system
applied
to
a
test
corpus
of
naturally-occurring
text
,
extend
them
with
various
thresholding
techniques
,
and
observe
the
trade-off
between
precision
and
recall
in
grammatical
relations
returned
.
Using
the
most
conservative
threshold
results
in
a
parser
that
returns
only
grammatical
relations
that
form
part
of
all
analyses
licensed
by
the
grammar
.
In
this
case
,
precision
rises
to
over
90
%
,
as
compared
with
a
baseline
of
75
%
.
2
The
Analysis
System
In
this
investigation
we
extend
a
statistical
shallow
parsing
system
for
English
developed
originally
by
Carroll
,
Minnen
and
Briscoe
(
1998
)
.
Briefly
,
the
system
works
as
follows
:
input
text
is
labelled
with
part-of-speech
(
PoS
)
tags
by
a
tagger
,
and
these
are
parsed
using
a
wide-coverage
unification-based
'
phrasal
'
grammar
of
English
PoS
tags
and
punctuation
.
For
disambiguation
,
the
parser
uses
a
probabilistic
LR
model
derived
from
parse
tree
structures
in
a
treebank
,
augmented
with
a
set
of
lexical
entries
for
verbs
,
acquired
automatically
from
a
10
million
word
sample
of
the
British
National
Corpus
(
Leech
,
1992
)
,
each
entry
containing
subcategorisation
frame
information
and
an
associated
probability
.
The
parser
is
therefore
'
semi-lexicalised
'
in
that
verbal
argument
structure
is
disambiguated
lexically
,
but
the
rest
of
the
disambiguation
is
purely
structural
.
The
coverage
of
the
grammar
—
the
proportion
of
sentences
for
which
at
least
one
complete
spanning
analysis
is
found
—
is
around
80
%
when
applied
to
the
SUSANNE
corpus
(
Sampson
,
1995
)
.
In
addition
,
the
system
is
able
to
perform
parse
failure
recovery
,
finding
the
highest
scoring
sequence
of
phrasal
fragments
(
following
the
approach
of
Kiefer
et
al.
,
1999
)
,
and
the
system
has
produced
at
least
partial
analyses
for
over
98
%
of
the
sentences
in
the
written
part
of
the
British
National
Corpus
.
The
parsing
system
reads
off
grammatical
relation
tuples
(
GRs
)
from
the
constituent
structure
tree
that
is
returned
from
the
disambiguation
phase
.
Information
is
used
about
which
grammar
rules
introduce
subjects
,
complements
,
and
modifiers
,
and
which
daughter
(
s
)
is
/
are
the
head
(
s
)
,
and
which
the
dependents
.
In
Carroll
et
al.
's
evaluation
the
system
achieves
GR
accuracy
that
is
comparable
to
published
results
for
other
systems
:
extraction
of
non-clausal
subject
relations
with
83
%
precision
,
compared
with
Grefenstette
's
(
1998
)
figure
of
80
%
;
and
overall
F-score
2
of
unlabelled
head-dependent
pairs
of
80
%
,
as
opposed
to
Lin
's
(
1998
)
83
%
3
and
Srinivas
's
(
2000
)
84
%
(
this
with
respect
only
to
binary
relations
,
and
omitting
the
analysis
of
control
relationships
)
.
Blaheta
and
Charniak
(
2000
)
report
an
F-score
of
87
%
for
assigning
grammatical
function
tags
to
constituents
,
but
the
task
,
and
therefore
the
scoring
method
,
is
rather
different
.
For
the
work
reported
in
this
paper
we
have
extended
Carroll
et
al.
's
basic
system
,
implementing
a
version
of
Schmid
and
Rooth
's
expected
governor
technique
(
see
section
1
above
)
but
adapted
for
unification-based
grammar
and
GR-based
analyses
.
Each
sentence
is
analysed
as
a
set
of
weighted
GRs
where
the
weight
associated
with
each
grammatical
relation
is
computed
as
the
sum
of
the
probabilities
of
the
parses
that
relation
was
derived
from
,
divided
by
the
sum
of
the
probabilities
of
all
parses
.
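The weight computation just described can be sketched in a few lines. This is a minimal illustration only, not the system's implementation: the function name `weighted_grs`, the representation of a GR as a `(type, head, dependent)` tuple, and the two parse probabilities are assumptions made for the example.

```python
from collections import defaultdict


def weighted_grs(parses):
    """Return {GR: weight} for one sentence.

    parses: list of (probability, set_of_GRs), one entry per parse.
    The weight of a GR is the summed probability of the parses that
    contain it, divided by the summed probability of all parses.
    """
    total = sum(prob for prob, _ in parses)
    mass = defaultdict(float)
    for prob, grs in parses:
        for gr in grs:  # grs is a set, so a GR counts once per parse
            mass[gr] += prob
    return {gr: m / total for gr, m in mass.items()}


# Hypothetical two-parse sentence; the probabilities are invented.
parses = [
    (0.007, {("ncsubj", "reads", "Peter"), ("ncmod", "paper", "markup")}),
    (0.003, {("ncsubj", "reads", "Peter"), ("ncmod", "reads", "markup")}),
]
weights = weighted_grs(parses)
for gr, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{w:.2f} {gr}")
```

A relation present in every parse receives weight one, whatever the individual parse probabilities are.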
So
,
if
we
assume
that
Schmid
and
Rooth
's
example
sentence
Peter
reads
every
paper
on
markup
has
2
parses
,
one
where
on
markup
attaches
to
the
preceding
noun
having
overall
probability
and
the
other
where
it
has
verbal
attachment
with
probability
,
then
some
of
the
weighted
GRs
would
be
2
We
use
the
$F_1$
measure
,
defined
as
$2 \times \mathit{precision} \times \mathit{recall} / (\mathit{precision} + \mathit{recall})$
.
3
Our
calculation
,
based
on
table
2
of
Lin
(
1998
)
.
Figure
1
contains
a
more
extended
example
of
a
weighted
GR
analysis
for
a
short
sentence
from
the
SUSANNE
corpus
,
and
also
gives
a
flavour
of
the
relation
types
that
the
system
returns
.
The
GR
scheme
is
described
in
detail
by
Carroll
,
Briscoe
and
Sanfilippo
(
1998
)
.
3
Empirical
Results
3.1
Weight
Thresholding
Our
first
experiment
compared
the
accuracy
of
the
parser
when
extracting
GRs
from
the
highest
ranked
analysis
(
the
standard
probabilistic
parsing
setup
)
against
extracting
weighted
GRs
from
all
parses
in
the
forest
.
To
measure
accuracy
we
use
the
precision
,
recall
and
F-score
measures
of
parser
GRs
against
'
gold
standard
'
GR
annotations
in
a
10,000-word
test
corpus
of
in-coverage
sentences
derived
from
the
SUSANNE
corpus
and
covering
a
range
of
written
genres
4
.
GRs
are
in
general
compared
using
an
equality
test
,
except
that
in
a
specific
,
limited
number
of
cases
(
described
by
Carroll
,
Minnen
and
Briscoe
,
1998
)
the
parser
is
allowed
to
return
more
generic
relation
types
.
When
a
parser
GR
has
a
weight
of
less
than
one
,
we
proportionally
discount
its
contribution
to
the
precision
and
recall
scores
.
Thus
,
given
a
set
$T$
of
GRs
with
associated
weights
produced
by
the
parser
,
i.e.
$T = \{(w_i, t_i)\}$
,
and
a
set
$S$
of
gold-standard
(
unweighted
)
GRs
,
we
compute
the
weighted
match
between
$S$
and
the
elements
of
$T$
as
$m = \sum_i w_i \, \delta(t_i \in S)$
where
$\delta(x) = 1$
if
$x$
is
true
and
$0$
otherwise
.
The
weighted
precision
and
recall
are
then
$m / \sum_i w_i$
and
$m / |S|$
respectively
,
expressed
as
percentages
.
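The weighted evaluation measures can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation software itself: the function name `weighted_scores` and the tuple representation of GRs are hypothetical, weighted precision is taken to be the matched weight divided by the total weight returned, and weighted recall the matched weight divided by the number of gold-standard GRs. Only exact matches are counted, so the limited cases in which more generic relation types are accepted are not modelled.

```python
def weighted_scores(parser_grs, gold):
    """Weighted precision, recall and F-score, as percentages.

    parser_grs: {GR: weight} returned by the parser for the test data.
    gold: set of gold-standard (unweighted) GRs.
    """
    # Weighted match: total weight of parser GRs found in the gold set.
    match = sum(w for gr, w in parser_grs.items() if gr in gold)
    total_weight = sum(parser_grs.values())
    precision = 100.0 * match / total_weight if total_weight else 0.0
    recall = 100.0 * match / len(gold) if gold else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return precision, recall, f_score


# Invented example: two of three parser GRs are in the gold standard.
parser_grs = {
    ("ncsubj", "reads", "Peter"): 1.0,
    ("ncmod", "paper", "markup"): 0.7,
    ("ncmod", "reads", "markup"): 0.3,
}
gold = {
    ("ncsubj", "reads", "Peter"),
    ("ncmod", "paper", "markup"),
    ("dobj", "reads", "paper"),
}
precision, recall, f_score = weighted_scores(parser_grs, gold)
print(round(precision, 2), round(recall, 2), round(f_score, 2))
```

A correct GR returned with weight 0.7 contributes only 0.7 to the match, which is why weighted scores over all parses can fall below those for the single best parse.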
We
are
not
aware
of
any
previous
published
work
using
weighted
precision
and
recall
measures
,
although
there
is
an
option
for
associating
weights
with
complete
parses
in
the
distributed
software
implementing
the
PARSEVAL
scheme
(
Harrison
et
al.
,
1991
)
for
evaluating
parser
accuracy
with
respect
to
phrase
structure
bracketings
.
Table
1
:
GR
accuracy
when
extracting
from
just
the
highest-ranked
parse
,
compared
with
weighted
GR
extraction
from
all
parses
(
columns
:
Best
parse
,
All
parses
)
.
The
weighted
measures
make
sense
for
application
tasks
that
can
deal
with
sets
of
mutually-inconsistent
GRs
.
In
this
initial
experiment
,
precision
and
recall
when
extracting
weighted
GRs
from
all
parses
were
both
one
and
a
half
percentage
points
lower
than
when
GRs
were
extracted
from
just
the
highest
ranked
analysis
(
see
table
1
)
5
.
This
decrease
in
accuracy
might
be
expected
,
though
,
given
that
a
true
positive
GR
may
be
returned
with
weight
less
than
one
,
and
so
will
not
receive
full
credit
from
the
weighted
precision
and
recall
measures
.
However
,
these
results
only
tell
part
of
the
story
.
An
application
using
grammatical
relation
analyses
might
be
interested
only
in
GRs
that
the
parser
is
fairly
confident
of
being
correct
.
For
instance
,
in
unsupervised
acquisition
of
lexical
information
(
such
as
subcategorisation
frames
for
verbs
)
from
text
,
the
usual
methodology
is
to
(
partially
)
analyse
the
text
,
retaining
only
reliable
hypotheses
which
are
then
filtered
based
on
the
amount
of
evidence
for
them
over
the
corpus
as
a
whole
.
Thus
,
Brent
(
1993
)
only
creates
hypotheses
on
the
basis
of
instances
of
verb
frames
that
are
reliably
and
unambiguously
cued
by
closed
class
items
(
such
as
pronouns
)
so
there
can
be
no
other
attachment
possibilities
.
In
recent
work
on
unsupervised
learning
of
prepositional
phrase
disambiguation
,
Pantel
and
Lin
(
2000
)
derive
training
instances
only
from
relevant
data
appearing
in
syntactic
contexts
that
are
guaranteed
to
be
unambiguous
.
In
our
system
,
the
weights
on
GRs
indicate
how
certain
the
parser
is
of
the
associated
relations
being
correct
.
We
therefore
investigated
whether
more
highly
weighted
GRs
are
in
fact
more
likely
to
be
correct
than
ones
with
lower
weights
.
4
At
http://www.cogs.susx.ac.uk/lab/nlp/carroll/greval.html
.
5
Ignoring
the
weights
on
GRs
,
standard
(
unweighted
)
evaluation
results
for
all
parses
are
:
precision
36.65
%
,
recall
89.42
%
and
F-score
51.99
.
Figure
1
:
Weighted
GRs
for
the
sentence
Failure
to
do
this
will
continue
to
place
a
disproportionate
burden
on
Fulton
taxpayers
.
Figure
2
:
Weighted
GR
accuracy
as
the
threshold
is
varied
(
horizontal
axis
:
threshold
,
from
0
to
1
)
.
We
did
this
by
setting
a
threshold
on
the
output
,
such
that
any
GR
with
weight
lower
than
the
threshold
is
discarded
.
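The thresholding step can be sketched as a simple filter over the weighted GRs. The function name `apply_threshold`, the example weights and the small tolerance `eps` are assumptions for illustration; the tolerance is there only so that a weight that should be exactly one is not lost to floating-point rounding.

```python
def apply_threshold(weighted, threshold, eps=1e-9):
    """Discard every GR whose weight is lower than the threshold.

    weighted: {GR: weight}. eps is an assumed rounding tolerance.
    """
    return {gr: w for gr, w in weighted.items() if w >= threshold - eps}


# Invented weights for a sentence with one attachment ambiguity.
weighted = {
    ("ncsubj", "reads", "Peter"): 1.0,
    ("ncmod", "paper", "markup"): 0.7,
    ("ncmod", "reads", "markup"): 0.3,
}
for t in (0.0, 0.5, 1.0):
    print(t, sorted(apply_threshold(weighted, t)))
```

At a threshold of zero everything is kept, and at a threshold of one only relations found in every parse survive.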
Figure
2
plots
weighted
recall
and
precision
as
the
threshold
is
varied
between
zero
and
one
.
The
results
are
intriguing
.
Precision
increases
monotonically
from
74.6
%
at
a
threshold
of
zero
(
the
situation
as
in
the
previous
experiment
where
all
GRs
extracted
from
all
parses
in
the
forest
are
returned
)
to
90.4
%
at
a
threshold
of
one
.
(
The
latter
threshold
has
the
effect
of
allowing
only
those
GRs
that
form
part
of
every
single
analysis
to
be
returned
)
.
The
influence
of
the
threshold
on
recall
is
equally
dramatic
,
although
since
we
have
not
escaped
the
usual
trade-off
with
precision
the
results
are
somewhat
less
positive
.
Recall
decreases
from
75.3
%
to
45.2
%
,
initially
rising
slightly
,
then
falling
at
a
gradually
increasing
rate
.
Between
thresholds
0.99
and
1.0
there
is
only
a
two
percentage
point
difference
in
precision
,
but
recall
differs
by
almost
fourteen
percentage
points
6
.
Over
the
whole
range
,
as
the
threshold
is
increased
from
zero
,
precision
rises
faster
than
recall
falls
until
the
threshold
reaches
0.65
;
here
the
F-score
attains
its
overall
maximum
of
77
.
It
turns
out
that
the
eventual
figure
of
over
90
%
precision
is
not
due
to
'
easier
'
relation
types
(
such
as
the
dependency
between
a
determiner
and
a
noun
)
being
returned
and
more
difficult
ones
(
for
example
clausal
complements
)
being
ignored
.
The
majority
of
relation
types
are
produced
with
frequency
consistent
with
the
overall
45
%
recall
figure
.
Exceptions
are
arg_mod
(
encoding
the
English
passive
'
by-phrase
'
)
and
iobj
(
indirect
object
)
,
for
which
no
GRs
at
all
are
produced
.
The
reason
for
this
is
that
both
types
of
relation
originate
from
an
occurrence
of
a
prepositional
phrase
in
contexts
where
it
could
be
either
a
modifier
or
a
complement
of
a
predicate
.
This
pervasive
ambiguity
means
that
there
will
always
be
disagreement
between
analyses
over
the
relation
type
(
but
not
necessarily
over
the
identity
of
the
head
and
dependent
themselves
)
.
Schmid
and
Rooth
's
algorithm
computes
expected
governors
efficiently
by
using
dynamic
programming
and
processing
the
entire
parse
forest
rather
than
individual
trees
.
In
contrast
,
we
unpack
the
whole
parse
forest
and
then
extract
weighted
GRs
from
each
tree
individually
.
Our
implementation
is
certainly
less
elegant
,
but
in
practical
terms
for
sentences
where
there
are
relatively
small
numbers
of
parses
the
speed
is
still
acceptable
.
6
Roughly
,
each
percentage
point
increase
or
decrease
in
precision
and
recall
is
statistically
significant
at
the
95
%
level
.
In
this
and
all
significance
tests
in
this
paper
we
use
a
one-tailed
paired
t-test
(
with
499
degrees
of
freedom
)
.
However
,
throughput
goes
down
linearly
with
the
number
of
parses
,
and
when
there
are
many
thousands
of
parses
—
and
particularly
also
when
the
sentence
is
long
and
so
each
tree
is
large
—
the
parsing
system
becomes
unacceptably
slow
.
One
possibility
to
improve
the
situation
would
be
to
extract
GRs
directly
from
forests
.
At
first
glance
this
looks
a
possibility
:
although
our
parse
forests
are
produced
by
a
probabilistic
LR
parser
using
a
unification-based
grammar
,
they
are
similar
in
content
to
those
computed
by
a
probabilistic
context-free
grammar
,
as
assumed
by
Schmid
and
Rooth
's
algorithm
.
However
,
there
are
problems
.
If
the
test
for
being
able
to
pack
local
ambiguities
in
the
unification
grammar
parse
forest
is
feature
structure
subsumption
,
unpacking
a
parse
apparently
encoded
in
the
forest
can
fail
due
to
non-local
inconsistency
in
feature
values
(
Oepen
and
Carroll
,
2000
)
7
,
so
every
governor
tuple
hypothesis
would
have
to
be
checked
to
ensure
that
the
parse
it
came
from
was
globally
valid
.
It
is
likely
that
this
verification
step
would
cancel
out
the
efficiency
gained
from
using
an
algorithm
based
on
dynamic
programming
.
This
problem
could
be
side-stepped
(
but
at
the
cost
of
less
compact
parse
forests
)
by
instead
testing
for
feature
structure
equivalence
rather
than
subsumption
.
A
second
,
more
serious
problem
is
that
some
of
our
relation
types
encode
more
information
than
is
present
in
a
single
governor
tuple
(
the
non-clausal
subject
relation
,
for
instance
,
encoding
whether
the
surface
subject
is
the
'
deep
'
object
in
a
passive
construction
)
;
this
information
can
again
be
less
local
and
violate
the
conditions
required
for
the
dynamic
programming
approach
.
Another
possibility
is
to
compute
only
the
$n$
highest
ranked
parses
and
extract
weighted
GRs
from
just
those
.
The
basic
case
where
$n = 1$
is
equivalent
to
the
standard
approach
of
computing
GRs
from
the
highest
probability
parse
.
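Restricting extraction to the highest ranked parses can be sketched as below. The function name `nbest_weighted_grs` and the example data are hypothetical, and the sketch assumes that weights are renormalised over the parses actually kept, which the text does not state explicitly.

```python
from collections import defaultdict


def nbest_weighted_grs(parses, n):
    """Weighted GRs computed from the n most probable parses only.

    parses: list of (probability, set_of_GRs); n: maximum parses kept.
    """
    top = sorted(parses, key=lambda p: p[0], reverse=True)[:n]
    # Assumption: normalise by the probability mass of the kept parses.
    total = sum(prob for prob, _ in top)
    mass = defaultdict(float)
    for prob, grs in top:
        for gr in grs:
            mass[gr] += prob
    return {gr: m / total for gr, m in mass.items()}


# Invented three-parse sentence.
parses = [
    (0.003, {("ncsubj", "reads", "Peter"), ("ncmod", "reads", "markup")}),
    (0.007, {("ncsubj", "reads", "Peter"), ("ncmod", "paper", "markup")}),
    (0.001, {("ncsubj", "reads", "Peter"), ("xmod", "reads", "markup")}),
]
best_only = nbest_weighted_grs(parses, 1)
top_two = nbest_weighted_grs(parses, 2)
print(best_only)
print(top_two)
```

With a limit of one parse every returned GR has weight one, so the result is the GR set of the single most probable parse.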
Table
2
shows
the
effect
on
accuracy
as
$n$
is
increased
in
stages
to
$1000$
,
using
a
threshold
for
GR
extraction
of
$1$
;
also
shown
is
the
previous
setup
(
labelled
'
unlimited
'
)
in
which
all
parses
in
the
forest
are
considered
.
8
(
All
differences
in
precision
in
the
table
are
significant
to
at
least
the
95
%
level
,
except
between
1000
parses
and
an
unlimited
number
)
.
7
The
forest
therefore
also
'
leaks
'
probability
mass
since
it
contains
derivations
that
are
in
fact
not
legal
.
8
At
$n = 1000$
parses
,
the
(
unlabelled
)
weighted
precision
of
head-dependent
pairs
is
91.0
%
.
Table
2
:
Weighted
GR
accuracy
using
a
threshold
of
1
,
with
respect
to
the
maximum
number
of
ranked
parses
considered
(
row
header
:
Maximum
Parses
;
final
row
:
unlimited
)
.
The
results
demonstrate
that
limiting
processing
to
a
relatively
small
,
fixed
number
of
parses
—
even
as
low
as
100
—
comes
within
a
small
margin
of
the
accuracy
achieved
using
the
full
parse
forest
.
These
results
are
striking
,
in
view
of
the
fact
that
the
grammar
assigns
more
than
parses
to
over
a
third
of
the
sentences
in
the
test
corpus
,
and
more
than
a
thousand
parses
to
a
fifth
of
them
.
Another
interesting
observation
is
that
the
relationship
between
precision
and
recall
is
very
close
to
that
seen
when
the
threshold
is
varied
(
as
in
the
previous
section
)
;
there
appears
to
be
no
loss
in
recall
at
a
given
level
of
precision
.
We
therefore
feel
confident
in
unpacking
a
limited
number
of
parses
from
the
forest
and
extracting
weighted
GRs
from
them
,
rather
than
trying
to
process
all
parses
.
We
have
tentatively
set
the
limit
to
be
,
as
a
reasonable
compromise
in
our
system
between
throughput
and
accuracy
.
The
way
in
which
the
GR
weighting
is
carried
out
does
not
matter
when
the
weight
threshold
is
equal
to
1
(
since
then
only
GRs
that
are
part
of
every
analysis
are
returned
,
each
with
a
weight
of
one
)
.
However
,
we
wanted
to
see
whether
the
precise
method
for
assigning
weights
to
GRs
has
an
effect
on
accuracy
,
and
if
so
,
to
what
extent
.
We
therefore
tried
an
alternative
approach
where
each
GR
receives
a
contribution
of
1
from
every
parse
,
no
matter
what
the
probability
of
the
parse
is
,
normalising
in
this
case
by
the
number
of
parses
considered
.
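The alternative, probability-independent weighting can be sketched as follows. The function name `count_weighted_grs` and the example parses are assumptions for illustration: each parse contributes 1 to every GR it contains, and the totals are divided by the number of parses considered.

```python
from collections import Counter


def count_weighted_grs(parses):
    """Weight each GR by the fraction of parses that contain it.

    parses: list of sets of GRs, one set per parse considered;
    parse probabilities are deliberately ignored.
    """
    counts = Counter(gr for grs in parses for gr in grs)
    return {gr: c / len(parses) for gr, c in counts.items()}


# Invented example: three parses of one sentence.
parses = [
    {("ncsubj", "reads", "Peter"), ("ncmod", "paper", "markup")},
    {("ncsubj", "reads", "Peter"), ("ncmod", "reads", "markup")},
    {("ncsubj", "reads", "Peter"), ("ncmod", "paper", "markup")},
]
uniform = count_weighted_grs(parses)
for gr, w in sorted(uniform.items(), key=lambda kv: -kv[1]):
    print(f"{w:.2f} {gr}")
```

Under this scheme a GR found only in many improbable parses can still receive a high weight, which is consistent with more GRs passing any given threshold.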
This
tends
to
increase
the
numbers
of
GRs
returned
for
any
given
threshold
,
so
when
comparing
the
two
methods
we
found
thresholds
such
that
each
method
obtained
the
same
precision
figure
(
of
roughly
83.38
%
)
.
We
then
compared
the
recall
figures
(
see
table
3
)
.
The
recall
for
the
probabilistic
weighting
scheme
is
4
%
higher
(
statistically
significant
at
the
99.95
%
level
)
.
Table
3
:
Accuracy
at
the
same
level
of
precision
using
different
weighting
methods
,
with
a
1000-parse
tree
limit
(
row
header
:
Weighting
Method
)
.
3.4
Maximal
Consistent
Relation
Sets
It
is
interesting
to
see
what
happens
if
we
compute
for
each
sentence
the
maximal
consistent
set
of
weighted
GRs
.
(
We
might
want
to
do
this
if
we
want
complete
and
coherent
sentence
analyses
,
interpreting
the
weights
as
confidence
measures
over
sub-analysis
segments
)
.
We
use
a
'
greedy
'
algorithm
to
compute
consistent
relation
sets
,
taking
GRs
sorted
in
order
of
decreasing
weight
and
adding
a
GR
to
the
set
if
and
only
if
there
is
not
already
a
GR
in
the
set
with
the
same
dependent
.
(
But
note
that
the
correct
analysis
may
in
fact
contain
more
than
one
GR
with
the
same
dependent
,
such
as
the
ncsubj
.
.
.
Failure
GRs
in
Figure
1
,
and
in
these
cases
this
method
will
introduce
errors
)
.
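The greedy selection described above can be sketched as follows. The function name `greedy_consistent_set` and the `(type, head, dependent)` tuple layout are assumptions; in particular, dependents are identified here by word form alone, whereas a real system would need token positions to distinguish repeated words.

```python
def greedy_consistent_set(weighted):
    """Greedily pick GRs so that no two share a dependent.

    weighted: {GR: weight}, each GR a (type, head, dependent) tuple.
    GRs are visited in order of decreasing weight; ties keep their
    original order because Python's sort is stable.
    """
    chosen = {}
    used_dependents = set()
    ranked = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)
    for gr, weight in ranked:
        dependent = gr[2]
        if dependent not in used_dependents:
            used_dependents.add(dependent)
            chosen[gr] = weight
    return chosen


# Invented weights with two competing attachments for "markup".
weighted = {
    ("ncsubj", "reads", "Peter"): 1.0,
    ("dobj", "reads", "paper"): 1.0,
    ("ncmod", "paper", "markup"): 0.7,
    ("ncmod", "reads", "markup"): 0.3,
}
consistent = greedy_consistent_set(weighted)
print(sorted(consistent))
```

As the text notes, a correct analysis may contain several GRs with the same dependent; this sketch would drop all but the highest weighted one.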
The
weighted
precision
,
recall
and
F-score
at
threshold
zero
are
79.31
%
,
73.56
%
and
76.33
respectively
.
Precision
and
F-score
are
significantly
better
(
at
the
95.95
%
level
)
than
the
baseline
.
3.5
Parser
Bootstrapping
One
of
our
primary
research
goals
is
to
explore
unsupervised
acquisition
of
lexical
knowledge
.
The
parser
we
use
in
this
work
is
'
semi-lexicalised
'
,
using
subcategorisation
probabilities
for
verbs
acquired
automatically
from
(
unlexicalised
)
parses
.
In
the
future
we
intend
to
acquire
other
types
of
lexico-statistical
information
(
for
example
on
PP
attachment
)
which
we
will
feed
back
into
the
parser
's
disambiguation
procedure
,
bootstrapping
successively
more
accurate
versions
of
the
parsing
system
.
There
is
still
plenty
of
scope
for
improvement
in
accuracy
,
since
compared
with
the
number
of
correct
GRs
in
top-ranked
parses
there
are
roughly
a
further
20
%
that
are
correct
but
present
only
in
lower-ranked
parses
.
There
appears
to
be
less
room
for
improvement
with
argument
relations
(
ncsubj
,
dobj
etc.
)
than
with
modifier
relations
(
ncmod
and
similar
)
.
This
indicates
that
our
next
efforts
should
be
directed
to
collecting
information
on
modification
.
4
Discussion
and
Further
Work
We
have
extended
a
shallow
parsing
system
for
English
that
returns
analyses
in
the
form
of
sets
of
grammatical
relations
,
presenting
an
investigation
into
the
extraction
of
weighted
relations
from
probabilistic
parses
.
We
observed
that
setting
a
threshold
on
the
output
such
that
any
relation
with
weight
lower
than
the
threshold
is
discarded
allows
a
trade-off
to
be
made
between
recall
and
precision
,
and
found
that
by
setting
the
threshold
at
1
the
precision
of
the
system
was
boosted
dramatically
,
from
a
baseline
of
75
%
to
over
90
%
.
With
this
setting
,
the
system
returns
only
relations
that
form
part
of
all
analyses
licensed
by
the
grammar
:
the
system
can
have
no
greater
certainty
that
these
relations
are
correct
,
given
the
knowledge
that
is
available
to
it
.
Although
we
believe
this
technique
to
be
well
suited
to
probabilistic
parsers
,
it
could
also
potentially
benefit
any
parsing
system
that
can
represent
ambiguity
and
return
analyses
that
are
composed
of
a
collection
of
elementary
units
.
Such
a
system
need
not
necessarily
be
statistical
,
since
parse
probabilities
make
no
difference
when
checking
that
a
given
sub-analysis
segment
forms
part
of
all
possible
global
analyses
.
Moreover
,
a
non-statistical
parsing
system
could
use
the
technique
to
construct
a
reliable
annotated
corpus
automatically
,
which
it
could
then
be
trained
on
.
Acknowledgements
We
are
grateful
to
Mats
Rooth
for
early
discussions
about
his
expected
governor
label
work
.
This
research
was
supported
by
UK
EPSRC
projects
GR
/
N36462
/
93
'
Robust
Accurate
Statistical
Parsing
(
RASP
)
'
and
by
EU
FP5
project
IST-2001-34460
'
MEANING
:
Developing
Multilingual
Web-scale
Language
Technologies
'
.
