We present our system used in the CoNLL 2007 shared task on multilingual parsing. The system is composed of three components: a k-best maximum spanning tree (MST) parser, a tree labeler, and a reranker that orders the k-best labeled trees. We present two techniques for training the MST parser: tree-normalized and graph-normalized conditional training. The tree-based reranking model allows us to explicitly model global syntactic phenomena. We describe the reranker features, which include non-projective edge attributes. We provide an analysis of the errors made by our system and suggest changes to the models and features that might rectify the current system.
1 Introduction

Reranking the output of a k-best parser has been shown to improve upon the best results of a state-of-the-art constituency parser (Charniak and Johnson, 2005). This is primarily due to the ability to incorporate complex structural features that cannot be modeled under a CFG. Recent work shows that k-best maximum spanning tree (MST) parsing and reranking is also viable (Hall, 2007). In the current work, we explore the k-best MST parsing paradigm along with a tree-based reranker. A system using the parsing techniques presented in this paper was entered in the CoNLL 2007 shared task competition (Nivre et al., 2007).
This task evaluated parsing performance on 10 languages: Arabic, Basque, Catalan, Chinese, Czech, English, Greek, Hungarian, Italian, and Turkish, using data originating from a wide variety of dependency treebanks and transformations of constituency-based treebanks (Hajič et al., 2004; Aduriz et al., 2003; Martí et al., 2007; Chen et al., 2003; Böhmová et al., 2003; Marcus et al., 1993; Johansson and Nugues, 2007; Prokopidis et al., 2005; Csendes et al., 2005; Montemagni et al., 2003; Oflazer et al., 2003).
We show that the oracle parse accuracy¹ of the output of our k-best parser is generally higher than the best reported results. We also present the results of a reranker based on a rich set of structural features, including features explicitly targeted at modeling non-projective configurations. Labeling of the dependency edges is accomplished by an edge labeler based on the same feature set as used in training the k-best MST parser.
2 Parser Description

Our parser is composed of three components: a k-best MST parser, a tree labeler, and a tree reranker. Log-linear models are used for each of the components independently. In this section we give an overview of the models, the training techniques, and the decoders.
The connection between the maximum spanning tree problem and dependency parsing stems from the observation that a dependency parse is simply an oriented spanning tree on the graph of all possible dependency links (the fully connected dependency graph). Unfortunately, by mapping the problem to a graph, we assume that the scores associated with edges are independent, and we are thus limited to edge-factored models. Edge-factored models are severely limited in their capacity to predict structure.

¹The oracle accuracy for a set of hypotheses is the maximal accuracy for any of the hypotheses.
In fact, they can only directly model parent-child links. In order to alleviate this, we use a k-best MST parser to generate a set of candidate hypotheses. We then rerank these trees using a model based on rich structural features capturing phenomena such as valency, subcategorization, ancestry relationships, and sibling interactions, as well as the global structure of dependency trees, aimed primarily at modeling language-specific non-projective configurations.
We assign dependency labels to entire trees, rather than predicting the labels during tree construction. Given that we have a reranking process, we can label the k-best tree hypotheses output by our MST parser and rerank the labeled trees. We have explored both labeled and unlabeled reranking; in the latter case, we simply label the maximal unlabeled tree.
McDonald et al. (2005) present a technique for training discriminative models for dependency parsing. The edge-factored models we use for MST parsing are closely related to those described in that work, but allow for the efficient computation of the normalization factors required for first- and second-order (gradient-based) training techniques. We consider two estimation procedures for parent-prediction models.
A parent-prediction model assigns a conditional score s(g | d) for every parent-child pair (we denote the parent/governor g and the child/dependent d), where s(g | d) = s(g, d) / Σ_g' s(g', d). In our work, we compute probabilities p(g | d) based on conditional log-linear models. This is an approximation to a generative model that predicts each node once (i.e., Π_d p(d | g)).
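The parent-prediction score is simply a normalization over candidate governors. As a minimal sketch (not the authors' implementation; the feature map and weight table below are hypothetical), a log-linear s(g | d) looks like:

```python
import math

def parent_distribution(weights, feats, candidates, d):
    """Conditional parent-prediction score s(g | d): the exponentiated
    feature score of each candidate governor g, normalized over all
    competing governors g'. `weights` and `feats` stand in for the
    model's learned parameters and feature map (hypothetical)."""
    raw = {g: math.exp(sum(weights.get(f, 0.0) for f in feats(g, d)))
           for g in candidates}
    z = sum(raw.values())  # denominator: sum over competing governors g'
    return {g: s / z for g, s in raw.items()}

# Toy example: two candidate governors for the dependent "bank".
feats = lambda g, d: [f"pair:{g}-{d}", f"gov:{g}"]
weights = {"pair:ROOT-bank": 0.5, "gov:river": 1.0}
p = parent_distribution(weights, feats, ["ROOT", "river"], "bank")
```

Training then adjusts the weights so that the correct governor g* receives high conditional probability for each dependent in the treebank.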
In the graph-normalized model, we assume that the conditional distributions are independent of one another. In particular, we find the model parameters that maximize the likelihood of p(g* | d), where g* is the correct parent in the training data. We perform the optimization over the entire training set, tying the feature parameters. Specifically, we perform maximum entropy (MaxEnt) estimation over the conditional distribution using second-order gradient descent optimization techniques.²
An advantage of the parent-prediction model is that we can frame the estimation problem as that of minimum-error training with a zero-one loss term:

    p(g_j | d) = exp(Σ_i λ_i f_i(e_j, g_j, d)) / Z_d

where e_j ∈ {0, 1} is the error term (e_j is 1 for the correct parent and 0 for all other nodes) and Z_d = Σ_j exp(Σ_i λ_i f_i(e_j, g_j, d)) is the normalization constant for node d. Note that the normalization factor considers all graphs with in-degree zero for the root node and in-degree one for all other nodes. At parsing time, of course, our parent predictions are constrained to produce a (non-projective) tree structure.
We can sum over all non-projective spanning trees by taking the determinant of the Kirchhoff matrix of the graph defined above, minus the row and column corresponding to the root node (Smith and Smith, 2007).
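As a toy illustration of that determinant computation (a sketch, not the authors' code), the directed matrix-tree theorem with uniform edge weights counts the spanning arborescences of a small fully connected graph:

```python
def kirchhoff_total(weights):
    """Sum of weights of all spanning arborescences rooted at node 0,
    computed as the determinant of the Kirchhoff (Laplacian) matrix
    with the root row and column removed (directed matrix-tree theorem).
    weights[i][j] is the weight of the edge from governor i to dependent j."""
    n = len(weights)
    # Laplacian restricted to the non-root nodes 1..n-1.
    L = [[0.0] * (n - 1) for _ in range(n - 1)]
    for j in range(1, n):
        L[j - 1][j - 1] = sum(weights[i][j] for i in range(n) if i != j)
        for i in range(1, n):
            if i != j:
                L[i - 1][j - 1] = -weights[i][j]
    # Determinant by Gaussian elimination with partial pivoting.
    m, det = n - 1, 1.0
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(L[r][c]))
        if abs(L[piv][c]) < 1e-12:
            return 0.0
        if piv != c:
            L[c], L[piv] = L[piv], L[c]
            det = -det
        det *= L[c][c]
        for r in range(c + 1, m):
            f = L[r][c] / L[c][c]
            for k in range(c, m):
                L[r][k] -= f * L[c][k]
    return det

# Root (node 0) plus two words, uniform weights; no edges enter the root.
w = [[0, 1, 1], [0, 0, 1], [0, 1, 0]]
total = kirchhoff_total(w)  # counts the 3 arborescences rooted at node 0
```

With exponentiated feature scores in place of the uniform weights, the same determinant yields the partition function over all non-projective trees.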
Training graph-normalized and tree-normalized models under identical conditions, we find that tree normalization wins by 0.5% to 1% absolute dependency accuracy. Although tree normalization also shows a (smaller) advantage in k-best oracle accuracy, we do not believe it would have a large effect on our reranking results.

The reranker is based on a conditional log-linear model subject to the MaxEnt constraints, using the same second-order optimization procedures as the graph-normalized MST models. The primary difference here is that there is no single correct tree in the set of k candidate parse trees. Instead, we have k trees that are generated by our k-best parser, each with a score assigned by the parser. If we are performing labeled reranking, we label each of these hypotheses with l possible labelings, each with a score assigned by the labeler.
As with the graph-normalized parent-prediction model, we perform minimum-error training. The optimization is achieved by assuming that the oracle-best parse(s) are correct and the remaining hypotheses are incorrect. Furthermore, the feature values are scaled according to the relative difference between the oracle-best score and the score assigned to the non-oracle-best hypothesis.

²For the graph-normalized models, we use L-BFGS optimization provided through the TAO/PETSc optimization libraries (Benson et al., 2005; Balay et al., 2004).
Note that any reranker could be used in place of our current model. We have chosen to keep the reranker model closely related to the MST parsing model so that we can share feature representations and training procedures.

We used the same edge features to train a separate log-linear labeling model. Each edge feature was conjoined with a potential label, and we then maximized the likelihood of the labeling in the training data.
Since this model is also edge-factored, we can store the labeler scores for each of the n² potential edges in the dependency tree. In the submitted system, we simply extracted the Viterbi predictions of the labeler for the unlabeled trees selected by the reranker.
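Because the labeling model is edge-factored, the Viterbi labeling of a fixed tree decomposes into an independent argmax over labels for each edge. A minimal sketch (the score table below is hypothetical, not the system's actual labeler):

```python
def viterbi_labeling(tree_edges, label_scores):
    """Best labeling of a fixed unlabeled tree under an edge-factored
    labeler: since edges are scored independently, the Viterbi labeling
    is just the highest-scoring label for each edge.
    label_scores[(g, d)] maps each candidate label to a score."""
    return {edge: max(label_scores[edge], key=label_scores[edge].get)
            for edge in tree_edges}

# Toy tree with two edges and hypothetical per-edge label scores.
scores = {
    (0, 2): {"ROOT": 1.2, "OBJ": 0.1},
    (2, 3): {"SUBJ": 0.4, "OBJ": 0.9},
}
labels = viterbi_labeling([(0, 2), (2, 3)], scores)
```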
We also ran experiments (see below) in which each entry in the k-best lists input as training data to the reranker was augmented with its l-best labelings. We hoped thereby to inject more diversity into the resulting structures.
Our MST models are based on the features described in (Hall, 2007); specifically, we use features based on a dependency node's form, lemma, coarse and fine part-of-speech tags, and morphological-string attributes. Additionally, we use the surface-string distance between the parent and child; buckets of features indicating whether a particular form/lemma/tag occurred between or next to the parent and child; and a branching feature indicating whether the child is to the left or right of the parent. Composite features combining the above features are also included (e.g., a single feature combining branching, parent and child form, and parent and child tag).
The tree-based reranker includes the features described in (Hall, 2007) as well as features based on the non-projective edge attributes explored in (Havelka, 2007a; Havelka, 2007b). One set of features models relationships of nodes with their siblings, including valency and subcategorization. A second set of features models global tree structure and includes features based on a node's ancestors and the depth and size of its subtree.
A third set of features models the interaction of word order and tree structure as manifested on individual edges; i.e., these features model language-specific projective and non-projective configurations. They include edge-based features corresponding to the global constraints of projectivity, planarity, and well-nestedness, and, for non-projective edges, they furthermore include level type, level signature, and ancestor-in-gap features.
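For concreteness, the projectivity constraint on a single edge can be checked directly: an edge (g, d) is projective iff every word between g and d in the surface order is a descendant of the governor g. A sketch under that standard definition (not code from the cited work):

```python
def is_projective_edge(heads, g, d):
    """True iff edge (g, d) is projective: every word strictly between
    g and d in the surface string is dominated by g.
    heads[i] is the governor index of word i (index 0 is the root)."""
    def dominated_by(node, anc):
        # Walk up the head chain from `node` looking for `anc`.
        while node != 0:
            node = heads[node]
            if node == anc:
                return True
        return False
    lo, hi = min(g, d), max(g, d)
    return all(dominated_by(i, g) for i in range(lo + 1, hi))

# Toy 4-word sentence: root -> 2 -> 3 -> {1, 4}; the edge (3, 1) crosses
# word 2, which hangs off the root, so that edge is non-projective.
heads = [0, 3, 0, 2, 3]
```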
All features allow for an arbitrary degree of lexicalization; in the reported results, the first two sets of features use coarse and fine part-of-speech lexicalizations, while the features in the third set are used in their unlexicalized form due to time limitations.
3 Results and Analysis

Hall (2007) shows that the oracle parsing accuracy of a k-best edge-factored MST parser is considerably higher than the one-best score of the same parser, even when k is small. We have verified that this is true for the CoNLL shared-task data by evaluating the oracle rates on a randomly sampled development set for each language.
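Computing an oracle rate for a k-best list is straightforward: it is the best attachment accuracy achieved by any hypothesis in the list. A minimal sketch over head-index sequences (the data below are hypothetical):

```python
def oracle_accuracy(gold_heads, kbest_heads):
    """Oracle (best-achievable) unlabeled attachment accuracy for a
    k-best list: the maximal accuracy over all hypotheses."""
    def accuracy(hyp):
        correct = sum(h == g for h, g in zip(hyp, gold_heads))
        return correct / len(gold_heads)
    return max(accuracy(hyp) for hyp in kbest_heads)

# Toy gold parse and a 3-best list of head sequences.
gold = [2, 0, 2]
kbest = [[2, 0, 1], [0, 2, 2], [2, 3, 2]]
oracle = oracle_accuracy(gold, kbest)
```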
In order to select optimal model parameters for the MST parser, the labeler, and the reranker, we sampled approximately 200 sentences from each training set to use as a development test set.
Training the reranker requires a jackknife n-fold training procedure in which n-1 partitions are used to train a model that parses the remaining partition. This is done n times to generate k-best parses for the entire training set without using models trained on the data they are run on.
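The jackknife partitioning can be sketched as follows (a minimal illustration, not the actual training pipeline): each fold is parsed by a model trained on the other n-1 partitions, so no sentence is ever parsed by a model that saw it in training.

```python
def jackknife_folds(sentences, n):
    """n-fold jackknife: for each fold, yield (train, held_out) where
    `train` is the union of the other n-1 partitions. The caller trains
    a parser on `train` and produces k-best parses for `held_out`."""
    parts = [sentences[i::n] for i in range(n)]
    for i, held_out in enumerate(parts):
        train = [s for j, p in enumerate(parts) if j != i for s in p]
        yield train, held_out

# Toy corpus of 10 sentence ids split into 5 folds.
folds = list(jackknife_folds(list(range(10)), 5))
```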
For lack of space, we report only results on the CoNLL evaluation data set here, but note that the trends observed on the evaluation data are identical to those observed on our development sets.

In Table 1 we present results for labeled (and unlabeled) dependency accuracy on the CoNLL 2007 evaluation data set.
We report the oracle accuracy for different sized k-best hypothesis sets. The columns are labeled by the number of trees output from the MST parser, k,³ and by the number of alternative labelings for each tree, l.

³All results are reported for the graph-normalized training technique.

Table 1: Labeled (unlabeled) attachment accuracy for k-best MST oracle results and reranked data on the evaluation set. The 1-best results (k = 1, l = 1) represent the performance of the MST parser without reranking. The New Reranked field shows recent unlabeled reranking results of 50-best trees using a modified feature set. For Arabic, we only report unlabeled accuracy for different k and l.
When k = 1, the score is the best achievable by the edge-factored MST parser using our models. As k increases, the oracle parsing accuracy increases. The most extreme difference between the one-best accuracy and the 50-best oracle accuracy can be seen for Turkish, where there is a difference of 9.64 points of accuracy (8.77 for the unlabeled trees). This means that the reranker need only select the correct tree from a set of 50 to increase the score by 9.64%. As our reranking results show, this is not as simple as it may appear.
We report the results for our CoNLL submission as well as recent results based on alternative parameter optimization on the development set. We report the latest results only for the unlabeled accuracy of reranking the 50-best MST output.
4 Conclusion

Our submission to the CoNLL 2007 shared task on multilingual parsing supports the hypothesis that edge-factored MST parsing is viable given an effective reranker. The reranker used in our submission was unable to achieve the oracle rates. We believe this is primarily due to a relatively impoverished feature set. Due to time constraints, we have not been able to train lexicalized reranking models. The introduction of lexicalized features in the reranker should improve the selection of better trees, which we know exist in the k-best hypothesis sets.
