We present experiments with a dependency parsing model defined on rich factors. Our model represents dependency trees with factors that include three types of relations between the tokens of a dependency and their children. We extend the projective parsing algorithm of Eisner (1996) for our case, and train models using the averaged perceptron. Our experiments show that considering higher-order information yields significant improvements in parsing accuracy, but comes at a high cost in terms of both time and memory consumption. In the multilingual exercise of the CoNLL-2007 shared task (Nivre et al., 2007), our system obtains the best accuracy for English, and the second best accuracies for Basque and Czech.
1 Introduction
Structured prediction problems usually involve models that work with factored representations of structures. The information included in the factors determines the type of features that the model can exploit. However, richer representations translate into higher complexity of the inference algorithms associated with the model.
In dependency parsing, the basic first-order model is defined by a decomposition of a tree into head-modifier dependencies. Previous work extended this basic model to include second-order relations, i.e. dependencies that are adjacent to the main dependency of the factor. Specifically, these approaches considered sibling relations of the modifier token (Eisner, 1996; McDonald and Pereira, 2006). In this paper we extend the parsing model with other types of second-order relations. In particular, we incorporate relations between the head and modifier tokens and the children of the modifier.
One paradigmatic case where the relations we consider are relevant is PP-attachment. For example, in "They sold 1,210 cars in the U.S.", the ambiguity problem is to determine whether the preposition "in" (which governs "the U.S.") modifies "sold" or "cars", the former being correct in this case. It is generally accepted that to solve the attachment decision it is necessary to look at the head noun within the prepositional phrase (i.e., "U.S." in the example), which has a grand-parental relation with the two candidate tokens to which the phrase may attach; see e.g. Ratnaparkhi et al. (1994). Other ambiguities in language may also require consideration of grand-parental relations in the dependency structure.
We present experiments with higher-order models trained with the averaged perceptron. The second-order relations that we incorporate in the model yield significant improvements in accuracy. However, the inference algorithms for our factorization are very expensive in terms of time and memory consumption, and become impractical when dealing with many labels or long sentences.
2 Higher-Order Projective Models
Figure 1: A factor in the higher-order parsing model. m ∈ [1...n] is the index of the modifier token, and l ∈ [1...L] is the label of the dependency. The value h = 0 is used for dependencies where the head is a special root-symbol of the sentence.
We denote by T(x) the set of all possible dependency structures for a sentence x. In this paper, we restrict to projective dependency trees. The dependency tree computed by the parser for a given sentence is:

y*(x) = argmax_{y ∈ T(x)} Σ_{f ∈ y} score(w, x, f)
The parsing model represents a structure y as a set of factors, f ∈ y, and scores each factor using parameters w. In a first-order model a factor corresponds to a single labeled dependency, i.e. f = (h, m, l).
The features of the model are defined through a feature function φ1(x, h, m) which maps a sentence together with an unlabeled dependency to a feature vector in R^{d1}. The parameters of the model are a collection of vectors w_l ∈ R^{d1}, one for each possible label. The first-order model scores a factor as score1(w, x, (h, m, l)) = φ1(x, h, m) · w_l.
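As a concrete illustration, the first-order score is a single dot product between a sparse feature vector and the weight vector of the chosen label. The sketch below assumes a hypothetical sparse feature function that returns the indices of active binary features; it is not the paper's implementation.

```python
# Sketch of first-order scoring: score1(w, x, (h, m, l)) = phi1(x, h, m) . w_l.
# phi1 is a hypothetical stand-in returning indices of active binary features.
def score1(w, phi1, x, h, m, l):
    return sum(w[l][i] for i in phi1(x, h, m))

# Toy example: two labels, three binary features.
w = {"SBJ": [1.0, 0.5, -0.25], "OBJ": [0.0, 2.0, 0.125]}
phi1 = lambda x, h, m: [0, 2]   # features 0 and 2 fire for this dependency
print(score1(w, phi1, "They sold cars", 1, 0, "SBJ"))  # 1.0 + (-0.25) = 0.75
```

With sparse binary features, the dot product reduces to summing the weights of the active features, which is how first-order parsers typically score dependencies efficiently.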
The higher-order model defined in this paper decomposes a dependency structure into factors that include children of the head and the modifier. In particular, a factor in our model is represented by the signature f = (h, m, l, ch, cmi, cmo) where, as in the first-order model, h, m and l are respectively the head, modifier and label of the main dependency of the factor; ch is the child of h in [h...m] that is closest to m; cmi is the child of m inside [h...m] that is furthest from m; cmo is the child of m outside [h...m] that is furthest from m.
Figure 1 depicts a factor of the higher-order model, and Table 1 lists the factors of an example sentence. Note that a factor involves a main labeled dependency and three adjacent unlabeled dependencies that attach to children of h and m. Special values are used when any of these children is null.
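To make the factor signature concrete, the sketch below extracts ch, cmi and cmo from a tree encoded as a head array; the encoding and all names are our own illustrative choices, not the paper's code.

```python
# Sketch: extract the three secondary tokens of a higher-order factor from a
# projective tree given as a head array (index 0 is the root symbol).
def factor_children(heads, h, m):
    """Return (ch, cmi, cmo) for dependency (h, m); None encodes a null child."""
    lo, hi = min(h, m), max(h, m)
    inside = lambda t: lo <= t <= hi
    # child of h in [h...m] closest to m (excluding m itself)
    ch_cands = [t for t in range(1, len(heads))
                if heads[t] == h and inside(t) and t != m]
    ch = min(ch_cands, key=lambda t: abs(m - t)) if ch_cands else None
    # child of m inside [h...m] furthest from m
    mi = [t for t in range(1, len(heads)) if heads[t] == m and inside(t)]
    cmi = max(mi, key=lambda t: abs(m - t)) if mi else None
    # child of m outside [h...m] furthest from m
    mo = [t for t in range(1, len(heads)) if heads[t] == m and not inside(t)]
    cmo = max(mo, key=lambda t: abs(m - t)) if mo else None
    return ch, cmi, cmo

# "They sold 1,210 cars in the U.S." with heads (token 0 is the root):
# They<-sold, sold<-root, 1,210<-cars, cars<-sold, in<-sold, the<-U.S., U.S.<-in
heads = [None, 2, 0, 4, 2, 2, 7, 5]
print(factor_children(heads, 2, 5))  # ch=4 ('cars'), cmi=None, cmo=7 ('U.S.')
```

For the dependency from 'sold' to 'in', the factor recovers 'cars' as the head child closest to the modifier and 'U.S.' as the modifier child outside the span, which is exactly the grand-parental information the PP-attachment example relies on.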
Table 1: Higher-order factors for an example sentence. For simplicity, labels of the factors have been omitted. A first-order model considers only (h, m). The second-order model of McDonald and Pereira (2006) considers (h, m, ch). For the PP-attachment decision (factor in row 5), the higher-order model allows us to define features that relate the verb ('sold') with the content word of the prepositional phrase ('U.S.').

The higher-order model defines additional second-order features through a function φ2(x, h, m, c) which maps a head, a modifier and a child into a feature vector in R^{d2}.
The parameters of the model are a collection of four vectors for each dependency label: w_l ∈ R^{d1} as in the first-order model; and w_h, w_mi and w_mo, all three in R^{d2} and each associated to one of the adjacent dependencies in the factor. The score of a factor is:

score(w, x, f) = φ1(x, h, m) · w_l + φ2(x, h, m, ch) · w_h + φ2(x, h, m, cmi) · w_mi + φ2(x, h, m, cmo) · w_mo,

where the four vectors are those associated with label l.
Note that the model uses a common feature function for second-order relations, but features could be defined specifically for each type of relation. Note also that while the higher-order factors include four dependencies, our modelling choice only exploits relations between the main dependency and secondary dependencies. Considering relations between secondary dependencies would greatly increase the cost of the associated algorithms.
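Under stated assumptions (sparse binary features, per-label weight vectors, None for a null child), the factor score could be sketched as follows; all names are ours, not the paper's.

```python
# Sketch of the higher-order factor score: one first-order term plus up to
# three second-order terms, one per adjacent unlabeled dependency.
def score2(wl, wh, wmi, wmo, phi1, phi2, x, f):
    h, m, l, ch, cmi, cmo = f
    s = sum(wl[l][i] for i in phi1(x, h, m))          # first-order term
    for w, c in ((wh, ch), (wmi, cmi), (wmo, cmo)):   # second-order terms
        if c is not None:                             # None encodes a null child
            s += sum(w[l][i] for i in phi2(x, h, m, c))
    return s

# Toy weights: feature 0 fires for phi1, feature 1 for phi2.
phi1 = lambda x, h, m: [0]
phi2 = lambda x, h, m, c: [1]
wl = {"L": [2.0, 0.0]}
wh = {"L": [0.0, 1.0]}
wmi = {"L": [0.0, 1.5]}
wmo = {"L": [0.0, 3.0]}
print(score2(wl, wh, wmi, wmo, phi1, phi2, "x", (0, 1, "L", None, None, 2)))  # 5.0
```

Because the terms are summed independently, each secondary dependency contributes its own dot product, which is what lets the parsing algorithm optimize over ch and cmi separately.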
2.1 Parsing Algorithm
In this section we sketch an extension of the projective dynamic programming algorithm of Eisner (1996; 2000) for the higher-order model defined above. The time complexity of the algorithm is O(n^4 L), and the memory requirements are O(n^2 L + n^3).
As in the Eisner approach, our algorithm visits sentence spans in a bottom-up fashion, and constructs a chart with two types of dynamic programming structures, namely open and closed structures; see Figure 2 for a diagram. The dynamic programming structures are:
Figure 2: Dynamic programming structures used in the parsing algorithm. The variables in boldface constitute the index of the chart entry for a structure; the other variables constitute the back-pointer stored in the chart entry. Left: an open structure for the chart entry [h, m, l]_O; the algorithm looks for the r, ch and cmi that yield the optimal score for this structure. Right: a closed structure for the chart entry [h, e, m]_C; the algorithm looks for the l and cmo that yield the optimal score.
• Open structures: For each span from s to e and each label l, the algorithm maintains a chart entry [s, e, l]_O associated to the dependency (s, e, l). For each entry, the algorithm looks for the optimal splitting point r, sibling ch and grand-child cmi, using parameters w_l, w_h and w_mi. This can be done in O(n^2) because our features do not consider interactions between ch and cmi. Similar entries [e, s, l]_O are maintained for dependencies headed at e.
• Closed structures: For each span from s to e and each token m ∈ [s...e], the algorithm maintains an entry [s, e, m]_C associated to a partial dependency tree rooted at s in which m is the last modifier of s. The algorithm chooses the optimal dependency label l and grand-child cmo in O(nL), using parameters w_mo. Similar entries [e, s, m]_C are maintained for dependencies headed at e.
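The open and closed structures above generalize the incomplete and complete spans of the first-order Eisner parser. As a point of reference, a minimal first-order sketch (unlabeled, maximum score only) is given below; this is standard illustrative code, not the paper's implementation, and the higher-order algorithm enriches these structures with label and child indices.

```python
# Standard first-order projective Eisner sketch (unlabeled, scores only).
def eisner(score):
    """score[h][m] is the score of dependency h -> m; token 0 is the root.
    Returns the maximum total score of a projective tree rooted at 0."""
    n = len(score)
    NEG = float("-inf")
    # comp/inc[s][e][d]: d = 0 means the head is the right endpoint e,
    # d = 1 means the head is the left endpoint s.
    comp = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    inc = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    for i in range(n):
        comp[i][i][0] = comp[i][i][1] = 0.0
    for k in range(1, n):            # span length, bottom-up
        for s in range(n - k):
            e = s + k
            # open (incomplete) spans: add the dependency between s and e
            best = max(comp[s][r][1] + comp[r + 1][e][0] for r in range(s, e))
            inc[s][e][0] = best + score[e][s]
            inc[s][e][1] = best + score[s][e]
            # closed (complete) spans
            comp[s][e][0] = max(comp[s][r][0] + inc[r][e][0] for r in range(s, e))
            comp[s][e][1] = max(inc[s][r][1] + comp[r][e][1] for r in range(s + 1, e + 1))
    return comp[0][n - 1][1]
```

In the higher-order extension, the open entries additionally optimize over ch and cmi, and the closed entries over l and cmo, which accounts for the extra factors of n and L in the O(n^4 L) complexity.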
We implemented two variants of the algorithm. The first forces the root token to participate in exactly one dependency. The second allows many dependencies involving the root token. For the single-root case, it is necessary to treat the root token differently from other tokens. In the experiments, we used the single-root variant if sentences in the training set satisfy this property; otherwise we used the multi-root variant.
2.2 Features

The features of our model were inspired by successful previous work in first-order dependency parsing (McDonald et al., 2005). The most basic feature patterns consider the surface form, part-of-speech, lemma and other morpho-syntactic attributes of the head or the modifier of a dependency. The representation also considers complex features that exploit a variety of conjunctions of the forms and part-of-speech tags of the following items: the head and modifier; the head, modifier, and any token in between them; the head, modifier, and the two tokens following or preceding them.
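The feature patterns described above could be sketched as follows; the exact templates of the paper are not specified here, so the template names and selection are illustrative, in the spirit of McDonald et al. (2005).

```python
# Sketch of first-order feature patterns: surface forms, POS tags, their
# conjunctions, and in-between POS conjunctions. Template names are ours.
def first_order_features(words, tags, h, m):
    """Return string-valued features for an unlabeled dependency (h, m)."""
    feats = [
        "hw=" + words[h], "ht=" + tags[h],        # head form and POS
        "mw=" + words[m], "mt=" + tags[m],        # modifier form and POS
        "hw,mw=%s,%s" % (words[h], words[m]),     # form conjunction
        "ht,mt=%s,%s" % (tags[h], tags[m]),       # POS conjunction
        "dist=%d" % (m - h),                      # signed distance
    ]
    # POS of each token strictly between head and modifier
    for t in range(min(h, m) + 1, max(h, m)):
        feats.append("ht,bt,mt=%s,%s,%s" % (tags[h], tags[t], tags[m]))
    return feats

print(first_order_features(["sold", "cars", "in"], ["V", "N", "P"], 0, 2))
```

Each string is then hashed or indexed into the sparse feature vector that the dot-product scoring functions consume.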
3 Experiments and Results
In all experiments, we trained our models using the averaged perceptron (Freund and Schapire, 1999), following the extension of Collins (2002) for structured prediction problems.
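The training loop can be sketched as below; decode and features are hypothetical interfaces (decode(w, x) returns the best-scoring structure under weights w, features(x, y) returns a structure's sparse feature vector as (index, value) pairs), not the paper's code.

```python
# Sketch of structured training with the averaged perceptron (Collins, 2002).
def train_averaged_perceptron(data, decode, features, dim, epochs=3):
    w = [0.0] * dim        # current weight vector
    total = [0.0] * dim    # running sum of weight vectors, for averaging
    steps = 0
    for _ in range(epochs):
        for x, gold in data:
            pred = decode(w, x)
            if pred != gold:   # standard perceptron update on a mistake
                for i, v in features(x, gold):
                    w[i] += v
                for i, v in features(x, pred):
                    w[i] -= v
            for i in range(dim):
                total[i] += w[i]
            steps += 1
    return [t / steps for t in total]   # averaged parameters

# Toy usage: two "structures" 0 and 1, gold is always 1.
feat_map = {0: [(0, 1.0)], 1: [(1, 1.0)]}
features = lambda x, y: feat_map[y]
decode = lambda w, x: 0 if w[0] >= w[1] else 1
avg = train_averaged_perceptron([("x", 1)], decode, features, dim=2, epochs=3)
print(avg)  # [-1.0, 1.0]
```

Averaging the weight vectors over all updates is what gives the "averaged" perceptron its regularizing effect, consistent with the benefit over the unaveraged model reported in Table 2.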
To train models, we used "projectivized" versions of the training dependency trees.2
1 We are grateful to the providers of the treebanks that constituted the data for the shared task (Hajic et al., 2004; Aduriz et al., 2003; Marti et al., 2007; Chen et al., 2003; Bohmova et al., 2003; Marcus et al., 1993; Johansson and Nugues, 2007; Prokopidis et al., 2005; Csendes et al., 2005; Montemagni et al., 2003; Oflazer et al., 2003).
2 We obtained projective trees for training sentences by running the projective parser with an oracle model (that assigns a score of +1 to correct dependencies and -1 otherwise).
Table 2: Labeled attachment scores on validation data (~10,000 tokens per language), for different models that exploit increasing orders of factorization: First-Order (no averaging); First-Order; Higher-Order with ch; Higher-Order with ch, cmo; Higher-Order with ch, cmi, cmo.
3.1 Impact of Higher-Order Factorization
Our first set of experiments looks at the performance of different factorizations. We selected three languages with a large number of training sentences, namely Catalan, Czech and English. To evaluate models, we held out the training sentences that cover the first 10,000 tokens; the rest was used for training.
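The held-out split described above can be sketched as follows; the function name and interface are ours, for illustration only.

```python
# Sketch of the validation split: hold out the initial sentences covering the
# first n_tokens tokens, train on the rest.
def split_heldout(sentences, n_tokens=10000):
    covered, i = 0, 0
    while i < len(sentences) and covered < n_tokens:
        covered += len(sentences[i])   # count this sentence's tokens
        i += 1
    return sentences[:i], sentences[i:]   # (validation, training)

# Toy corpus of three sentences with 6000, 5000 and 2000 tokens.
sents = [["a"] * 6000, ["b"] * 5000, ["c"] * 2000]
val, train = split_heldout(sents)
print(len(val), len(train))  # 2 1
```

Note the split keeps whole sentences, so the validation set may slightly exceed 10,000 tokens, as in the toy example.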
We compared four models at increasing orders of factorization. The first is a first-order model. The second model is similar to that of McDonald and Pereira (2006): a factor consists of a main labeled dependency and the head child closest to the modifier (ch). The third model incorporates the modifier child outside the main dependency in the factorization (cmo). Finally, the last model incorporates the modifier child inside the dependency span (cmi), thus corresponding to the complete higher-order model presented in the previous section.
Table 2 shows the accuracies of the models on validation data. Each model was trained for up to 10 epochs, and evaluated at the end of each epoch; we report the best accuracy of these evaluations. Clearly, the accuracy increases as the factors include richer information in terms of second-order relations. The richest model obtains the best accuracy in the three languages, being much better than that of the first-order model. The table also reports the accuracy of an unaveraged first-order model, illustrating the benefits of parameter averaging.
3.2 Results on the Multilingual Track
We trained a higher-order model for each language, using the averaged perceptron. In the experiments presented above we observed that the algorithm does not over-fit, and that after two or three training epochs only small variations in accuracy occur. Based on this fact, we designed a criterion to train models: we ran the training algorithm for up to three training days of computation, or a maximum of 15 epochs. For Basque, Chinese and Turkish we could complete the 15 epochs. For Arabic and Catalan, we could only complete 2 epochs. Table 3 reports the performance of the higher-order projective models on the ten languages of the multilingual track.

Table 3: Performance of the higher-order projective models on the multilingual track of the CoNLL-2007 task. The first two columns report the speed (in sentences per minute) and memory requirements of the training algorithm; these evaluations were made on the first 1,000 training sentences with a Dual-Core AMD Opteron Processor 256 at 1.8GHz with 4GB of memory. The last two columns report unlabelled (UAS) and labelled (LAS) attachment scores on test data.
4 Conclusion
We have presented dependency parsing models that exploit higher-order factorizations of trees. Such factorizations allow the definition of second-order features associated with sibling and grand-parental relations. For some languages, our models obtain state-of-the-art results. One drawback of our approach is that the inference algorithms for higher-order models are very expensive. For languages with many dependency labels or long sentences, training and parsing becomes impractical for current machines. Thus, a promising line of research is the investigation of methods to efficiently incorporate higher-order relations in discriminative parsing.
Acknowledgments
I am grateful to Terry Koo, Amir Globerson and Michael Collins for their helpful comments relating to this work, and to the anonymous reviewers for their suggestions. A significant part of the system and the code was based on my previous system in the CoNLL-X task, developed with Mihai Surdeanu and Lluis Marquez at the UPC. The author was supported by the Catalan Ministry of Innovation, Universities and Enterprise.
