This
paper
provides
an
algorithmic
framework
for
learning
statistical
models
involving
directed
spanning
trees
,
or
equivalently
non-projective
dependency
structures
.
We
show
how
partition
functions
and
marginals
for
directed
spanning
trees
can
be
computed
by
an
adaptation
of
Kirchhoff
's
Matrix-Tree
Theorem
.
To
demonstrate
an
application
of
the
method
,
we
perform
experiments
which
use
the
algorithm
in
training
both
log-linear
and
max-margin
dependency
parsers
.
The
new
training
methods
give
improvements
in
accuracy
over
perceptron-trained
models
.
1
Introduction
Learning
with
structured
data
typically
involves
searching
or
summing
over
a
set
with
an
exponential
number
of
structured
elements
,
for
example
the
set
of
all
parse
trees
for
a
given
sentence
.
Methods
for
summing
over
such
structures
include
the
inside-outside
algorithm
for
probabilistic
context-free
grammars
(
Baker
,
1979
)
,
the
forward-backward
algorithm
for
hidden
Markov
models
(
Baum
et
al.
,
1970
)
,
and
the
belief-propagation
algorithm
for
graphical
models
(
Pearl
,
1988
)
.
These
algorithms
compute
marginal
probabilities
and
partition
functions
,
quantities
which
are
central
to
many
methods
for
the
statistical
modeling
of
complex
structures
(
e.g.
,
the
EM
algorithm
(
Baker
,
1979
;
Baum
et
al.
,
1970
)
,
contrastive
estimation
(
Smith
and
Eisner
,
2005
)
,
training
algorithms
for
CRFs
(
Lafferty
et
al.
,
2001
)
,
and
training
algorithms
for
max-margin
models
(
Bartlett
et
al.
,
2004
;
Taskar
et
al.
,
2004a
)
)
.
This
paper
describes
inside-outside-style
algorithms
for
the
case
of
directed
spanning
trees
.
These
structures
are
equivalent
to
non-projective
dependency
parses
(
McDonald
et
al.
,
2005b
)
,
and
more
generally
could
be
relevant
to
any
task
that
involves
learning
a
mapping
from
a
graph
to
an
underlying
spanning
tree
.
Unlike
the
case
for
projective
dependency
structures
,
partition
functions
and
marginals
for
non-projective
trees
cannot
be
computed
using
dynamic-programming
methods
such
as
the
inside-outside
algorithm
.
In
this
paper
we
describe
how
these
quantities
can
be
computed
by
adapting
a
well-known
result
in
graph
theory
:
Kirchhoff
's
Matrix-Tree
Theorem
(
Tutte
,
1984
)
.
A
naive
application
of
the
theorem
yields
$O(n^4)$ and $O(n^6)$
algorithms
for
computation
of
the
partition
function
and
marginals
,
respectively
.
However
,
our
adaptation
finds
the
partition
function
and
marginals
in
$O(n^3)$
time
using
simple
matrix
determinant
and
inversion
operations
.
We
demonstrate
an
application
of
the
new
inference
algorithm
to
non-projective
dependency
parsing
.
Specifically
,
we
show
how
to
implement
two
popular
supervised
learning
approaches
for
this
task
:
globally-normalized
log-linear
models
and
max-margin
models
.
Log-linear
estimation
critically
depends
on
the
calculation
of
partition
functions
and
marginals
,
which
can
be
computed
by
our
algorithms
.
For
max-margin
models
,
Bartlett
et
al.
(
2004
)
have
provided
a
simple
training
algorithm
,
based
on
exponentiated-gradient
(
EG
)
updates
,
that
requires
computation
of
marginals
and
can
thus
be
implemented
within
our
framework
.
Both
of
these
methods
explicitly
minimize
the
loss
incurred
when
parsing
the
entire
training
set
.
This
contrasts
with
the
online
learning
algorithms
used
in
previous
work
with
spanning-tree
models
(
McDonald
et
al.
,
2005b
)
.
We
applied
the
above
two
marginal-based
training
algorithms
to
six
languages
with
varying
degrees
of
non-projectivity
,
using
datasets
obtained
from
the
CoNLL-X
shared
task
(
Buchholz
and
Marsi
,
2006
)
.
Our
experimental
framework
compared
three
training
approaches
:
log-linear
models
,
max-margin
models
,
and
the
averaged
perceptron
.
Each
of
these
was
applied
to
both
projective
and
non-projective
parsing
.
Our
results
demonstrate
that
marginal-based
training
yields
models
which
outperform those trained using the averaged perceptron.
(Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 141-150, Prague, June 2007. © 2007 Association for Computational Linguistics)
In
summary
,
the
contributions
of
this
paper
are
:
We
introduce
algorithms
for
inside-outside-style
calculations
for
directed
spanning
trees
,
or
equivalently
non-projective
dependency
structures
.
These
algorithms
should
have
wide
applicability
in
learning
problems
involving
spanning-tree
structures
.
We
illustrate
the
utility
of
these
algorithms
in
log-linear
training
of
dependency
parsing
models
,
and
show
improvements
in
accuracy
when
compared
to
averaged-perceptron
training
.
We
also
train
max-margin
models
for
dependency
parsing
via
an
EG
algorithm
(
Bartlett
et
al.
,
2004
)
.
The
experiments
presented
here
constitute
the
first
application
of
this
algorithm
to
a
large-scale
problem
.
We
again
show
improved
performance
over
the
perceptron
.
The
goal
of
our
experiments
is
to
give
a
rigorous
comparative
study
of
the
marginal-based
training
algorithms
and
a
highly-competitive
baseline
,
the
averaged
perceptron
,
using
the
same
feature
sets
for
all
approaches
.
We
stress
,
however
,
that
the
purpose
of
this
work
is
not
to
give
competitive
performance
on
the
CoNLL
data
sets
;
this
would
require
further
engineering
of
the
approach
.
Similar
adaptations
of
the
Matrix-Tree
Theorem
have
been
developed
independently
and
simultaneously
by
Smith
and
Smith
(
2007
)
and
McDonald
and
Satta
(
2007
)
;
see
Section
5
for
more
discussion
.
2
Background
2.1
Discriminative
Dependency
Parsing
Dependency
parsing
is
the
task
of
mapping
a
sentence
x
to
a
dependency
structure
y.
Given
a
sentence
x
with
n
words
,
a
dependency
for
that
sentence
is
a
tuple
(
h
,
m
)
where $h \in [0 \ldots n]$ is the index of the head word in the sentence, and $m \in [1 \ldots n]$ is the index of a modifier word.
The
value
h
=
0
is
a
special
root-symbol
that
may
only
appear
as
the
head
of
a
dependency
.
We
use
D
(
x
)
to
refer
to
all
possible
dependencies
for
a
sentence
x
:
$D(x) = \{(h, m) : h \in [0 \ldots n],\ m \in [1 \ldots n]\}$.
A
dependency
parse
is
a
set
of
dependencies
that
forms
a
directed
tree
,
with
the
sentence
's
root-symbol
as
its
root
.
We will consider both projective trees, where dependencies are not allowed to cross, and non-projective trees, where crossing dependencies are allowed.
[Figure 1: Examples of the four types of dependency structures. We draw dependency arcs from head to modifier. Panel labels: Projective, Non-projective, Multi-Root; example sentence: "root He saw her".]
Dependency
annotations
for
some
languages
,
for
example
Czech
,
can
exhibit
a
significant
number
of
crossing
dependencies
.
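As an illustration of the crossing condition, the following minimal Python sketch tests whether a parse is projective. The head-array encoding (`heads[m-1]` is the head of word `m`, with 0 for the root-symbol) and the function name `is_projective` are assumptions made for this example, and the root-symbol is treated as sitting at position 0, to the left of the sentence.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """Return True if no two dependency arcs cross.

    heads[m - 1] is the head index of word m (m = 1..n); 0 denotes the
    root-symbol, assumed here to sit at position 0.
    """
    # Represent each dependency as an undirected span (left, right).
    arcs = [(min(h, m), max(h, m)) for m, h in enumerate(heads, start=1)]
    for a, b in arcs:
        for c, d in arcs:
            # Arcs cross when the second starts strictly inside the first
            # and ends strictly outside it.
            if a < c < b < d:
                return False
    return True


if __name__ == "__main__":
    print(is_projective([2, 0, 2]))     # "He saw her" with "saw" as root
    print(is_projective([3, 4, 0, 3]))  # arcs 3->1 and 4->2 cross
```

The quadratic loop is adequate for sentence-length inputs; it only checks crossings and does not verify that the heads form a tree.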
In
addition
,
we
consider
both
single-root
and
multi-root
trees
.
In
a
single-root
tree
y
,
the
root-symbol
has
exactly
one
child
,
while
in
a
multi-root
tree
,
the
root-symbol
has
one
or
more
children
.
This
distinction
is
relevant
as
our
training
sets
include
both
single-root
corpora
(
in
which
all
trees
are
single-root
structures
)
and
multi-root
corpora
(
in
which
some
trees
are
multi-root
structures
)
.
The
two
distinctions
described
above
are
orthogonal
,
yielding
four
classes
of
dependency
structures
;
see
Figure
1
for
examples
of
each
kind
of
structure
.
We use $\mathcal{T}_p^s(x)$ to denote the set of all possible projective single-root dependency structures for a sentence $x$, and $\mathcal{T}_{np}^s(x)$ to denote the set of single-root non-projective structures for $x$. The sets $\mathcal{T}_p^m(x)$ and $\mathcal{T}_{np}^m(x)$ are defined analogously for multi-root structures. In contexts where any class of dependency structures may be used, we use the notation $\mathcal{T}(x)$ as a placeholder that may be defined as $\mathcal{T}_p^s(x)$, $\mathcal{T}_{np}^s(x)$, $\mathcal{T}_p^m(x)$ or $\mathcal{T}_{np}^m(x)$.
Following
McDonald
et
al.
(
2005a
)
,
we
use
a
discriminative
model
for
dependency
parsing
.
Features
in
the
model
are
defined
through
a
function
f
(
x
,
h
,
m
)
which
maps
a
sentence
x
together
with
a
dependency
(
h
,
m
)
to
a
feature
vector
in
$\mathbb{R}^d$.
A
feature
vector
can
be
sensitive
to
any
properties
of
the
triple
(
x
,
h
,
m
)
.
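Given a parameter vector $w$, the optimal dependency structure for a sentence $x$ is
$$y^*(x; w) = \arg\max_{y \in \mathcal{T}(x)} \sum_{(h,m) \in y} w \cdot f(x, h, m) \qquad (1)$$
where the set $\mathcal{T}(x)$ can be defined as $\mathcal{T}_p^s(x)$, $\mathcal{T}_{np}^s(x)$, $\mathcal{T}_p^m(x)$ or $\mathcal{T}_{np}^m(x)$, depending on the type of parsing.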
The parameters $w$ will be learned from a training set $\{(x_i, y_i)\}_{i=1}^{N}$ where each $x_i$ is a sentence and each $y_i$ is a dependency structure.
Much
of
the
previous
work
on
learning
w
has
focused
on
training
local
models
(
see
Section
5
)
.
McDonald
et
al.
(
2005a
;
2005b
)
trained
global
models
using
online
algorithms
such
as
the
perceptron
algorithm
or
MIRA
.
In
this
paper
we
consider
training
algorithms
based
on
work
in
conditional
random
fields
(
CRFs
)
(
Lafferty
et
al.
,
2001
)
and
max-margin
methods
(
Taskar
et
al.
,
2004a
)
.
2.2
Three
Inference
Problems
This
section
highlights
three
inference
problems
which
arise
in
training
and
decoding
discriminative
dependency
parsers
,
and
which
are
central
to
the
approaches
described
in
this
paper
.
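Assume that we have a vector $\theta$ with values $\theta_{h,m} \in \mathbb{R}$ for all $(h, m) \in D(x)$; these values correspond to weights on the different dependencies in $D(x)$. Define a conditional distribution over all dependency structures $y \in \mathcal{T}(x)$ as follows:
$$P(y \mid x; \theta) = \frac{\exp\{\sum_{(h,m) \in y} \theta_{h,m}\}}{Z(x; \theta)} \qquad (2)$$
$$Z(x; \theta) = \sum_{y \in \mathcal{T}(x)} \exp\Big\{\sum_{(h,m) \in y} \theta_{h,m}\Big\}$$
The function $Z(x; \theta)$ is commonly referred to as the partition function.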
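The inference problems are then as follows:
Problem 1: Decoding: Find $\arg\max_{y \in \mathcal{T}(x)} \sum_{(h,m) \in y} \theta_{h,m}$.
Problem 2: Computation of the Partition Function: Calculate $Z(x; \theta)$.
Problem 3: Computation of the Marginals: For all $(h, m) \in D(x)$, calculate $\mu_{h,m}(x; \theta) = \sum_{y \in \mathcal{T}(x) : (h,m) \in y} P(y \mid x; \theta)$.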
Note
that
all
three
problems
require
a
maximization
or
summation
over
the
set
T
(
x
)
,
which
is
exponential
in
size
.
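To make the definitions concrete, the sketch below solves the decoding, partition-function and marginal problems by explicit enumeration of non-projective trees. It is exponential in $n$ and is meant only as a reference for very short sentences, for instance to check a faster implementation. The names `brute_force`, `enumerate_trees` and `is_tree`, and the dictionary encoding of $\theta$, are assumptions made for this illustration.

```python
import itertools
import math
from typing import Dict, Iterator, Tuple

Dep = Tuple[int, int]  # (head, modifier); head 0 is the root-symbol


def is_tree(heads: Tuple[int, ...]) -> bool:
    """True if every word reaches the root-symbol 0 by following heads."""
    for m in range(1, len(heads) + 1):
        seen, node = set(), m
        while node != 0:
            if node in seen:  # cycle, so not a tree
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def enumerate_trees(n: int, single_root: bool) -> Iterator[Tuple[int, ...]]:
    """Yield all non-projective trees over n words as head tuples."""
    for heads in itertools.product(range(n + 1), repeat=n):
        if single_root and sum(1 for h in heads if h == 0) != 1:
            continue
        if is_tree(heads):
            yield heads


def brute_force(theta: Dict[Dep, float], n: int, single_root: bool = True):
    """Return (argmax tree, Z, marginals) by summing over every tree."""
    z = 0.0
    mu = {dep: 0.0 for dep in theta}
    best, best_score = None, -math.inf
    for heads in enumerate_trees(n, single_root):
        deps = [(h, m) for m, h in enumerate(heads, start=1)]
        score = sum(theta[d] for d in deps)
        if score > best_score:            # Problem 1: decoding
            best, best_score = heads, score
        weight = math.exp(score)
        z += weight                       # Problem 2: partition function
        for d in deps:                    # Problem 3: unnormalized marginals
            mu[d] += weight
    return best, z, {d: v / z for d, v in mu.items()}


if __name__ == "__main__":
    n = 3
    theta = {(h, m): 0.0 for h in range(n + 1) for m in range(1, n + 1) if h != m}
    _, z, mu = brute_force(theta, n)
    print(z, mu[(0, 1)])
```

With all weights equal to zero, $Z$ simply counts trees, which gives a convenient sanity check against Cayley's formula.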
There
is
a
clear
motivation
for
being
able
to
solve
Problem
1
:
by
setting
$\theta_{h,m} = w \cdot f(x, h, m)$, the optimal dependency structure $y^*(x; w)$
(
see
Eq
.
1
)
can
be
computed
.
In
this
paper
the
motivation
for
solving
Problems
2
and
3
arises
from
training
algorithms
for
discriminative
models
.
As
we
will
describe
in
Section
4
,
both
log-linear
and
max-margin
models
can
be
trained
via
methods
that
make
direct
use
of
algorithms
for
Problems
2
and
3
.
In
the
case
of
projective
dependency
structures
(
i.e.
,
$\mathcal{T}(x)$ defined as $\mathcal{T}_p^s(x)$ or $\mathcal{T}_p^m(x)$
)
,
there
are
well-known
algorithms
for
all
three
inference
problems
.
Decoding
can
be
carried
out
using
Viterbi-style
dynamic-programming
algorithms
,
for
example
the
$O(n^3)$
algorithm
of
Eisner
(
1996
)
.
Computation
of
the
marginals
and
partition
function
can
also
be
achieved
in
$O(n^3)$
time
,
using
a
variant
of
the
inside-outside
algorithm
(
Baker
,
1979
)
applied
to
the
Eisner
(
1996
)
data
structures
(
Paskin
,
2001
)
.
In
the
non-projective
case
(
i.e.
,
$\mathcal{T}(x)$ defined as $\mathcal{T}_{np}^s(x)$ or $\mathcal{T}_{np}^m(x)$
)
,
McDonald
et
al.
(
2005b
)
describe
how
the
CLE
algorithm
(
Chu
and
Liu
,
1965
;
Edmonds
,
1967
)
can
be
used
for
decoding
.
However
,
it
is
not
possible
to
compute
the
marginals
and
partition
function
using
the
inside-outside
algorithm
.
We
next
describe
a
method
for
computing
these
quantities
in
$O(n^3)$
time
using
matrix
inverse
and
determinant
operations
.
3
Spanning-tree
inference
using
the
Matrix-Tree
Theorem
In
this
section
we
present
algorithms
for
computing
the
partition
function
and
marginals
,
as
defined
in
Section
2.2
,
for
non-projective
parsing
.
We
first
reiterate
the
observation
of
McDonald
et
al.
(
2005a
)
that
non-projective
parses
correspond
to
directed
spanning
trees
on
a
complete
directed
graph
of
n
nodes
,
where
n
is
the
length
of
the
sentence
.
The
above
inference
problems
thus
involve
summation
over
the
set
of
all
directed
spanning
trees
.
Note
that
this
set
is
exponentially
large
,
and
there
is
no
obvious
method
for
decomposing
the
sum
into
dynamic-programming-like
subproblems
.
This
section
describes
how
a
variant
of
Kirchhoff
's
Matrix-Tree
Theorem
(
Tutte
,
1984
)
can
be
used
to
evaluate
the
partition
function
and
marginals
efficiently
.
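Let the weight of a dependency structure $y \in \mathcal{T}_{np}^s(x)$ be defined as:
$$\psi(y; \theta) = r_{\mathrm{root}(y)}(\theta) \prod_{(h,m) \in y \,:\, h \neq 0} A_{h,m}(\theta)$$
Here $\mathrm{root}(y)$ is the single child of the root-symbol in $y$, $r_m(\theta) = \exp\{\theta_{0,m}\}$ is the root-selection score of word $m$, and $A_{h,m}(\theta) = \exp\{\theta_{h,m}\}$ is the weight of the dependency $(h, m)$, so that the partition function is $Z(x; \theta) = \sum_{y \in \mathcal{T}_{np}^s(x)} \psi(y; \theta)$. In the remainder of this section, we drop the notational dependence on $x$ for brevity.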
The
original
Matrix-Tree
Theorem
addressed
the
problem
of
counting
the
number
of
undirected
spanning
trees
in
an
undirected
graph
.
For
the
models
we
study
here
,
we
require
a
sum
of
weighted
and
directed
spanning
trees
.
Tutte
(
1984
)
extended
the
Matrix-Tree
Theorem
to
this
case
.
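We briefly summarize his method below. Let $G$ be the complete directed graph whose nodes are the words $1 \ldots n$ and whose edge from $h$ to $m$ has weight $A_{h,m}(\theta)$, with $A_{m,m}(\theta) = 0$. Define the Laplacian matrix $L(\theta) \in \mathbb{R}^{n \times n}$ of $G$ by $L_{h,m}(\theta) = \sum_{h'=1}^{n} A_{h',m}(\theta)$ if $h = m$, and $L_{h,m}(\theta) = -A_{h,m}(\theta)$ otherwise. For a matrix $X$, let $X^{(h,m)}$ denote the minor of $X$ with respect to row $h$ and column $m$, i.e., the determinant of the matrix formed by deleting row $h$ and column $m$ from $X$. Finally, define the weight of any directed spanning tree of $G$ to be the product of the weights $A_{h,m}(\theta)$ for the edges in that tree.
Theorem 1 (Tutte, 1984) For any node $m$, the minor $L^{(m,m)}(\theta)$ is equal to the sum of the weights of all directed spanning trees of $G$ that are rooted at $m$.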
3.1
Partition
functions
via
matrix
determinants
From Theorem 1, it directly follows that
$$Z(\theta) = \sum_{m=1}^{n} r_m(\theta)\, L^{(m,m)}(\theta)$$
The above would require calculating $n$ determinants, resulting in $O(n^4)$ complexity. However, as we show below, $Z(\theta)$ may be obtained in $O(n^3)$ time using a single determinant evaluation.
Define a new matrix $\hat{L}(\theta)$ to be $L(\theta)$ with the first row replaced by the root-selection scores:
$$\hat{L}_{h,m}(\theta) = \begin{cases} r_m(\theta) & \text{if } h = 1 \\ L_{h,m}(\theta) & \text{if } h > 1 \end{cases}$$
This matrix allows direct computation of the partition function, as the following proposition shows.
Proposition 1 The partition function in Eq. 5 is given by $Z(\theta) = |\hat{L}(\theta)|$.
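A minimal sketch of Proposition 1 in plain Python follows, assuming the Laplacian has column sums of incoming edge weights on its diagonal and negated edge weights elsewhere. The helper names (`determinant`, `l_hat`, `partition_function_single_root`) and the dictionary encoding of $\theta$ are illustrative choices, and a production implementation would more likely use a numerical linear-algebra library and work with log-determinants for numerical stability.

```python
import math
from typing import Dict, List, Tuple

Dep = Tuple[int, int]  # (head, modifier); head 0 is the root-symbol


def determinant(matrix: List[List[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting, O(n^3)."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for k in range(col + 1, n):
            factor = a[k][col] / a[col][col]
            for c in range(col, n):
                a[k][c] -= factor * a[col][c]
    return det


def l_hat(theta: Dict[Dep, float], n: int) -> List[List[float]]:
    """Laplacian of the word graph with row 1 replaced by root scores."""
    # A[h-1][m-1] = exp(theta[h, m]) for h != m, zero on the diagonal.
    A = [[math.exp(theta[(h, m)]) if h != m else 0.0
          for m in range(1, n + 1)] for h in range(1, n + 1)]
    L = [[sum(A[k][m] for k in range(n)) if h == m else -A[h][m]
          for m in range(n)] for h in range(n)]
    L[0] = [math.exp(theta[(0, m)]) for m in range(1, n + 1)]
    return L


def partition_function_single_root(theta: Dict[Dep, float], n: int) -> float:
    """Z(theta) for single-root non-projective trees via one determinant."""
    return determinant(l_hat(theta, n))


if __name__ == "__main__":
    n = 3
    theta = {(h, m): 0.0 for h in range(n + 1) for m in range(1, n + 1) if h != m}
    print(partition_function_single_root(theta, n))  # counts single-root trees
```

For two words the determinant expands to $r_1 A_{1,2} + r_2 A_{2,1}$, the weights of the only two single-root trees, which gives a quick hand check.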
3.2
Marginals
via
matrix
inversion
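The marginals we require are given by
$$\mu_{h,m}(\theta) = \frac{1}{Z(\theta)} \sum_{y \in \mathcal{T}_{np}^s \,:\, (h,m) \in y} \exp\Big\{\sum_{(h',m') \in y} \theta_{h',m'}\Big\}$$
To calculate these marginals efficiently for all values of $(h, m)$ we use a well-known identity relating the log partition-function to marginals:
$$\mu_{h,m}(\theta) = \frac{\partial \log Z(\theta)}{\partial \theta_{h,m}}$$
Since the partition function in this case has a closed-form expression (i.e., the determinant of a matrix constructed from $\theta$), the marginals can also be obtained in closed form. Using the chain rule, the derivative of the log partition-function in Proposition 1 is
$$\mu_{h,m}(\theta) = \frac{\partial \log |\hat{L}(\theta)|}{\partial \theta_{h,m}} = \sum_{h'=1}^{n} \sum_{m'=1}^{n} \frac{\partial \log |\hat{L}(\theta)|}{\partial \hat{L}_{h',m'}(\theta)} \, \frac{\partial \hat{L}_{h',m'}(\theta)}{\partial \theta_{h,m}}$$
To perform the derivative, we use the identity
$$\frac{\partial \log |X|}{\partial X} = \left(X^{-1}\right)^{T}$$
together with the fact that each $\theta_{h,m}$ appears in at most two entries of $\hat{L}(\theta)$. This gives
$$\mu_{h,m}(\theta) = (1 - \delta_{1,m})\, A_{h,m}(\theta) \left[\hat{L}^{-1}(\theta)\right]_{m,m} - (1 - \delta_{h,1})\, A_{h,m}(\theta) \left[\hat{L}^{-1}(\theta)\right]_{m,h}$$
$$\mu_{0,m}(\theta) = r_m(\theta) \left[\hat{L}^{-1}(\theta)\right]_{m,1}$$
where $\delta_{h,m}$ is the Kronecker delta. Thus, the complexity of evaluating all the relevant marginals is dominated by the matrix inversion, and the total complexity is therefore $O(n^3)$.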
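The inverse-based computation of single-root marginals can be sketched as follows. This is an illustrative implementation under the assumption that $\hat{L}$ is the Laplacian with its first row replaced by root scores. Indices are 0-based in the code, so the paper's "first row" is row 0, and the names `invert` and `single_root_marginals` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def invert(matrix: Matrix) -> Matrix:
    """Gauss-Jordan inverse with partial pivoting, O(n^3)."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for k in range(n):
            if k != col:
                f = aug[k][col]
                aug[k] = [v - f * w for v, w in zip(aug[k], aug[col])]
    return [row[n:] for row in aug]


def single_root_marginals(A: Matrix, r: List[float]) -> Tuple[List[float], Matrix]:
    """Marginals for single-root non-projective trees.

    A[h][m] is the weight of word h+1 heading word m+1 (zero diagonal);
    r[m] is the root-selection score of word m+1.
    Returns (mu_root, mu) with the same indexing.
    """
    n = len(r)
    L_hat = [[sum(A[k][m] for k in range(n)) if h == m else -A[h][m]
              for m in range(n)] for h in range(n)]
    L_hat[0] = list(r)  # first row replaced by root scores
    inv = invert(L_hat)
    mu_root = [r[m] * inv[m][0] for m in range(n)]
    mu = [[0.0] * n for _ in range(n)]
    for h in range(n):
        for m in range(n):
            if h == m:
                continue
            # The diagonal entry of column m holds A[h][m] unless m is row 0.
            first = A[h][m] * inv[m][m] if m != 0 else 0.0
            # The off-diagonal entry (h, m) holds -A[h][m] unless h is row 0.
            second = A[h][m] * inv[m][h] if h != 0 else 0.0
            mu[h][m] = first - second
    return mu_root, mu


if __name__ == "__main__":
    A = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
    mu_root, mu = single_root_marginals(A, [1.0, 1.0, 1.0])
    print(mu_root, mu)
```

A useful check is that, for every modifier, the marginals over all possible heads (including the root-symbol) sum to one, and that the root marginals themselves sum to one in the single-root case.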
3.3 Multiple roots
In the case of multiple roots, we can still compute the partition function and marginals efficiently. In fact, the derivation of this case is simpler than for single-root structures. Create an extended graph $G'$ which augments $G$ with a dummy root node that has edges pointing to all of the existing nodes, weighted by the appropriate root-selection scores. Note that there is a bijection between directed spanning trees of $G'$ rooted at the dummy root and multi-root structures $y \in \mathcal{T}_{np}^m(x)$. Thus, Theorem 1 can be used to compute the partition function directly: construct a Laplacian matrix $L(\theta)$ for $G'$ and compute the minor $L^{(0,0)}(\theta)$. Since this minor is also a determinant, the marginals can be obtained analogously to the single-root case. More concretely, this technique corresponds to defining the matrix $\hat{L}(\theta)$ as
$$\hat{L}(\theta) = L(\theta) + \mathrm{diag}(r(\theta))$$
where $L(\theta)$ on the right-hand side is the Laplacian of the original graph $G$, $r(\theta)$ is the vector of root-selection scores, and $\mathrm{diag}(v)$ is the diagonal matrix with the vector $v$ on its diagonal.
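For the multi-root case, a sketch along the same lines is given below. It assumes the matrix is the Laplacian of the word graph plus the root scores on the diagonal, and obtains the marginals by differentiating the log-determinant, which yields $\mu_{h,m} = A_{h,m}([\hat{L}^{-1}]_{m,m} - [\hat{L}^{-1}]_{m,h})$ and $\mu_{0,m} = r_m [\hat{L}^{-1}]_{m,m}$. These closed forms are our own derivation for this illustration, and the function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def det_and_inverse(matrix: Matrix) -> Tuple[float, Matrix]:
    """Determinant and inverse from a single Gauss-Jordan pass, O(n^3)."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        if pivot != col:
            aug[col], aug[pivot] = aug[pivot], aug[col]
            det = -det
        p = aug[col][col]
        det *= p
        aug[col] = [v / p for v in aug[col]]
        for k in range(n):
            if k != col:
                f = aug[k][col]
                aug[k] = [v - f * w for v, w in zip(aug[k], aug[col])]
    return det, [row[n:] for row in aug]


def multi_root_inference(A: Matrix, r: List[float]):
    """Return (Z, mu_root, mu) for multi-root non-projective trees.

    A[h][m] is the weight of word h+1 heading word m+1 (zero diagonal);
    r[m] is the root-selection score of word m+1.
    """
    n = len(r)
    # Laplacian of the word graph plus diag(r): the (0,0) minor of G'.
    L = [[r[m] + sum(A[k][m] for k in range(n)) if h == m else -A[h][m]
          for m in range(n)] for h in range(n)]
    z, inv = det_and_inverse(L)
    mu_root = [r[m] * inv[m][m] for m in range(n)]
    mu = [[A[h][m] * (inv[m][m] - inv[m][h]) if h != m else 0.0
           for m in range(n)] for h in range(n)]
    return z, mu_root, mu


if __name__ == "__main__":
    A = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
    print(multi_root_inference(A, [1.0, 1.0, 1.0]))
```

With unit weights on three words, $Z$ counts the labeled trees on four nodes (the three words plus the dummy root).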
3.4 Labeled trees
The techniques above extend easily to the case where dependencies are labeled. For a model with $L$ different labels, it suffices to define the edge and root scores as $A_{h,m}(\theta) = \sum_{\ell=1}^{L} \exp\{\theta_{h,m,\ell}\}$ and $r_m(\theta) = \sum_{\ell=1}^{L} \exp\{\theta_{0,m,\ell}\}$. The partition function over labeled trees is obtained by operating on these values as described previously, and the marginals are given by an application of the chain rule. Both inference problems are solvable in $O(n^3 + Ln^2)$ time.
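The label-collapsing step can be sketched as below. The dictionary encoding of labeled scores, keyed by `(h, m, label)`, and the function names are assumptions for this example. The labeled marginal follows from the chain rule: if $\mu_{h,m}$ is the unlabeled marginal computed from the collapsed scores, then $\mu_{h,m,\ell} = \mu_{h,m} \exp\{\theta_{h,m,\ell}\} / A_{h,m}(\theta)$.

```python
import math
from typing import Dict, List, Tuple

LabeledDep = Tuple[int, int, int]  # (head, modifier, label index)


def collapse_labels(theta: Dict[LabeledDep, float], n: int, num_labels: int):
    """Sum exponentiated label scores into edge weights A and root scores r.

    Runs in O(L n^2) time; A and r use 0-based word indices.
    """
    A = [[0.0] * n for _ in range(n)]
    r = [0.0] * n
    for m in range(1, n + 1):
        r[m - 1] = sum(math.exp(theta[(0, m, l)]) for l in range(num_labels))
        for h in range(1, n + 1):
            if h != m:
                A[h - 1][m - 1] = sum(math.exp(theta[(h, m, l)])
                                      for l in range(num_labels))
    return A, r


def labeled_marginal(mu_hm: float, theta_hml: float, a_hm: float) -> float:
    """Chain rule: split an unlabeled marginal across labels."""
    return mu_hm * math.exp(theta_hml) / a_hm


if __name__ == "__main__":
    n, num_labels = 2, 3
    theta = {(h, m, l): 0.0
             for h in range(n + 1) for m in range(1, n + 1) if h != m
             for l in range(num_labels)}
    A, r = collapse_labels(theta, n, num_labels)
    print(A, r, labeled_marginal(0.6, 0.0, A[0][1]))
```

The collapsed `A` and `r` can then be passed unchanged to any unlabeled partition-function or marginal routine.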
4
Training
Algorithms
This
section
describes
two
methods
for
parameter
estimation
that
rely
explicitly
on
the
computation
of
the
partition
function
and
marginals
.
4.1
Log-Linear
Estimation
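In a conditional log-linear model, the distribution over dependency structures for a sentence $x$ is
$$P(y \mid x; w) = \frac{\exp\{\sum_{(h,m) \in y} w \cdot f(x, h, m)\}}{Z(x; w)} \qquad (7)$$
where $Z(x; w)$ is the partition function, a sum over $\mathcal{T}_p^s(x)$, $\mathcal{T}_{np}^s(x)$, $\mathcal{T}_p^m(x)$ or $\mathcal{T}_{np}^m(x)$. The parameters $w$ are estimated by minimizing a regularized negative log-likelihood $L(w)$ of the training set. The parameter $C > 0$ is a constant dictating the level of regularization in the model.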
Since
L
(
w
)
is
a
convex
function
,
gradient
descent
methods
can
be
used
to
search
for
the
global
minimum
.
Such
methods
typically
involve
repeated
computation
of
the
loss
L
(
w
)
and
gradient
,
requiring
efficient
implementations
of
both
functions
.
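Note that the log-probability of a parse is
$$\log P(y \mid x; w) = \sum_{(h,m) \in y} w \cdot f(x, h, m) - \log Z(x; w)$$
so that the main issue in calculating the loss function $L(w)$ is the evaluation of the partition functions $Z(x_i; w)$. The gradient of the loss is a sum of the gradient of the regularizer and per-example terms of the form
$$\frac{\partial \log P(y_i \mid x_i; w)}{\partial w} = \sum_{(h,m) \in y_i} f(x_i, h, m) - \sum_{(h,m) \in D(x_i)} \mu_{h,m}(x_i; w)\, f(x_i, h, m)$$
where $\mu_{h,m}(x; w) = \sum_{y \in \mathcal{T}(x) : (h,m) \in y} P(y \mid x; w)$ is the marginal probability of a dependency $(h, m)$. Thus, the main issue in the evaluation of the gradient is the computation of the marginals $\mu_{h,m}(x_i; w)$.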
Note
that
Eq
.
7
forms
a
special
case
of
the
log-linear
distribution
defined
in
Eq
.
2
in
Section
2.2
.
If we set $\theta_{h,m} = w \cdot f(x, h, m)$ then we have $P(y \mid x; w) = P(y \mid x; \theta)$, $Z(x; w) = Z(x; \theta)$, and $\mu_{h,m}(x; w) = \mu_{h,m}(x; \theta)$.
Thus
in
the
projective
case
the
inside-outside
algorithm
can
be
used
to
calculate
the
partition
function
and
marginals
,
thereby
enabling
training
of
a
log-linear
model
;
in
the
non-projective
case
the
algorithms
in
Section
3
can
be
used
for
this
purpose
.
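To illustrate how marginals enter log-linear training, the sketch below computes the gradient of the negative log-probability of one training example as expected minus observed feature counts. The regularization term is omitted, and the sparse-feature representation and the names `nll_gradient` and `feats` are assumptions for this example rather than a description of the implementation used in the experiments.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

Dep = Tuple[int, int]          # (head, modifier)
Features = Dict[str, float]    # sparse feature vector


def nll_gradient(gold: Iterable[Dep],
                 marginals: Dict[Dep, float],
                 feats: Callable[[Dep], Features]) -> Dict[str, float]:
    """Gradient of -log P(y_i | x_i; w) with respect to w for one example.

    gold      -- dependencies of the gold-standard tree y_i
    marginals -- mu_{h,m}(x_i; w) for every (h, m) in D(x_i)
    feats     -- maps a dependency to its feature vector f(x_i, h, m)
    """
    grad: Dict[str, float] = defaultdict(float)
    for dep, mu in marginals.items():
        for name, value in feats(dep).items():
            grad[name] += mu * value   # expected feature counts
    for dep in gold:
        for name, value in feats(dep).items():
            grad[name] -= value        # observed feature counts
    return dict(grad)


if __name__ == "__main__":
    # Toy feature function: one indicator feature per head index.
    toy_feats = lambda dep: {f"head={dep[0]}": 1.0}
    mu = {(0, 1): 0.5, (0, 2): 0.5, (1, 2): 0.5, (2, 1): 0.5}
    print(nll_gradient([(0, 1), (1, 2)], mu, toy_feats))
```

The marginals can come from the inside-outside algorithm in the projective case or from the matrix-inverse computation in the non-projective case; the gradient code is the same in both settings.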
4.2
Max-Margin
Estimation
The
second
learning
algorithm
we
consider
is
the
large-margin
approach
for
structured
prediction
(
Taskar
et
al.
,
2004a
;
Taskar
et
al.
,
2004b
)
.
Learning
in
this
framework
again
involves
minimization
of
a
convex
function
L
(
w
)
.
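Let the margin for parse tree $y$ on the $i$'th training example be defined as
$$m_{i,y}(w) = \sum_{(h,m) \in y_i} w \cdot f(x_i, h, m) - \sum_{(h,m) \in y} w \cdot f(x_i, h, m)$$
The loss $L(w)$ combines a regularization term with the hinge loss on each training example, $\max_{y \in \mathcal{T}(x_i)} \left(E_{i,y} - m_{i,y}(w)\right)$, where $E_{i,y}$ is a measure of the loss, or number of errors, for parse $y$ on the $i$'th training sentence. In this paper we take $E_{i,y}$ to be the number of incorrect dependencies in the parse tree $y$ when compared to the gold-standard parse tree $y_i$. Note that $E_{i,y_i} = 0$ and $m_{i,y_i}(w) = 0$, so that the hinge loss is always nonnegative. In addition, the hinge loss is 0 if and only if $m_{i,y}(w) \geq E_{i,y}$ for all $y \in \mathcal{T}(x_i)$. Thus the hinge loss directly penalizes margins $m_{i,y}(w)$ which are less than their corresponding losses $E_{i,y}$.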
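As a small illustration of these quantities, the following sketch computes the error count and the hinge loss for one example over an explicit list of candidate trees. In practice the maximization ranges over all of $\mathcal{T}(x_i)$ and is never enumerated; the explicit candidate list, the head-tuple encoding and the function names are assumptions made for this example.

```python
from typing import Callable, Sequence, Tuple

Heads = Tuple[int, ...]  # heads[m - 1] is the head of word m; 0 is the root-symbol


def num_errors(gold: Heads, pred: Heads) -> int:
    """E_{i,y}: number of words whose head differs from the gold tree."""
    return sum(1 for g, p in zip(gold, pred) if g != p)


def hinge_loss(gold: Heads,
               candidates: Sequence[Heads],
               score: Callable[[Heads], float]) -> float:
    """max over candidates of E_{i,y} - m_{i,y}(w).

    score(y) stands for the model score of tree y, i.e. the sum of
    w . f(x_i, h, m) over its dependencies. The gold tree must be among
    the candidates for the result to be nonnegative.
    """
    gold_score = score(gold)
    return max(num_errors(gold, y) - (gold_score - score(y)) for y in candidates)


if __name__ == "__main__":
    gold = (2, 0, 2)
    scores = {(2, 0, 2): 3.0, (0, 1, 2): 2.5}
    print(hinge_loss(gold, list(scores), scores.__getitem__))
```

In the toy run, the competing tree makes two errors but trails the gold tree by a margin of only 0.5, so it violates the margin requirement by 1.5.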
Figure
2
shows
an
algorithm
for
minimizing
L
(
w
)
that
is
based
on
the
exponentiated-gradient
algorithm
for
large-margin
optimization
described
by
Bartlett
et
al.
(
2004
)
.
The algorithm maintains a set of weights $\theta_{i,h,m}$ for $i = 1 \ldots N$, $(h, m) \in D(x_i)$, which are updated example-by-example. The algorithm relies on the repeated computation of marginal values $\mu_{i,h,m}$, which are defined as follows (see footnote 1):
$$\mu_{i,h,m} = \sum_{y \in \mathcal{T}(x_i) \,:\, (h,m) \in y} P(y \mid x_i), \qquad P(y \mid x_i) = \frac{\exp\{\sum_{(h,m) \in y} \theta_{i,h,m}\}}{\sum_{y' \in \mathcal{T}(x_i)} \exp\{\sum_{(h,m) \in y'} \theta_{i,h,m}\}} \qquad (8)$$
A similar definition is used to derive marginal values $\mu'_{i,h,m}$ from the values $\theta'_{i,h,m}$. Computation of the $\mu$ and $\mu'$ values is again inference of the form described in Problem 3 in Section 2.2, and can be achieved using the inside-outside algorithm for projective structures, and the algorithms described in Section 3 for non-projective structures.

1 Bartlett et al. (2004) write $P(y \mid x_i)$ as $\alpha_{i,y}$. The $\alpha_{i,y}$ variables are dual variables that appear in the dual objective function, i.e., the convex dual of $L(w)$. Analysis of the algorithm shows that as the $\theta_{i,h,m}$ variables are updated, the dual variables converge to the optimal point of the dual objective, and the parameters $w$ converge to the minimum of $L(w)$.

Inputs: Training examples $\{(x_i, y_i)\}_{i=1}^{N}$.
Parameters: Regularization constant $C$, starting point $\beta$, number of passes over training set $T$.
Data Structures: Real values $\theta_{i,h,m}$ and $l_{i,h,m}$ for $i = 1 \ldots N$, $(h, m) \in D(x_i)$. Learning rate $\eta$.
where $\delta_{i,h,m} = (1 - l_{i,h,m} - \mu_{i,h,m})$ and the $\mu_{i,h,m}$ values are calculated from the $\theta_{i,h,m}$ values as described in Eq. 8.
Algorithm: Repeat $T$ passes over the training set, where each pass is as follows:
• For example $i$, calculate marginals $\mu_{i,h,m}$ from $\theta_{i,h,m}$ values, and marginals $\mu'_{i,h,m}$ from $\theta'_{i,h,m}$ values (see Eq. 8).
• Update the parameters:
Output: Parameter values $w$
Figure 2: The EG Algorithm for Max-Margin Estimation. The learning rate $\eta$ is halved each time the dual objective function (see Bartlett et al., 2004) fails to increase. In our experiments we chose $\beta = 9$, which was found to work well during development of the algorithm.
5
Related
Work
Global
log-linear
training
has
been
used
in
the
context
of
PCFG
parsing
(
Johnson
,
2001
)
.
Riezler
et
al.
(
2004
)
explore
a
similar
application
of
log-linear
models
to
LFG
parsing
.
Max-margin
learning
has
been
applied
to
PCFG
parsing
by
Taskar
et
al.
(
2004b
)
.
They
show
that
this
problem
has
a
QP
dual
of
polynomial
size
,
where
the
dual
variables
correspond
to
marginal
probabilities
of
CFG
rules
.
A
similar
QP
dual
may
be
obtained
for
max-margin
projective
dependency
parsing
.
However
,
for
non-projective
parsing
,
the
dual
QP
would
require
an
exponential
number
of
constraints
on
the
dependency
marginals
(
Chopra
,
1989
)
.
Nevertheless
,
alternative
optimization
methods
like
that
of
Tsochantaridis
et
al.
(
2004
)
,
or
the
EG
method
presented
here
,
can
still
be
applied
.
The
majority
of
previous
work
on
dependency
parsing
has
focused
on
local
(
i.e.
,
classification
of
individual
edges
)
discriminative
training
methods
(
Yamada
and
Matsumoto
,
2003
;
Nivre
et
al.
,
2004
;
Y.
Cheng
,
2005
)
.
Non-local
(
i.e.
,
classification
of
entire
trees
)
training
methods
were
used
by
McDonald
et
al.
(
2005a
)
,
who
employed
online
learning
.
Dependency
parsing
accuracy
can
be
improved
by
allowing
second-order
features
,
which
consider
more
than
one
dependency
simultaneously
.
McDonald
and
Pereira
(
2006
)
define
a
second-order
dependency
parsing
model
in
which
interactions
between
adjacent
siblings
are
allowed
,
and
Carreras
(
2007
)
defines
a
second-order
model
that
allows
grandparent
and
sibling
interactions
.
Both
authors
give
polytime
algorithms
for
exact
projective
parsing
.
By
adapting
the
inside-outside
algorithm
to
these
models
,
partition
functions
and
marginals
can
be
computed
for
second-order
projective
structures
,
allowing
log-linear
and
max-margin
training
to
be
applied
via
the
framework
developed
in
this
paper
.
For
higher-order
non-projective
parsing
,
however
,
computational
complexity
results
(
McDonald
and
Pereira
,
2006
;
McDonald
and
Satta
,
2007
)
indicate
that
exact
solutions
to
the
three
inference
problems
of
Section
2.2
will
be
intractable
.
Exploration
of
approximate
second-order
non-projective
inference
is
a
natural
avenue
for
future
research
.
Two
other
groups
of
authors
have
independently
and
simultaneously
proposed
adaptations
of
the
Matrix-Tree
Theorem
for
structured
inference
on
directed
spanning
trees
(
McDonald
and
Satta
,
2007
;
Smith
and
Smith
,
2007
)
.
There
are
some
algorithmic
differences
between
these
papers
and
ours
.
First
,
we
define
both
multi-root
and
single-root
algorithms
,
whereas
the
other
papers
only
consider
multi-root
parsing
.
This
distinction
can
be
important
as
one
often
expects
a
dependency
structure
to
have
exactly
one
child
attached
to
the
root-symbol
,
as
is
the
case
in
a
single-root
structure
.
Second
,
McDonald
and
Satta
(
2007
)
propose
an
$O(n^5)$
algorithm
for
computing
the
marginals
,
as
opposed
to
the
$O(n^3)$
matrix-inversion
approach
used
by
Smith
and
Smith
(
2007
)
and
ourselves
.
In
addition
to
the
algorithmic
differences
,
both
groups
of
authors
consider
applications
of
the
Matrix-Tree
Theorem
which
we
have
not
discussed
.
For
example
,
both
papers
propose
minimum-risk
decoding
,
and
McDonald
and
Satta
(
2007
)
discuss
unsupervised
learning
and
language
modeling
,
while
Smith
and
Smith
(
2007
)
define
hidden-variable
models
based
on
spanning
trees
.
In
this
paper
we
used
EG
training
methods
only
for
max-margin
models
(
Bartlett
et
al.
,
2004
)
.
However
,
Globerson
et
al.
(
2007
)
have
recently
shown
how
EG
updates
can
be
applied
to
efficient
training
of
log-linear
models
.
6
Experiments
on
Dependency
Parsing
In
this
section
,
we
present
experimental
results
applying
our
inference
algorithms
for
dependency
parsing
models
.
Our
primary
purpose
is
to
establish
comparisons
along
two
relevant
dimensions
:
projective
training
vs.
non-projective
training
,
and
marginal-based
training
algorithms
vs.
the
averaged
perceptron
.
The
feature
representation
and
other
relevant
dimensions
are
kept
fixed
in
the
experiments
.
We
used
data
from
the
CoNLL-X
shared
task
on
multilingual
dependency
parsing
(
Buchholz
and
Marsi
,
2006
)
.
In
our
experiments
,
we
used
a
subset
consisting
of
six
languages
;
Table
1
gives
details
of
the
data
sets
used.2 For each language we created a validation set that was a subset of the CoNLL-X training set for that language.

2 Our subset includes the two languages with the lowest accuracy in the CoNLL-X evaluations (Turkish and Arabic), the language with the highest accuracy (Japanese), the most non-projective language (Dutch), a moderately non-projective language (Slovene), and a highly projective language (Spanish). All languages but Spanish have multi-root parses in their data. We are grateful to the providers of the treebanks that constituted the data of our experiments (Hajič et al., 2004; van der Beek et al., 2002; Kawata and Bartels, 2000; Džeroski et al., 2006; Civit and Martí, 2002; Oflazer et al., 2003).

[Table 1: Information for the languages in our experiments. The 2nd column (%cd) is the percentage of crossing dependencies in the training and validation sets. The last three columns report the size in tokens of the training, validation and test sets.]
The
remainder
of
each
training
set
was
used
to
train
the
models
for
the
different
languages
.
The
validation
sets
were
used
to
tune
the
meta-parameters
(
e.g.
,
the
value
of
the
regularization
constant
C
)
of
the
different
training
algorithms
.
We
used
the
official
test
sets
and
evaluation
script
from
the
CoNLL-X
task
.
All
of
the
results
that
we
report
are
for
unlabeled
dependency
parsing.3
The
non-projective
models
were
trained
on
the
CoNLL-X
data
in
its
original
form
.
Since
the
projective
models
assume
that
the
dependencies
in
the
data
are
non-crossing
,
we
created
a
second
training
set
for
each
language
where
non-projective
dependency
structures
were
automatically
transformed
into
projective
structures
.
All
projective
models
were
trained
on
these
new
training
sets.4
Our
feature
space
is
based
on
that
of
McDonald
et
al.
(
2005a
)
.
We
performed
experiments
using
three
training
algorithms
:
the
averaged
perceptron
(
Collins
,
2002
)
,
log-linear
training
(
via
conjugate
gradient
descent
)
,
and
max-margin
training
(
via
the
EG
algorithm
)
.
Each
of
these
algorithms
was
trained
using
projective
and
non-projective
methods
,
yielding
six
training
settings
per
language
.
The
different
training
algorithms
have
various
meta-parameters
,
which
we
optimized
on
the
validation
set
for
each
language
/
training-setting
combination
.
The averaged perceptron has a single meta-parameter, namely the number of iterations over the training set.

3 Our algorithms also support labeled parsing (see Section 3.4). Initial experiments with labeled models showed the same trend that we report here for unlabeled parsing, so for simplicity we conducted extensive experiments only for unlabeled parsing.

4 The transformations were performed by running the projective parser with score +1 on correct dependencies and -1 otherwise: the resulting trees are guaranteed to be projective and to have a minimum loss with respect to the correct tree. Note that only the training sets were transformed.

5 It should be noted that McDonald et al. (2006) use a richer feature set that is incomparable to our features.

[Table 2: Test data results for the perceptron, max-margin and log-linear models. The p and np columns show results with projective and non-projective training respectively.]

[Table 3: Results for the three training algorithms on the different languages (P = perceptron, E = EG, L = log-linear models). AV is an average across the results for the different languages.]
The
log-linear
models
have
two
meta-parameters
:
the
regularization
constant
C
and
the
number
of
gradient
steps
T
taken
by
the
conjugate-gradient
optimizer
.
The
EG
approach
also
has
two
meta-parameters
:
the
regularization
constant
C
and
the
number
of
iterations
,
T.6
For
models
trained
using
non-projective
algorithms
,
both
projective
and
non-projective
parsing
was
tested
on
the
validation
set
,
and
the
highest
scoring
of
these
two
approaches
was
then
used
to
decode
test
data
sentences
.
Table
2
reports
test
results
for
the
six
training
scenarios
.
These
results
show
that
for
Dutch
,
which
is
the
language
in
our
data
that
has
the
highest
number
of
crossing
dependencies
,
non-projective
training
gives
significant
gains
over
projective
training
for
all
three
training
methods
.
For
the
other
languages
,
non-projective
training
gives
similar
or
even
improved
performance
over
projective
training
.
Table
3
gives
an
additional
set
of
results
,
which
were
calculated
as
follows
.
For
each
of
the
three
training
methods
,
we
used
the
validation
set
results
to
choose
between
projective
and
non-projective
training
.
This
allows
us
to
make
a
direct
comparison
of
the
three
training
algorithms
.
Table 3 shows the results of this comparison.7

6 We trained the perceptron for 100 iterations, and chose the iteration which led to the best score on the validation set. Note that in all of our experiments, the best perceptron results were actually obtained with 30 or fewer iterations. For the log-linear and EG algorithms we tested a number of values for $C$, and for each value of $C$ ran 100 gradient steps or EG iterations, finally choosing the best combination of $C$ and $T$ found in validation.
The
results
show
that
log-linear
and
max-margin
models
both
give
a
higher
average
accuracy
than
the
perceptron
.
For
some
languages
(
e.g.
,
Japanese
)
,
the
differences
from
the
perceptron
are
small
;
however
for
other
languages
(
e.g.
,
Arabic
,
Dutch
or
Slovene
)
the
improvements
seen
are
quite
substantial
.
7
Conclusions
This
paper
describes
inference
algorithms
for
spanning-tree
distributions
,
focusing
on
the
fundamental
problems
of
computing
partition
functions
and
marginals
.
Although
we
concentrate
on
log-linear
and
max-margin
estimation
,
the
inference
algorithms
we
present
can
serve
as
black-boxes
in
many
other
statistical
modeling
techniques
.
Our
experiments
suggest
that
marginal-based
training
produces
more
accurate
models
than
perceptron
learning
.
Notably
,
this
is
the
first
large-scale
application
of
the
EG
algorithm
,
and
shows
that
it
is
a
promising
approach
for
structured
learning
.
In
line
with
McDonald
et
al.
(
2005b
)
,
we
confirm
that
spanning-tree
models
are
well-suited
to
dependency
parsing
,
especially
for
highly
non-projective
languages
such
as
Dutch
.
Moreover
,
spanning-tree
models
should
be
useful
for
a
variety
of
other
problems
involving
structured
data
.
Acknowledgments
The
authors
would
like
to
thank
the
anonymous
reviewers
for
their
constructive
comments
.
In
addition
,
the
authors
gratefully
acknowledge
the
following
sources
of
support
.
Terry
Koo
was
funded
by
from
NTT
,
Agmt
.
Amir
Globerson
was
supported
by
a
fellowship
from
the
Rothschild
Foundation
-
Yad
Hanadiv
.
Xavier
Carreras
was
supported
by
the
Catalan
Ministry
of
Innovation
,
Universities
and
Enterprise
,
and
a
grant
from
NTT
,
Agmt
.
Dtd.
6
/
21
/
1998
.
Michael
Collins
was
funded
by
NSF
grants
0347631
and
DMS-0434222
.
7We
ran
the
sign
test
at
the
sentence
level
to
measure
the
statistical
significance
of
the
results
aggregated
across
the
six
languages
.
Out
of
2,472
sentences
total
,
log-linear
models
gave
improved
parses
over
the
perceptron
on
448
sentences
,
and
worse
parses
on
343
sentences
.
The
max-margin
method
gave
improved
/
worse
parses
for
500
/
383
sentences
.
Both
results
are
significant
with
$p < 0.001$
.
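The significance levels quoted in this footnote can be checked with a short exact sign test. The sketch below assumes a two-sided test in which sentences where both systems produce equally good parses are discarded as ties, which is the usual convention but is not spelled out above; the function name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test p-value, ignoring ties.

    Under the null hypothesis each non-tied sentence is a win or a loss
    with probability 1/2, so the smaller count follows Binomial(n, 1/2).
    """
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1))  # P(X <= k) * 2^n
    return min(1.0, 2 * tail / 2 ** n)


if __name__ == "__main__":
    # Counts reported in the footnote: improved vs. worse parses.
    print(sign_test_p_value(448, 343))  # log-linear vs. perceptron
    print(sign_test_p_value(500, 383))  # max-margin vs. perceptron
```

Exact integer arithmetic via `math.comb` avoids underflow in the binomial tail, which would otherwise be a concern for several hundred trials.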
