In this paper, we describe a two-stage multilingual dependency parser used for the multilingual track of the CoNLL 2007 shared task. The system consists of two components: an unlabeled dependency parser using Gibbs sampling, which can incorporate sentence-level (global) features as well as token-level (local) features, and a dependency relation labeling module based on Support Vector Machines. Experimental results show that the global features are useful in all the languages.
1 Introduction
Making use of as many informative features as possible is crucial to obtaining high performance in machine learning based NLP. Recently, several methods for incorporating non-local features have been investigated, though such features often make models complex and thus complicate inference. Collins and Koo (2005) proposed a reranking method for phrase structure parsing with which any type of global feature in a parse tree can be used. For dependency parsing, McDonald and Pereira (2006) proposed a method which can incorporate some types of global features, and Riedel and Clarke (2006) studied a method using integer linear programming which can incorporate global linguistic constraints.
In this paper, we study dependency parsing using Gibbs sampling, which can incorporate any type of global feature in a sentence. The parser determines unlabeled dependency structures only, and we attach dependency relation labels afterwards using Support Vector Machines. We participated in the multilingual track of the CoNLL 2007 shared task and evaluated the system on the data sets of 10 languages (Hajic et al., 2004; Aduriz et al., 2003; Marti et al., 2007; Chen et al., 2003; Bohmova et al., 2003; Marcus et al., 1993; Johansson and Nugues, 2007; Prokopidis et al., 2005; Csendes et al., 2005; Montemagni et al., 2003; Oflazer et al., 2003). The rest of the paper describes the specification of the system and the evaluation results.
2 Unlabeled Dependency Parsing using Global Features

2.1 Probabilistic Model
The sentence-level model is defined as the following log-linear distribution over head configurations:

P_{Λ,M}(h | w) = (1 / Z_{Λ,M}(w)) Q_M(h | w) exp( Σ_{k=1..K} λ_k f_k(w, h) ),

where Q_M(h | w) is an initial distribution, f_k(w, h) is the k-th feature function, K is the number of feature functions, λ_k is the weight of the k-th feature, and Z_{Λ,M}(w) is a normalization factor. H(w) is the set of possible configurations of heads for a given sentence w. Although it would be appropriate for H(w) to be the set of projective trees for projective languages, and the set of non-projective trees (which is a superset of the set of projective trees) for non-projective languages, in this study we define H(w) to be the set of all possible graphs, which contains |w|^{|w|} elements.
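For intuition, the size of H(w) can be checked directly on a toy sentence. The sketch below (with 1-based token indices and exclusion of self-loops as assumptions) enumerates every head configuration:

```python
from itertools import product

def all_head_configs(n):
    """Enumerate H(w) for an n-token sentence: each token t (1..n)
    independently picks a head from {0 (root), 1, ..., n} minus {t},
    so graphs with cycles or crossing arcs are included too."""
    choices = [[h for h in range(n + 1) if h != t] for t in range(1, n + 1)]
    return list(product(*choices))

configs = all_head_configs(3)
print(len(configs))  # 3^3 = 27 configurations, most of them not trees
```

Each configuration is just a head vector; well-formedness (acyclicity, a single root) is deliberately not enforced, which is exactly the computational shortcut described in the text.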
P_{Λ,M}(h | w) and Q_M(h | w) are defined over H(w) (see footnote 1). The probability distribution P_{Λ,M}(h | w) is a joint distribution of all the heads conditioned on a sentence; therefore, we call this model the sentence-level model.
The feature function f_k(w, h) is defined on a sentence w with heads h, and we can use any information in the sentence without independence assumptions for the heads of the tokens; therefore, we call f_k(w, h) a sentence-level (global) feature.

Footnote 1: H(w) is a superset of the set of non-projective trees, and is an unnecessarily large set which contains ill-formed dependency trees such as trees with cycles. This issue may reduce parsing performance, but we adopt this approach for computational efficiency.
We define the initial distribution Q_M(h | w) as the product of q_M(h | w, t), the probability distribution of the head h of each t-th token, calculated with maximum entropy models:

Q_M(h | w) = Π_{t=1..|w|} q_M(h_t | w, t),

q_M(h | w, t) = (1 / Z_M(w, t)) exp( Σ_{l=1..L} μ_l g_l(w, t, h) ),

where Z_M(w, t) is a normalization factor, g_l(w, t, h) is the l-th feature function, L is the number of feature functions, and μ_l is the weight of the l-th feature. q_M(h | w, t) is a model of the head of a single token, calculated independently of the other tokens; therefore, we call q_M(h | w, t) the token-level model, and g_l(w, t, h) a token-level (local) feature.
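As a concrete sketch of the token-level model, q_M(h | w, t) is a softmax over candidate heads of one token. The feature extractor and weights below are toy stand-ins for g_l and μ_l, not the system's actual feature set:

```python
import math

def token_head_dist(weights, feats, candidates):
    """q_M(h | w, t): log-linear distribution over candidate heads of a
    single token. feats(h) returns the active local feature names when
    head h is chosen; weights maps feature name -> learned weight mu_l."""
    scores = {h: sum(weights.get(f, 0.0) for f in feats(h)) for h in candidates}
    z = sum(math.exp(s) for s in scores.values())  # normalization factor
    return {h: math.exp(s) / z for h, s in scores.items()}

# toy example: one token choosing a head among {0 (root), 1, 3}
q = token_head_dist({"head=1": 1.2, "head=3": 0.3},
                    lambda h: [f"head={h}"], [0, 1, 3])
```

The distribution always sums to one over the candidate heads, and the candidate with the highest total feature weight gets the largest probability.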
2.2 Decoding and Parameter Estimation
Let us consider how to find the optimal solution ĥ, given a sentence w, the parameters of the sentence-level model Λ = {λ_1, ..., λ_K}, and the parameters of the token-level model M = {μ_1, ..., μ_L}. Since the probabilistic model contains global features, efficient algorithms such as dynamic programming cannot be used, so we use Gibbs sampling to obtain an approximate solution.
Gibbs sampling can efficiently generate samples from high-dimensional probability distributions with complex dependencies among variables (Andrieu et al., 2003), and we assume that R samples {h^(1), ..., h^(R)} are generated from P_{Λ,M}(h | w) using Gibbs sampling. Then the marginal distribution of the head of the t-th token given w, P_t(h | w), is approximately calculated as follows:

P_t(h | w) ≈ (1/R) Σ_{r=1..R} δ(h, h_t^(r)),

where δ(i, j) is the Kronecker delta.
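The sampling-plus-counting procedure can be sketched as follows. Here `log_score` is any unnormalized sentence-level log-probability (a toy stand-in for log Q_M plus the weighted global features), and the head encoding is an assumption:

```python
import math
import random

def gibbs_marginals(n, log_score, sweeps=500, seed=0):
    """Estimate P_t(h | w) for every token by Gibbs sampling: each sweep
    re-draws each token's head from its conditional distribution given all
    the other heads, and the marginals are the relative frequencies of the
    sampled head values (the Kronecker-delta average in the text)."""
    rng = random.Random(seed)
    heads = [0] * n                       # heads[t] = head of token t+1 (0 = root)
    counts = [{} for _ in range(n)]
    for _ in range(sweeps):
        for t in range(n):
            cands = [h for h in range(n + 1) if h != t + 1]
            weights = []
            for h in cands:               # unnormalized conditional of one head
                heads[t] = h
                weights.append(math.exp(log_score(heads)))
            r = rng.random() * sum(weights)
            for h, wgt in zip(cands, weights):
                r -= wgt
                if r <= 0:
                    heads[t] = h
                    break
            counts[t][heads[t]] = counts[t].get(heads[t], 0) + 1
    return [{h: c / sweeps for h, c in ct.items()} for ct in counts]

# toy 2-token "sentence" whose score prefers token 1 under the root
# and token 2 under token 1
marginals = gibbs_marginals(2, lambda h: 2.0 * (h[0] == 0) + 2.0 * (h[1] == 1))
```

Because only one head is resampled at a time, each conditional is cheap to normalize even though the joint distribution over all heads has no tractable normalizer.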
In order to find a solution using the marginal distributions, we adopt the maximum spanning tree (MST) framework proposed by McDonald et al. (2005a). In this framework, scores for the possible edges in dependency graphs are defined, and the optimal dependency tree is found as the MST in which the summation of the edge scores is maximized.
Let s(i, j) denote the score of the edge from a parent node (head) i to a child node (dependent) j. We define s(i, j) as follows:

s(i, j) = log P_j(i | w).

We use the logarithm of the marginal distribution because the MST search algorithms maximize the summation of the edge scores, whereas it is the product of the marginal distributions that should be maximized.
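Concretely, the edge scores can be read off the estimated marginals. The smoothing floor below is an assumption, guarding against heads that never appeared in the samples:

```python
import math

def edge_scores(marginals, n, floor=1e-10):
    """s(i, j) = log P_j(i | w): score of the edge from head i to child j,
    built from the per-token marginal head distributions (marginals[j-1]
    maps candidate head -> probability). Because log turns products into
    sums, a tree maximizing the sum of these edge scores also maximizes
    the product of the marginals."""
    scores = {}
    for j in range(1, n + 1):             # child token
        for i in range(n + 1):            # candidate head (0 = root)
            if i != j:
                p = max(marginals[j - 1].get(i, 0.0), floor)
                scores[(i, j)] = math.log(p)
    return scores
```

The resulting score table is exactly what an MST search (Eisner or Chu-Liu-Edmonds) consumes.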
The best projective parse tree is obtained using the Eisner algorithm (Eisner, 1996) with these scores, and the best non-projective one is obtained using the Chu-Liu-Edmonds (CLE) algorithm (McDonald et al., 2005b).
Although the factored score s(i, j) is used in this method to measure the likelihood of dependency trees, the score is calculated by taking the whole sentence into consideration using Gibbs sampling.
The parameters of the token-level model can be estimated from the training data by maximum a posteriori estimation with Gaussian priors. We define the following objective function M:

M = Σ_n Σ_t log q_M(h_{n,t} | w_n, t) − α Σ_{l=1..L} μ_l²,
where α is a hyperparameter of the Gaussian priors. The optimal parameters M̂ that maximize M can be obtained by quasi-Newton methods such as the L-BFGS algorithm, using the above M and its partial derivatives.
The parameters of the sentence-level model Λ = {λ_1, ..., λ_K} can also be estimated in a similar way, with the following objective function L, after the parameters of the token-level model have been estimated.
This objective function and its partial derivatives contain summations over all the possible configurations, which are difficult to calculate. We approximately calculate these values using static Monte Carlo (not MCMC) methods with fixed S samples {h_n^(1), ..., h_n^(S)} generated from Q_M(h | w_n) (see footnote 2).

Footnote 2: Static Monte Carlo methods become inefficient when the dimension of the probability distribution is high, and more sophisticated methods could be used for more accurate parameter estimation.
The token-level features used in the system are the same as those used in MSTParser version 0.4.2 (see footnote 3). The features include the lexical forms and (coarse and fine) POS tags of parent tokens, child tokens, their surrounding tokens, and the tokens between the child and the parent. The direction and the distance from a parent to its child, and the FEATS fields of the parent and the child, which are split into elements and then combined, are also included. Features that appeared fewer than 5 times in the training data are ignored.
Global features can capture any information in dependency trees. The following nine types of global features are used (in the following, a parent node means a head token, and a child node means a dependent token):
Child Unigram+Parent+Grandparent: This feature template is a 4-tuple consisting of (1) a child node, (2) its parent node, (3) the direction from the parent node to the child node, and (4) the grandparent node. Each node in the feature template is expanded to its lexical form and coarse POS tag in order to obtain actual features. Features that appeared in four or fewer sentences are ignored. The same procedure is applied to the other features below.
Child Bigram+Parent: This feature template is a 4-tuple consisting of (1) a child node, (2) its parent node, (3) the direction from the parent node to the child node, and (4) the nearest outer sibling node (the nearest sibling node on the opposite side of the parent node) of the child node. This feature template is almost the same as the one used by McDonald and Pereira (2006).
Child Bigram+Parent+Grandparent: This feature template is a 5-tuple. The first four elements (1)-(4) are the same as in the Child Bigram+Parent feature template, and the additional element (5) is the grandparent node.
Child Trigram+Parent: This feature template is a 5-tuple. The first four elements (1)-(4) are the same as in the Child Bigram+Parent feature template, and the additional element (5) is the next nearest outer sibling node of the child node.

Footnote 3: http://sourceforge.net/projects/mstparser
Parent+All Children: This feature template is a tuple with more than one element. The first element is a parent node, and the other elements are all of its child nodes.
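As an illustration, the Parent+All Children template can be instantiated from a head array as sketched below. The node strings and the toy sentence are assumptions, standing in for the lexical-form/coarse-POS expansion described above:

```python
from collections import defaultdict

def parent_all_children_features(heads, nodes):
    """Instantiate the Parent+All Children template. heads[j-1] is the head
    index of token j (0 = root); nodes[i] is the string used for token i,
    e.g. "form/CPOS" after the expansion step. Returns one tuple per
    parent: (parent, child_1, ..., child_m)."""
    children = defaultdict(list)
    for j, h in enumerate(heads, start=1):
        children[h].append(j)
    feats = []
    for parent, kids in children.items():
        if parent == 0:
            continue                      # skip the artificial root
        feats.append(tuple([nodes[parent]] + [nodes[k] for k in kids]))
    return feats

# toy sentence: a conjunction (token 2) heading tokens 1 and 3
nodes = {1: "cats/N", 2: "and/C", 3: "dogs/N"}
print(parent_all_children_features([2, 0, 2], nodes))  # -> [('and/C', 'cats/N', 'dogs/N')]
```

Features like this let the model see all conjuncts of a coordination at once, which a purely edge-factored model cannot.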
Parent+All Children+Grandparent: This feature template is a tuple with more than two elements. The elements other than the last one are the same as in the Parent+All Children feature template, and the last element is the grandparent node.
Child+Ancestor: This feature template is a 2-tuple consisting of (1) a child node and (2) one of its ancestor nodes.
Acyclic: This feature type has one of two values: true if the dependency tree is acyclic, and false otherwise.

Projective: This feature type has one of two values: true if the dependency tree is projective, and false otherwise.
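The two boolean feature types can be computed directly from a head array. A minimal sketch, assuming heads[t-1] holds the head of token t with 0 for the root:

```python
def is_acyclic(heads):
    """True if following head links from every token reaches the root (0)
    without revisiting any node, i.e. the dependency graph has no cycles."""
    for t in range(1, len(heads) + 1):
        seen, h = set(), t
        while h != 0:
            if h in seen:
                return False
            seen.add(h)
            h = heads[h - 1]
    return True

def is_projective(heads):
    """True if no two dependency arcs cross when drawn above the sentence
    (tokens numbered 1..n, with the artificial root at position 0)."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, (a, b) in enumerate(arcs):
        for c, d in arcs[i + 1:]:
            if a < c < b < d or c < a < d < b:
                return False
    return True

# [0, 4, 1, 1] is an acyclic tree, but the arc between tokens 2 and 4
# crosses the arc between tokens 1 and 3, so it is non-projective
```

Since the sampler explores arbitrary graphs in H(w), these two features let the model learn to prefer well-formed trees without hard constraints.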
3 Dependency Relation Labeling
Dependency relation labeling can be handled as a multi-class classification problem, and we use Support Vector Machines (SVMs), which have been successfully applied to many NLP tasks. Solving large-scale multi-class classification problems with SVMs requires substantial computational resources, so we use the revision learning method (Nakagawa et al., 2002).
The revision learning method combines a probabilistic model, which has a smaller computational cost, with a binary classifier, which has a higher generalization capacity. In this method, the latter classifier revises the output of the former model to conduct multi-class classification with higher accuracy and reasonable computational cost.
In this study, we use maximum entropy (ME) models as the probabilistic model and SVMs with second-order polynomial kernels as the binary classifier. The dependency label of each node is determined independently of the labeling of the other nodes.
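The revision step can be sketched as follows. The interfaces, the label set, and the scores are invented for illustration; in the actual system the binary classifiers are SVMs with second-order polynomial kernels:

```python
def revise(me_probs, svm_scores, k=3):
    """Revision-learning-style labeling: the cheap ME model ranks candidate
    labels by probability, and a per-label binary classifier accepts the
    first candidate it scores positive; if all k candidates are rejected,
    fall back to the ME model's best guess."""
    ranked = sorted(me_probs, key=me_probs.get, reverse=True)[:k]
    for label in ranked:
        if svm_scores[label] > 0:         # binary classifier judges the proposal correct
            return label
    return ranked[0]

# toy example: the ME model prefers "OBJ", but the classifier overrules it
me = {"SBJ": 0.30, "OBJ": 0.45, "ATT": 0.25}
svm = {"SBJ": 0.8, "OBJ": -0.2, "ATT": -0.5}
print(revise(me, svm))  # -> SBJ
```

Only the few top-ranked candidates ever reach the expensive classifier, which is where the computational saving comes from.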
Table 1: Results of Multilingual Dependency Parsing

Table 2: Unlabeled Attachment Scores in Different Settings (underlined values indicate submitted results, and bold values indicate the highest scores)
(…) and the child tokens of i (the j′-th tokens, where j′ ∈ {j′ | h_{j′} = i}) are used as features (see footnote 4). As the features for the ME models, a subset of them is used, since the ME models are used just for reducing the search space and do not need so many features.
4 Results and Analysis
In order to tune the system, we split each training data set into two parts, and used the first half for training and the remaining half for testing during development.
The CLE algorithm was used for Basque, Czech, Hungarian and Turkish, and the Eisner algorithm was used for the others. We used lemmas for Catalan, Czech, Greek and Italian, and word forms for all the others.
The values of the fixed parameters were chosen as R = 500, S = 200, α = 0.25, and α′ = 0.25. With these parameter settings, training took 247 hours, and testing took 343 minutes on an Opteron 250 processor.
Table 1 shows the evaluation results on the test sets. Accuracy was measured with the labeled attachment score (LAS) and the unlabeled attachment score (UAS).
Among the participating systems in the shared task, we obtained the second best average accuracy in the labeled attachment score, and the best average accuracy in the unlabeled attachment score. Compared with the other systems, the gap between our labeled and unlabeled scores is relatively large.
In this study, labeling of dependency relations was performed in a separate post-processing step, and each label was predicted independently. The labeled scores might be improved if the parsing process and the labeling process were performed at the same time and dependencies among labels were taken into account.
We conducted experiments with different settings. Table 2 shows the results measured with the unlabeled attachment score. In the table, Eisner and CLE indicate that the Eisner algorithm and the CLE algorithm, respectively, were used in decoding, and local and +global indicate that local features alone, or local and global features together, were used.

Footnote 4: Although the polynomial kernels of SVMs can implicitly handle combined features, some of the combined features were also included explicitly, because using unnecessarily high-order polynomial kernels decreases performance.
The CLE algorithm performed better than the Eisner algorithm for Basque, Czech, Hungarian, Italian and Turkish. All of these data sets except Italian contain a relatively large number of non-projective sentences (the percentage of sentences with at least one non-projective relation in the training data is over 20% (Nivre et al., 2007)), though the Greek data set, on which the Eisner algorithm performed better, also contains many non-projective sentences (20.3%).
By using the global features, the accuracy was improved in all cases except Turkish with the Eisner algorithm (Table 2). The increase was rather large for Chinese and Czech.
When the global features were used in these languages, the dependency accuracy for tokens whose heads had conjunctions as parts of speech was notably improved: from 80.5% to 86.0% in Chinese (Eisner), and from 73.2% to 77.6% in Czech (CLE).
We investigated the trained global models, and found that Parent+All Children features whose parents were conjunctions and whose children had compatible classes had large positive weights, while those whose children had incompatible classes had large negative weights. A feature with a larger weight is generally more influential.
Riedel and Clarke (2006) suggested using linguistic constraints such as "arguments of a coordination must have compatible word classes," and such constraints seem to be represented by the features in our models.
5 Conclusion
In this study, we applied a dependency parser using global features to multilingual dependency parsing. Evaluation results showed that the use of global features was effective for obtaining higher accuracy in multilingual dependency parsing. Improving dependency relation labeling is left for future work.
