Deterministic dependency parsers use parsing actions to construct dependencies. These parsers do not compute the probability of the whole dependency tree; they only determine parsing actions step by step with a trained classifier. To globally model the parsing actions of all steps taken on the input sentence, we propose two kinds of probabilistic parsing action models that can compute the probability of the whole dependency tree. The tree with the maximal probability is output. Experiments are carried out on 10 languages, and the results show that our probabilistic parsing action models outperform the original deterministic dependency parser.
1 Introduction
The target of the CoNLL 2007 shared task (Nivre et al., 2007) is to parse texts in multiple languages using a single dependency parser that has the capacity to learn from treebank data. Among the parsers participating in the CoNLL 2006 shared task (Buchholz et al., 2006), the deterministic dependency parser showed great time efficiency and comparable performance for multi-lingual dependency parsing (Nivre et al., 2006). A deterministic parser regards parsing as a sequence of parsing actions taken step by step on the input sentence. Parsing actions construct dependency relations between words. Unlike most state-of-the-art parsers, the deterministic dependency parser does not score the entire dependency tree; it only chooses the most probable parsing action at each step.
In this paper, to globally model the parsing actions of all steps taken on the input sentence, we propose two kinds of probabilistic parsing action models that can compute the entire dependency tree's probability. Experiments are evaluated on the diverse data sets of 10 languages provided by the CoNLL 2007 shared task (Nivre et al., 2007). Results show that our probabilistic parsing action models outperform the original deterministic dependency parser. We also present a general error analysis across a wide set of languages plus a detailed error analysis of Chinese. Next we briefly introduce the original deterministic dependency parsing algorithm, which is a basic component of our models.
2 Introduction of Deterministic Dependency Parsing
There are two main representative deterministic dependency parsing algorithms, proposed by Nivre (2003) and by Yamada and Matsumoto (2003), respectively. Here we briefly introduce Yamada and Matsumoto's algorithm, which is adopted by our models, to illustrate deterministic dependency parsing. Nivre's method, the other representative one, also parses sentences in a similar deterministic manner, except for different data structures and parsing actions. Yamada's method originally focuses on unlabeled dependency parsing.
Three kinds of parsing actions are applied to construct the dependency between two focus words. The two focus words are the root of the current subtree and the root of the succeeding (right) subtree, given the current parsing state. Every parsing step results in a new parsing state, which includes all elements of the current partially built tree. Features are extracted around these two focus words.
In the training phase, features and the corresponding parsing action compose the training data.

Figure 1. An example of the parsing process of Yamada and Matsumoto's method. The input sentence is "He provides confirming evidence."
In the testing phase, the classifier determines which parsing action should be taken based on the features. The parsing algorithm ends when no further dependency relations can be made on the whole sentence. The details of the three parsing actions are as follows:

LEFT: constructs the dependency that the right focus word depends on the left focus word.

RIGHT: constructs the dependency that the left focus word depends on the right focus word.

SHIFT: constructs no dependency; it just moves the parsing focus. That is, the new left focus word is the previous right focus word, whose succeeding subtree's root becomes the new right focus word.
These three actions and the parsing process are illustrated in Figure 1. Note that the focus words are shown in bold black boxes.
We extend the set of parsing actions to perform labeled dependency parsing: LEFT and RIGHT are concatenated with dependency labels, while SHIFT remains the same. For example, in Figure 1 the original action sequence "RIGHT -> SHIFT -> RIGHT -> LEFT" becomes "RIGHT-SBJ -> SHIFT -> RIGHT-NMOD -> LEFT-OBJ".
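As an illustrative sketch (not the authors' implementation), the action sequence above can be replayed on the example sentence of Figure 1 as follows; the `parse` helper and its focus-index bookkeeping are simplified stand-ins for the real parsing-state machinery:

```python
# Sketch of the three parsing actions on the sentence of Figure 1.
# Each entry of `roots` is the root word of a partial subtree.
def parse(words, actions):
    """Apply (action, label) pairs; return {dependent: (head, label)}."""
    roots = list(words)   # roots of the current partial subtrees
    i = 0                 # index of the left focus word
    deps = {}
    for action, label in actions:
        left, right = roots[i], roots[i + 1]
        if action == "LEFT":      # right focus word depends on left focus word
            deps[right] = (left, label)
            del roots[i + 1]
        elif action == "RIGHT":   # left focus word depends on right focus word
            deps[left] = (right, label)
            del roots[i]
        else:                     # SHIFT: move the parsing focus one subtree right
            i += 1
        if i >= len(roots) - 1:   # simplified: refocus when we run off the end
            i = max(0, len(roots) - 2)
    return deps

seq = [("RIGHT", "SBJ"), ("SHIFT", None), ("RIGHT", "NMOD"), ("LEFT", "OBJ")]
print(parse(["He", "provides", "confirming", "evidence"], seq))
```

After the four actions, only "provides" remains as a subtree root, i.e. it is the root of the completed tree.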
3 Probabilistic Parsing Action Models
Deterministic dependency parsing algorithms are greedy. They choose the most probable parsing action at every parsing step given the current parsing state, and do not score the entire dependency tree. To compute the probability of the whole dependency tree, we propose two kinds of probabilistic models defined on parsing actions: the parsing action chain model (PACM) and the parsing action phrase model (PAPM). The parsing process can be viewed as a Markov chain.
At every parsing step, there are several candidate parsing actions. The objective of this model is to find the most probable sequence of parsing actions under the Markov assumption. As shown in Figure 1, the action sequence "RIGHT-SBJ -> SHIFT -> RIGHT-NMOD -> LEFT-OBJ" constructs the correct dependency tree of the example sentence. Choosing this action sequence among all candidate sequences is the objective of this model.
The probability of the whole dependency tree is defined as

    P(T | S) = P(d_0 d_1 ... d_n | S)    (1)

where T denotes the dependency tree, S denotes the original input sentence, and d_i denotes the parsing action at time step i. We add an artificial parsing action d_0 as the initial action. We introduce a variable context_{d_i} to denote the resulting parsing state when action d_i is taken on context_{d_{i-1}}; context_{d_0} is the original input sentence. Suppose d_0 ... d_n are taken sequentially on the input sentence S and result in a sequence of parsing states context_{d_0} ... context_{d_n}. Then P(T | S), defined in equation (1), becomes:

    P(T | S) = Π_{i=1..n} P(d_i | d_0 ... d_{i-1}, S)    (2)
             = Π_{i=1..n} P(d_i | context_{d_{i-1}})     (3)
Formula (3) comes from formula (2) by the Markov assumption. Note that the factor

    P(d_i | context_{d_{i-1}})    (4)

concerns the classifier of parsing actions: it denotes the probability of the parsing action d_i given the parsing state context_{d_{i-1}}. If we train a classifier that can predict with probability output, we can compute P(T | S) as the product of the probabilities of the parsing actions.
The classifier we use throughout this paper is the SVM (Vapnik, 1995). We adopt Libsvm (Chang and Lin, 2005), which can train multi-class classifiers and supports training and predicting with probability output.
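A minimal sketch of how the tree probability is accumulated from classifier outputs; the per-action probabilities below are made-up stand-ins for the classifier's estimates, and log-space is used to avoid underflow on long action sequences:

```python
import math

# Sketch: the PACM probability of a tree is the product of the classifier's
# probabilities P(d_i | context_{d_{i-1}}) for the actions that build it.
def tree_log_prob(action_probs):
    """Sum of log-probabilities, equivalent to the log of the product."""
    return sum(math.log(p) for p in action_probs)

# Made-up classifier outputs for a four-action sequence:
probs = [0.9, 0.8, 0.7, 0.6]
print(math.exp(tree_log_prob(probs)))  # ~0.3024 = 0.9 * 0.8 * 0.7 * 0.6
```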
For this model, the objective is to choose the parsing action sequence that constructs the dependency tree with the maximal probability. Because this model chooses the most probable sequence, rather than the most probable parsing action at a single step, it avoids the greediness of the original deterministic parsers.
We use beam search to decode this model, with m denoting the beam size. Beam search is carried out as follows. At every parsing step, all parsing states are ordered (or partially ordered) according to their probabilities. The probability of a parsing state is the product of the probabilities of the actions that generated it. We then choose the m best parsing states for this step, and the next parsing step considers only these m best states. Parsing terminates when the first entire dependency tree is constructed.
To obtain a list of n-best parses, we simply continue parsing until either n trees are found or no further parsing can be performed.
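The beam-search decoding described above can be sketched generically as follows; this is a simplified illustration over a toy state space, not the actual parser states or classifier:

```python
import heapq

# Generic beam search sketch: keep the m most probable states at each step;
# return a completed state as soon as the beam contains one.
def beam_search(init_state, expand, is_final, m):
    """expand(state) -> list of (action_prob, next_state) pairs."""
    beam = [(1.0, init_state)]
    while True:
        finals = [(p, s) for p, s in beam if is_final(s)]
        if finals:                                   # first completed analysis wins
            return max(finals, key=lambda x: x[0])[1]
        # multiply in each candidate action's probability, then prune to m states
        candidates = [(p * q, s2) for p, s in beam for q, s2 in expand(s)]
        beam = heapq.nlargest(m, candidates, key=lambda x: x[0])

# Toy example: states are strings of "actions"; each step appends one symbol.
steps = {"":  [(0.6, "a"), (0.4, "b")],
         "a": [(0.2, "ax"), (0.8, "ay")],
         "b": [(0.9, "bx"), (0.1, "by")]}
best = beam_search("", lambda s: steps[s], lambda s: len(s) == 2, m=2)
print(best)  # "ay" (0.6 * 0.8 = 0.48 beats 0.4 * 0.9 = 0.36)
```

Note that the greedy choice at step one ("a", 0.6) happens to lead to the global optimum here; the beam also keeps "b" alive in case its continuations score higher.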
In the parsing action chain model (PACM), actions compete at every parsing step: only the m best parsing states resulting from the corresponding actions are kept at each step. For the parsing problem, however, it is more reasonable for actions to compete over which phrase should be built next.
In dependency syntax, a phrase consists of a head word and all its children. Based on this motivation, we propose the parsing action phrase model (PAPM), which divides parsing actions into two classes: constructing actions and shifting actions.
If a phrase is built after an action is performed, the action is called a constructing action. In the original Yamada's algorithm, the constructing actions are LEFT and RIGHT. For example, if LEFT is taken, it indicates that the right focus word has found all its children and becomes the head of this new phrase.
Note that a word with no children can also be viewed as a phrase once its dependency on another word is constructed.
In the extended set of parsing actions for labeled parsing, the compound actions, which consist of LEFT and RIGHT concatenated with dependency labels, are constructing actions. If no phrase is built after an action is performed, the action is called a shifting action. The only such action is SHIFT.
We denote a constructing action by a_j and a shifting action by b_j, where j indexes the time step. We then introduce a new concept: the parsing action phrase. We use A_i to denote the i-th parsing action phrase, which is the sequence of parsing actions that constructs the next syntactic phrase.
For the example in Figure 1, A_1 consists of a constructing action, A_2 consists of a shifting action and a constructing action, and A_3 consists of a constructing action. The indexes differ on the two sides of the expansion

    A_i = b_{j-k} ... b_{j-1} a_j

where A_i is the i-th parsing action phrase, corresponding to the constructing action a_j at time step j together with all its preceding shifting actions.
Note that on the right side of the expansion, exactly one constructing action is allowed and it always occupies the last position, while shifting actions can occur several times or not at all.
It is parsing action phrases, i.e., sequences of parsing actions, that compete over which phrase should be built next.
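For illustration, the labeled action sequence of Figure 1 segments into parsing action phrases as follows; this is a sketch, with phrase boundaries placed after each constructing action:

```python
# Sketch: split an action sequence into parsing action phrases.  Each phrase is
# zero or more shifting actions (SHIFT) followed by exactly one constructing
# action (a labeled LEFT-* or RIGHT-*).
def to_phrases(actions):
    phrases, current = [], []
    for act in actions:
        current.append(act)
        if act != "SHIFT":          # a constructing action closes the phrase
            phrases.append(current)
            current = []
    return phrases

seq = ["RIGHT-SBJ", "SHIFT", "RIGHT-NMOD", "LEFT-OBJ"]
print(to_phrases(seq))
# [['RIGHT-SBJ'], ['SHIFT', 'RIGHT-NMOD'], ['LEFT-OBJ']]
```

The three resulting phrases correspond to A_1, A_2 and A_3 in the text.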
The probability of the dependency tree given the input sentence is redefined as:

    P(T | S) = Π_i P(A_i | context_{A_{i-1}})

where k represents the number of steps for which shifting actions are taken, and context_{A_i} is the parsing state resulting from the action sequence b_{j-k} ... b_{j-1} a_j taken on context_{A_{i-1}}.
As in the parsing action chain model (PACM), we use beam search to decode the parsing action phrase model (PAPM). The difference is that PAPM does not keep the m best parsing states at every parsing step. Instead, PAPM keeps the m best states corresponding to the m best current parsing action phrases (several steps of SHIFT followed by a final constructing action).
4 Experiments and Results
percentage of non-projective relations, which are 0.0%, 0.1% and 0.3%, respectively.
Except for these three languages, we use the projectivization/deprojectivization software provided by Nivre and Nilsson (2005) for the other languages.
Because our algorithm deals only with projective parsing, we must first projectivize the training data in preparation for training. During testing, deprojectivization is applied to the output of the parser.
For the Libsvm classifier (Chang and Lin, 2005), features are extracted from the following fields of the data representation: FORM, LEMMA, CPOSTAG, POSTAG, FEATS and DEPREL.
We split the values of the FEATS field into their atomic components, and we use only the features of the DEPREL field that are available during deterministic parsing.
We use a feature context window similar to that of Yamada's algorithm (Yamada and Matsumoto, 2003). In detail, the feature context window has size six: the left two subtrees, the subtrees of the two focus words, and the right two subtrees. This feature template is used for all 10 languages.
After submitting the testing results of the parsing action chain model (PACM), we also ran the original deterministic parsing proposed by Yamada and Matsumoto (2003). The full results are shown in Table 1.
The experimental results are mainly evaluated by labeled attachment score (LAS), unlabeled attachment score (UAS) and label accuracy (LA).
Table 1 shows that the parsing action chain model (PACM) outperforms the original Yamada's parsing method for all languages. The LAS improvements range from 0.60 to 1.71 percentage points. Note that the original Yamada's method still gives testing results above the officially reported average performance over all languages.
Table 1. The performances of Yamada's method (Yam) and the parsing action chain model (PACM).
Not all languages restrict a sentence to a single root node. Since the parsing action phrase model (PAPM) only builds dependencies, and a shifting action cannot end a parsing action phrase, PAPM always ends with a single root word. This property makes PAPM suitable only for Catalan, Chinese, English and Hungarian, which are unary-root languages.
The PAPM result for Catalan was not submitted before the deadline due to a shortage of time and computing resources. We report Catalan's PAPM result together with those of the other three languages in Table 2.
Table 2. The performance of the parsing action phrase model (PAPM) for Catalan, Chinese, English and Hungarian.
Compared with the PACM results shown in Table 1, the performance of PAPM differs across languages. For Catalan and English, PAPM improves over PACM by 2.31% and 0.86% respectively, while the improvement for Chinese is marginal, and there is a slight decrease for Hungarian.
Hungarian has a relatively high percentage of non-projective relations. If a phrase consists of a head word and its non-projective children, the constructing actions, which are the main actions in PAPM, are very difficult to learn, because some non-projective children never appear as focus words simultaneously with their heads.
Although projectivization is also performed for Hungarian, the built-in non-projective property still negatively influences the performance.
5 Error Analysis
In the following we provide a general error analysis across a wide set of languages plus a detailed analysis of Chinese.
5.1 General Error Analysis
One of the main difficulties in dependency parsing is the determination of long distance dependencies.
Although all kinds of evaluation scores differ dramatically among languages, from 69.91% to 85.83% in LAS, there are some general observations reflecting the difficulty of long distance dependency parsing.
We study this difficulty from two aspects of our full PACM submission: the precision of dependencies of different arc lengths, and the precision of root nodes.
For arcs of length 1, all languages give high performance, from the lowest, 91.62% for Czech (Bohmova et al., 2003), to the highest, 96.8% for Catalan (Marti et al., 2007).
As arc lengths grow, performance degrades to various degrees. For Catalan, the score at arc length 2 is similar to that at arc length 1, but there are dramatic degradations for longer arcs, from 94.94% at arc length 2 to 85.22% at lengths 3-6.
For English (Johansson and Nugues, 2007) and Italian (Montemagni et al., 2003), there is graceful degradation over arcs of lengths 1, 2 and 3-6: 96-91-85 for English and 95-85-75 for Italian.
For the other languages, long arcs also show remarkable degradations that pull down overall performance.
The precision of root nodes also reflects the performance on long arc dependencies, because the arcs between the root and its children are often long. In fact, it is the precision of roots and of arcs longer than 7 that mainly pulls down the overall performance.
Yamada's method is a bottom-up parsing algorithm that builds short distance dependencies first. The difficulty of building long arc dependencies may partially result from errors in short distance dependencies. The deterministic manner causes error propagation, which indirectly indicates that root errors are the final result of error propagation from short distance dependencies.
But there is an exception in Chinese: the root precision is 90.48%, second only to the precision of arcs of length 1. This phenomenon exists because the sentences in the Chinese data set (Chen et al., 2003) are in fact clauses, with an average length of 5.9, rather than entire sentences; the root words are heads of clauses.
2003) and Turkish (Oflazer et al., 2003), the improvement in root precision is small, but dependencies with arcs longer than 1 give better scores.
For PAPM, the good performances of Catalan and English also come with significant improvements in root precision over PACM: for Catalan, root precision improves from 63.86% to 95.21%; for English, from 62.03% to 89.25%.
5.2 Error Analysis of Chinese
There are two main sources of LAS errors in Chinese dependency parsing. One is conjunction words (C), which have a relatively high percentage of wrong heads (about 20%), and consequently about 19% wrong dependency labels.
In Chinese, conjunction words often concatenate clauses, so long distance dependencies between clauses are bridged by conjunction words. It is difficult for conjunction words to find their heads.
The other source of errors comes from auxiliary words (DE) and preposition words (P). Unlike conjunction words, auxiliary and preposition words find the right head with high precision, but their label accuracy (LA) decreases significantly.
The reason may lie in the large dependency label set, which consists of 57 kinds of dependency labels in Chinese. Moreover, auxiliary words (DE) and preposition words (P) have more possible dependency labels than other coarse POS categories do, which introduces ambiguity for parsers.
The most common POS categories, including nouns and verbs, contribute much to the overall performance of 83% labeled attachment score (LAS). Adverbs obtain the top score, while adjectives give the worst.
6 Conclusion
We propose two kinds of probabilistic models defined on parsing actions to compute the probability of the entire dependency tree. Compared with the original Yamada and Matsumoto's deterministic dependency method, which chooses the most probable parsing action step by step, the two probabilistic models improve performance on all 10 languages in the CoNLL 2007 shared task.
Through the study of the parsing results, we find that long distance dependencies are hard to determine in all 10 languages. Further analysis of this difficulty is needed to guide the research direction. Feature exploration is also necessary to provide more informative features for these hard problems.
Acknowledgements

This work was supported by the Hi-tech Research and Development Program of China under grant No. 2006AA01Z144, the Natural Sciences Foundation of China under grant No. 60673042, and the Natural Science Foundation of Beijing under grants No. 4052027 and 4073043.
