We
present
a
data-driven
variant
of
the
LR
algorithm
for
dependency
parsing
,
and
extend
it
with
a
best-first
search
for
probabilistic
generalized
LR
dependency
parsing
.
Parser
actions
are
determined
by
a
classifier
,
based
on
features
that
represent
the
current
state
of
the
parser
.
We
apply
this
parsing
framework
to
both
tracks
of
the
CoNLL
2007
shared
task
,
in
each
case
taking
advantage
of
multiple
models
trained
with
different
learners
.
In
the
multilingual
track
,
we
train
three
LR
models
for
each
of
the
ten
languages
,
and
combine
the
analyses
obtained
with
each
individual
model
with
a
maximum
spanning
tree
voting
scheme
.
In
the
domain
adaptation
track
,
we
use
two
models
to
parse
unlabeled
data
in
the
target
domain
to
supplement
the
labeled
out-of-domain
training
set
,
in
a
scheme
similar
to
one
iteration
of
co-training
.
1
Introduction
There
are
now
several
approaches
for
multilingual
dependency
parsing
,
as
demonstrated
in
the
CoNLL
2006
shared
task
(
Buchholz
and
Marsi
,
2006
)
.
The
dependency
parsing
approach
presented
here
extends
the
existing
body
of
work
mainly
in
four
ways
:
• Although stepwise dependency parsing [1] has commonly been performed using parsing algorithms designed specifically for this task, such as those described by Nivre (2003) and Yamada and Matsumoto (2003), we show that this can also be done using the well-known LR parsing algorithm (Knuth, 1965), providing a connection between current research on shift-reduce dependency parsing and previous parsing work using LR and GLR models;
• We generalize the standard deterministic stepwise framework to probabilistic parsing, with the use of a best-first search strategy similar to the one employed in constituent parsing by Ratnaparkhi (1997) and later by Sagae and Lavie;
• We provide additional evidence that the parser ensemble approach proposed by Sagae and Lavie (2006a) can be used to improve parsing accuracy, even when only a single parsing algorithm is used, as long as variation can be obtained, for example, by using different learning techniques or changing parsing direction from forward to backward (of course, even greater gains may be achieved when different algorithms are used, although this is not pursued here); and, finally,
• We present a straightforward way to perform parser domain adaptation using unlabeled data in the target domain.

[1] Stepwise parsing considers each step in a parsing algorithm separately, while all-pairs parsing considers entire trees. For a more complete definition, see the CoNLL-X shared task description paper (Buchholz and Marsi, 2006).

We entered a system based on the approach described in this paper in the CoNLL 2007 shared task
CHILDES
database
(
MacWhinney
,
2000
;
Brown
,
1973
)
.
Our
system
's
accuracy
was
the
highest
in
the
domain
adaptation
track
(
with
labeled
attachment
score
of
81.06
%
)
,
and
only
0.43
%
below
the
top
scoring
system
in
the
multilingual
parsing
track
(
our
average
labeled
attachment
score
over
the
ten
languages
was
79.89
%
)
.
We
first
describe
our
approach
to
multilingual
dependency
parsing
,
followed
by
our
approach
for
domain
adaptation
.
We
then
provide
an
analysis
of
the
results
obtained
with
our
system
,
and
discuss
possible
improvements
.
2
A
Probabilistic
LR
Approach
for
Dependency
Parsing
Our
overall
parsing
approach
uses
a
best-first
probabilistic
shift-reduce
algorithm
based
on
the
LR
algorithm
(
Knuth
,
1965
)
.
As
such
,
it
follows
a
bottom-up
strategy
,
or
bottom-up-trees
,
as
defined
in
Buchholz
and
Marsi
(
2006
)
,
in
contrast
to
the
shift-reduce
dependency
parsing
algorithm
described
by
Nivre
(
2003
)
,
which
is
a
bottom-up
/
top-down
hybrid
,
or
bottom-up-spans
.
It
is
unclear
whether
the
use
of
a
bottom-up-trees
algorithm
has
any
advantage
over
the
use
of
a
bottom-up-spans
algorithm
(
or
vice-versa
)
in
practice
,
but
the
availability
of
different
algorithms
that
perform
the
same
parsing
task
could
be
advantageous
in
parser
ensembles
.
The
main
difference
between
our
parser
and
a
traditional
LR
parser
is
that
we
do
not
use
an
LR
table
derived
from
an
explicit
grammar
to
determine
shift
/
reduce
actions
.
Instead
,
we
use
a
classifier
with
features
derived
from
much
of
the
same
information
contained
in
an
LR
table
:
the
top
few
items
on
the
stack
,
and
the
next
few
items
of
lookahead
in
the
remaining
input
string
.
Additionally
,
following
Sagae
and
Lavie
(
2006
)
,
we
extend
the
basic
deterministic
LR
algorithm
with
a
best-first
search
,
which
results
in
a
parsing
strategy
similar
to
generalized
LR
parsing
(
Tomita
,
1987
;
1990
)
,
except
that
we
do
not
perform
Tomita
's
stack-merging
operations
.
The
resulting
algorithm
is
projective
,
and
non-projectivity
is
handled
by
pseudo-projective
transformations
as
described
in
(
Nivre
and
Nilsson
,
2005
)
.
We
use
Nivre
and
Nilsson
's
PATH
scheme [2]
.
For
clarity
,
we
first
describe
the
basic
variant
of
the
LR
algorithm
for
dependency
parsing
,
which
is
a
deterministic
stepwise
algorithm
.
We
then
show
how
we
extend
the
deterministic
parser
into
a
best-first
probabilistic
parser
.
2.1
Dependency
Parsing
with
a
Data-Driven
Variant
of
the
LR
Algorithm
The
two
main
data
structures
in
the
algorithm
are
a
stack
S
and
a
queue
Q.
S
holds
subtrees
of
the
final
dependency
tree
for
an
input
sentence
,
and
Q
holds
the
words
in
an
input
sentence
.
S
is
initialized
to
be
empty
,
and
Q
is
initialized
to
hold
every
word
in
the
input
in
order
,
so
that
the
first
word
in
the
input
is
in
the
front
of
the
queue [3].
The
parser
performs
two
main
types
of
actions
:
shift
and
reduce
.
When
a
shift
action
is
taken
,
a
word
is
shifted
from
the
front
of
Q
,
and
placed
on
the
top
of
S
(
as
a
tree
containing
only
one
node
,
the
word
itself
)
.
When a reduce action is taken, the two top items in S (s1 and s2) are popped, and a new item is pushed onto S. This new item is a tree formed by making the root of s1 a dependent of the root of s2, or the root of s2 a dependent of the root of s1. Depending on which of these two cases occurs, we call the action reduce-left or reduce-right, according to whether the head of the new tree is to the left or to the right of its new dependent. In addition to deciding the direction of a reduce action, the label of the newly formed dependency arc must also be decided.
Parsing terminates successfully when Q is empty (all words in the input have been processed) and S contains only a single tree (the final dependency tree for the input sentence). If Q is empty, S contains two or more items, and no further reduce actions can be taken, parsing terminates and the input is rejected. In such cases, the remaining items in S contain partial analyses for contiguous segments of the input.

[2] The PATH scheme was chosen (even though Nivre and Nilsson report slightly better results with the HEAD scheme) because it does not result in a potentially quadratic increase in the number of dependency label types, as observed with the HEAD and HEAD+PATH schemes. Unfortunately, experiments comparing the use of the different pseudo-projectivity schemes were not performed due to time constraints.
[3] We prepend a "virtual root" word to every sentence, which is used as the head of every word in the dependency structure that does not have a head in the sentence.
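The shift and reduce actions described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Tree`, `parse`, `arcs` and the `choose_action` callback (a stand-in for the classifier) are hypothetical and do not come from the system described here.

```python
from dataclasses import dataclass, field

ROOT = 0  # index of the virtual root word placed before the sentence


@dataclass
class Tree:
    index: int                                    # position of this subtree's root word
    word: str
    children: list = field(default_factory=list)  # (label, Tree) pairs


def parse(words, choose_action):
    """Deterministic shift-reduce parsing with a stack S and a queue Q.

    choose_action(stack, queue) is a hypothetical stand-in for the classifier.
    It returns ("shift", None), ("reduce-left", label) or
    ("reduce-right", label), or None when no action applies.
    """
    queue = [Tree(ROOT, "<root>")] + [Tree(i, w) for i, w in enumerate(words, 1)]
    stack = []
    # Success: Q is empty and S holds a single tree.
    while not (not queue and len(stack) == 1):
        action = choose_action(stack, queue)
        kind, label = action if action else (None, None)
        if kind == "shift" and queue:
            stack.append(queue.pop(0))
        elif kind in ("reduce-left", "reduce-right") and len(stack) >= 2:
            s1 = stack.pop()  # top of the stack (further right in the input)
            s2 = stack.pop()  # item below it (further left in the input)
            # reduce-left: the head is the left item; reduce-right: the right one.
            head, dep = (s2, s1) if kind == "reduce-left" else (s1, s2)
            head.children.append((label, dep))
            stack.append(head)
        else:
            return None  # no valid action: the input is rejected
    return stack[0]


def arcs(tree):
    """Flatten a tree into {dependent index: (head index, label)}."""
    out = {}
    for label, child in tree.children:
        out[child.index] = (tree.index, label)
        out.update(arcs(child))
    return out
```

For a two-word sentence, three shifts followed by a reduce-right and a reduce-left attach the first word to the second and the second word to the virtual root.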
2.2
A
Probabilistic
LR
Model
for
Dependency
Parsing
In
the
traditional
LR
algorithm
,
parser
states
are
placed
onto
the
stack
,
and
an
LR
table
is
consulted
to
determine
the
next
parser
action
.
In
our
case
,
the
parser
state
is
encoded
as
a
set
of
features
derived
from
the
contents
of
the
stack
S
and
queue
Q
,
and
the
next
parser
action
is
determined
according
to
that
set
of
features
.
In
the
deterministic
case
described
above
,
the
procedure
used
for
determining
parser
actions
(
a
classifier
,
in
our
case
)
returns
a
single
action
.
If
,
instead
,
this
procedure
returns
a
list
of
several
possible
actions
with
corresponding
probabilities
,
we
can
then
parse
with
a
model
similar
to
the
probabilistic
LR
models
described
by
Briscoe
and
Carroll
(
1993
)
,
where
the
probability
of
a
parse
tree
is
the
product
of
the
probabilities
of
each
of
the
actions
taken
in
its
derivation
.
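Written out, the parse probability just described takes the following form. The symbols are our own notation, introduced only for illustration: a_1 ... a_k are the actions in the derivation, and f(T_{i-1}) stands for the features of the parser state in which action a_i is chosen.

```latex
P(\text{tree}) = \prod_{i=1}^{k} P\bigl(a_i \mid f(T_{i-1})\bigr)
```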
To
find
the
most
probable
parse
tree
according
to
the
probabilistic
LR
model
,
we
use
a
best-first
strategy
.
This
involves
an
extension
of
the
deterministic
shift-reduce algorithm
into
a
best-first
shift-reduce
algorithm
.
To
describe
this
extension
,
we
first
introduce
a
new
data
structure
Ti
that
represents
a
parser
state
,
which
includes
a
stack
Si
,
a
queue
Qi
,
and
a
probability
Pi
.
The
deterministic
algorithm
is
a
special
case
of
the
probabilistic
algorithm
where
we
have
a
single
parser
state
T0
that
contains
S0
and
Q0
,
and
the
probability
of
the
parser
state
is
1
.
The
best-first
algorithm
,
on
the
other
hand
,
keeps
a
heap
H
containing
multiple
parser
states
T0 ... Tm.
These
states
are
ordered
in
the
heap
according
to
their
probabilities
,
which
are
determined
by
multiplying
the
probabilities
of
each
of
the
parser
actions
that
resulted
in
that
parser
state
.
The
heap
H
is
initialized
to
contain
a
single
parser
state
T0
,
which
contains
a
stack
S0
,
a
queue
Q0
and
probability
P0
=
1.0
.
S0
and
Q0
are
initialized
in
the
same
way
as
S
and
Q
in
the
deterministic
algorithm
.
The
best-first
algorithm
then
loops
while
H
is
non-empty
.
At
each
iteration
,
first
a
state
Tcurrent
is
popped
from
the
top
of
H.
If
Tcurrent
corresponds
to
a
final
state
(
Qcurrent
is
empty
and
Scurrent
contains
a
single
item
)
,
we
return
the
single
item
in
Scurrent
as
the
dependency
structure
corresponding
to
the
input
sentence
.
Otherwise
,
we
get
a
list
of
parser
actions
act0 ... actn (with associated probabilities Pact0 ... Pactn)
corresponding
to
state
Tcurrent
.
For
each
of
these
parser
actions
actj
,
we
create
a
new
parser
state
Tnew
by
applying
actj
to
Tcurrent
,
and
set the probability of Tnew to be Pnew = Pcurrent * Pactj
.
Then
,
Tnew
is
inserted
into
the
heap
H.
Once
new
states
have
been
inserted
into
H
for
each
of
the
n
parser
actions
,
we
move
on
to
the
next
iteration
of
the
algorithm
.
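The best-first loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the state layout (a tuple of stack root indices, the index of the next queue item, and a tuple of arcs), the `action_probs` callback standing in for the classifier, and the rule that the virtual root may never become a dependent are all assumptions made for the example. Python's `heapq` is a min-heap, so probabilities are stored negated.

```python
import heapq
import itertools

ROOT = 0  # virtual root; the words are numbered 1..n_words


def apply_action(kind, label, stack, nxt, arcs, n_words):
    """Return the successor (stack, nxt, arcs), or None if the action is invalid."""
    if kind == "shift":
        if nxt > n_words:  # queue is empty
            return None
        return stack + (nxt,), nxt + 1, arcs
    if len(stack) < 2:
        return None
    s2, s1 = stack[-2], stack[-1]
    head, dep = (s2, s1) if kind == "reduce-left" else (s1, s2)
    if dep == ROOT:  # assumption: the virtual root never gets a head
        return None
    return stack[:-2] + (head,), nxt, arcs + ((head, dep, label),)


def best_first_parse(n_words, action_probs):
    """action_probs(stack, nxt, arcs) -> list of (kind, label, probability)."""
    tie = itertools.count()  # tie-breaker so heap entries never compare states
    heap = [(-1.0, next(tie), (), 0, ())]  # initial state T0 with P0 = 1.0
    while heap:
        neg_p, _, stack, nxt, arcs = heapq.heappop(heap)
        if nxt > n_words and len(stack) == 1:  # final state
            return -neg_p, {dep: head for head, dep, _ in arcs}
        for kind, label, p_act in action_probs(stack, nxt, arcs):
            new = apply_action(kind, label, stack, nxt, arcs, n_words)
            if new is not None:
                # Pnew = Pcurrent * Pact (kept negated for the min-heap)
                heapq.heappush(heap, (neg_p * p_act, next(tie), *new))
    return None  # heap exhausted without reaching a final state
```

Because each action probability is at most 1, a state's probability never exceeds that of the state it was derived from, so the first final state popped from the heap is the most probable one under the model.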
3
Multilingual
Parsing
Experiments
For
each
of
the
ten
languages
for
which
training
data
was
provided
in
the
multilingual
track
of
the
CoNLL
2007
shared
task
,
we
trained
three
LR
models
as
follows
.
The
first
LR
model
for
each
language
uses
maximum
entropy
classification
(
Berger
et
al.
,
1996
)
to
determine
possible
parser
actions
and
their
probabilities [4]
.
To
control
overfitting
in
the
MaxEnt
models
,
we
used
box-type
inequality
constraints
(
Kazama
and
Tsujii
,
2003
)
.
The
second
LR
model
for
each
language
also
uses
MaxEnt
classification
,
but
parsing
is
performed
backwards
,
which
is
accomplished
simply
by
reversing
the
input
string
before
parsing
starts
.
Sagae
and
Lavie
(
2006a
)
and
Zeman
and
Zabokrtsky
(
2005
)
have
observed
that
reversing
the
direction
of
stepwise
parsers
can
be
beneficial
in
parser
combinations
.
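Backward parsing by input reversal needs one extra step that is easy to overlook: the head indices produced for the reversed string must be mapped back to the original word positions. A minimal sketch, assuming a hypothetical `parse_fn` that returns a `{dependent: head}` dictionary with 1-based word positions and 0 for the root:

```python
def parse_backward(words, parse_fn):
    """Parse the reversed sentence and map positions back to the original order.

    parse_fn(words) is a hypothetical parser returning {dependent: head},
    with words numbered from 1 and 0 denoting the root.
    """
    n = len(words)
    reversed_heads = parse_fn(list(reversed(words)))

    def flip(i):
        # Position i in the reversed sentence is position n + 1 - i in the
        # original sentence; the root index 0 is left unchanged.
        return 0 if i == 0 else n + 1 - i

    return {flip(dep): flip(head) for dep, head in reversed_heads.items()}
```

With this mapping, the output of a backward model can be compared with, or combined with, the output of a forward model arc by arc.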
The third model uses support vector machines [5] (Vapnik, 1995) with a polynomial kernel of degree 2. Probabilities were estimated for SVM outputs using the method described in (Platt, 1999), but accuracy improvements were not observed during development when these estimated probabilities were used instead of simply the single best action given by the classifier (with probability 1.0), so in practice the SVM parsing models we used were deterministic.

[4] Implementation by Yoshimasa Tsuruoka, available at http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/maxent/
[5] Implementation by Taku Kudo, available at http://chasen.org/~taku/software/TinySVM/; all-vs-all was used for multi-class classification.
At
test
time
,
each
input
sentence
is
parsed
using
each
of
the
three
LR
models
,
and
the
three
resulting
dependency
structures
are
combined
according
to
the
maximum-spanning-tree
parser
combination
scheme [6]
(
Sagae
and
Lavie
,
2006a
)
where
each
dependency
proposed
by
each
of
the
models
has
the
same
weight
(
it
is
possible
that
one
of
the
more
sophisticated
weighting
schemes
proposed
by
Sagae
and
Lavie
may
be
more
effective
,
but
these
were
not
attempted
)
.
The
combined
dependency
tree
is
the
final
analysis
for
the
input
sentence
.
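The equal-weight voting scheme can be sketched as follows: each dependency proposed by a model adds one vote to the corresponding head-dependent arc, and the combined analysis is the maximum spanning tree over the voted arcs. The sketch below uses the Chu-Liu/Edmonds algorithm to find that tree; this choice, and the names `combine`, `max_spanning_tree` and `find_cycle`, are assumptions for illustration rather than a description of the system's code. Parses are `{dependent: head}` dictionaries with integer positions and 0 for the root.

```python
from collections import Counter


def find_cycle(head):
    """Return the set of nodes on a cycle in a {node: head} map, or None."""
    for start in head:
        path, on_path, v = [], set(), start
        while v in head and v not in on_path:
            on_path.add(v)
            path.append(v)
            v = head[v]
        if v in on_path:
            return set(path[path.index(v):])
    return None


def max_spanning_tree(scores, root=0):
    """Chu-Liu/Edmonds over {(head, dep): weight} with integer node ids.

    Assumes every non-root node is reachable from the root, which holds
    when each vote comes from a complete dependency tree.
    """
    nodes = {h for h, _ in scores} | {d for _, d in scores}
    # Greedy step: pick the best incoming arc for every non-root node.
    best = {d: max((w, h) for (h, dd), w in scores.items() if dd == d and h != d)[1]
            for d in nodes - {root}}
    cycle = find_cycle(best)
    if cycle is None:
        return best
    c = max(nodes) + 1  # fresh id for the contracted cycle
    new_scores, origin = {}, {}
    for (h, d), w in scores.items():
        if h in cycle and d in cycle:
            continue
        if d in cycle:    # arc entering the cycle: score relative to the arc it replaces
            key, adj = (h, c), w - scores[(best[d], d)]
        elif h in cycle:  # arc leaving the cycle
            key, adj = (c, d), w
        else:
            key, adj = (h, d), w
        if key not in new_scores or adj > new_scores[key]:
            new_scores[key], origin[key] = adj, (h, d)
    contracted = max_spanning_tree(new_scores, root)
    result = {}
    for d, h in contracted.items():
        orig_head, orig_dep = origin[(h, d)]
        result[orig_dep] = orig_head
    for d in cycle:  # keep the cycle arcs except the one that was broken
        result.setdefault(d, best[d])
    return result


def combine(parses, root=0):
    """Equal-weight voting: one vote per dependency proposed by each model."""
    votes = Counter((h, d) for heads in parses for d, h in heads.items())
    return max_spanning_tree(dict(votes), root)
```

Per-model or per-label weights could replace the unit votes in `combine` without changing the tree search.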
Although
it
is
clear
that
fine-tuning
could
provide
accuracy
improvements
for
each
of
the
models
in
each
language
,
the
same
set
of
metaparameters
and
features
was
used
for
all
of
the
ten
languages
,
due
to
time
constraints
during
system
development
.
The features used were [7]:
• the number of children of the root word of the subtrees;
• the number of children of the root word of the subtree to the right of the root word;
• the number of children of the root word of the subtree to the left of the root word;
• the POS tag and DEPREL of the rightmost and leftmost children;
• the previous parser action;
• the features listed for the root words of the subtrees in Table 1.
In
addition
,
the
MaxEnt
models
also
used
selected
combinations
of
these
features
.
The
classes
used
to
represent
parser
actions
were
designed
to
encode
all
aspects
of
an
action
(
shift
vs.
reduce
,
right
vs.
left
,
and
dependency
label
)
simultaneously
.
Results
for
each
of
the
ten
languages
are
shown
in
table
2
as
labeled
and
unlabeled
attachment
scores
,
along
with
the
average
labeled
attachment
score
and
highest
labeled
attachment
score
for
all
participants
in
the
shared
task
.
Our
results
shown
in
boldface
were
among
the
top
three
scores
for
those
particular
languages
(
five
out
of
the
ten
languages
)
.
Table 1: Additional features.
Table 2: Multilingual results.

[6] Each dependency tree is deprojectivized before the combination occurs.
[7] S(n) denotes the nth item from the top of the stack (where S(1) is the item on top of the stack), and Q(n) denotes the nth item in the queue. For a description of the feature names in capital letters, see the shared task description (Nivre et al., 2007).

4 Domain Adaptation Experiments
Just as we used multiple LR models in the multilingual track, in the domain adaptation track we first trained two LR models on the out-of-domain labeled training data. The first was a forward MaxEnt model, and the second was a backward SVM model.
We
used
these
two
models
to
perform
a
procedure
similar
to
a
single
iteration
of
co-training
,
except
that
selection
of
the
newly
(
automatically
)
produced
training
instances
was
done
by
selecting
sentences
for
which
the
two
models
produced
identical
analyses
.
On
the
development
data
we
verified
that
sentences
for
which
there
was
perfect
agreement
between
the
two
models
had
labeled
attachment
score
just
above
90
on
average
,
even
though
each
of
the
models
had
accuracy
between
78
and
79
over
the
entire
development
set
.
Our approach was as follows:
(1) We trained the forward MaxEnt and backward SVM models using the out-of-domain labeled training data;
(2) We then used each of the models to parse the first two of the three sets of domain-specific unlabeled data that were provided (we did not use the larger third set);
(3) We compared the output of the two models, and selected only the analyses that were produced identically by both models;
(4) We retrained the forward MaxEnt model with the new, larger training set; and finally
(5) We used this model to parse the test data.
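The procedure above can be sketched as a single function. Everything here is schematic: `train_a` and `train_b` are hypothetical training routines that take a list of (sentence, analysis) pairs and return a parser, modelled as a callable from a sentence to its analysis. In the setting described in this section they would correspond to the forward MaxEnt and backward SVM learners.

```python
def adapt(labeled, unlabeled, train_a, train_b):
    """One round of agreement-based selection followed by retraining.

    labeled   : list of (sentence, analysis) pairs from the source domain
    unlabeled : list of sentences from the target domain
    train_a, train_b : hypothetical training functions returning a parser,
                       i.e. a callable mapping a sentence to an analysis
    """
    model_a = train_a(labeled)
    model_b = train_b(labeled)

    agreed = []
    for sentence in unlabeled:
        analysis_a = model_a(sentence)
        # Keep a sentence only when both models produce the same analysis.
        if analysis_a == model_b(sentence):
            agreed.append((sentence, analysis_a))

    # Retrain the first learner on the enlarged training set.
    return train_a(labeled + agreed), agreed
```

Unlike standard co-training, only one of the two learners is retrained, and the selection criterion is exact agreement on the whole sentence rather than a confidence score.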
Following
this
procedure
we
obtained
a
labeled
attachment
score
of
81.06
,
and
unlabeled
attachment
score
of
83.42
,
both
the
highest
scores
for
this
track
.
This
was
done
without
the
use
of
any
additional
resources
(
closed
track
)
,
but
these
results
are
also
higher
than
the
top
score
for
the
open
track
,
where
the
use
of
certain
additional
resources
was
allowed
.
See
(
Nivre
et
al.
,
2007
)
.
5
Analysis
and
Discussion
One
of
the
main
assumptions
in
our
use
of
different
models
based
on
the
same
algorithm
is
that
while
the
output
generated
by
those
models
may
often
differ
,
agreement
between
the
models
is
an
indication
of
correctness
.
In
our
domain
adaptation
approach
,
this
was
clearly
true
.
In
fact
,
the
approach
would
not
have
worked
if
this
assumption
were
false
.
Experiments
on
the
development
set
were
encouraging
.
As
stated
before
,
when
the
parsers
agreed
,
labeled
attachment
score
was
over
90
,
even
though
the
score
of
each
model
alone
was
lower
than
79
.
The
domain-adapted
parser
had
a
score
of
82.1
,
a
significant
improvement
.
Interestingly
,
the
ensemble
used
in
the
multilingual
track
also
produced
good
results
on
the
development
set
for
the
domain
adaptation
data
,
without
the
use
of
the
unlabeled
data
at
all
,
with
a
score
of
81.9
(
although
the
ensemble
is
more
expensive
to
run
)
.
The
different
models
used
in
each
track
were
distinct
in
a
few
ways
:
(
1
)
direction
(
forward
or
backward
)
;
(
2
)
learner
(
MaxEnt
or
SVM
)
;
and
(
3
)
search
strategy
(
best-first
or
deterministic
)
.
Of
those
differences
,
the
first
one
is
particularly
interesting
in
single-stack
shift-reduce
models
,
such as ours
.
In
these
models
,
the
context
to
each
side
of
a
(
potential
)
dependency
differs
in
a
fundamental
way
.
To
one
side
,
we
have
tokens
that
have
already
been
processed
and
are
already
in
subtrees
,
and
to
the
other
side
we
simply
have
a
look-ahead
of
the
remaining
input
sentence
.
This
way
,
the
context
of
the
same
dependency
in
a
forward
parser
may
differ
significantly
from
the
context
of
the
same
dependency
in
a
backward
parser
.
Interestingly
,
the
accuracy
scores
of
the
MaxEnt
backward
models
were
found
to
be
generally
just
below
the
accuracy
of
their
corresponding
forward
models
when
tested
on
development
data
,
with
two
exceptions
:
Hungarian
and
Turkish
.
In
Hungarian
,
the
accuracy
scores
produced
by
the
forward
and
backward
MaxEnt
LR
models
were
not
significantly
different
,
with
both
labeled
attachment
scores
at
about
77.3
(
the
SVM
model
score
was
76.1
,
and
the
final
combination
score
on
development
data
was
79.3
)
.
In
Turkish
,
however
,
the
backward
score
was
significantly
higher
than
the
forward
score
,
75.0
and
72.3
,
respectively
.
The
forward
SVM
score
was
73.1
,
and
the
combined
score
was
75.8
.
In
experiments
performed
after
the
official
submission
of
results
,
we
evaluated
a
backward
SVM
model
(
which
was
trained
after
submission
)
on
the
same
development
set
,
and
found
it
to
be
significantly
more
accurate
than
the
forward
model
,
with
a
score
of
75.7
.
Adding
that
model
to
the
combination
raised
the
combination
score
to
77.9
(
a
large
improvement
from
75.8
)
.
The
likely
reason
for
this
difference
is
that
over
80
%
of
the
dependencies
in
the
Turkish
data
set
have
the
head
to
the
right
of
the
dependent
,
while
fewer than
4
%
have
the
head
to
the
left
.
This
means
that
the
backward
model
builds
much
more
partial
structure
in
the
stack
as
it
consumes
input
tokens
,
while
the
forward
model
must
consume
most
tokens
before
it
starts
making
attachments
.
In
other
words
,
context
in
general
in
the
backward
model
has
more
structure
,
and
attachments
are
made
while
there
are
still
look-ahead
tokens
,
while
the
opposite
is
generally
true
in
the
forward
model
.
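The head-direction statistic behind this explanation is straightforward to compute from a treebank. A minimal sketch, assuming each sentence is given as a `{dependent: head}` dictionary of 1-based word positions with 0 for the root (the function name and the input format are assumptions for illustration):

```python
def head_direction_stats(treebank):
    """Fraction of dependencies whose head is to the right, to the left, or the root."""
    counts = {"head_right": 0, "head_left": 0, "root": 0}
    for heads in treebank:
        for dep, head in heads.items():
            if head == 0:
                counts["root"] += 1
            elif head > dep:
                counts["head_right"] += 1
            else:
                counts["head_left"] += 1
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()} if total else counts
```

A strong skew toward one direction suggests which parsing direction lets a stepwise parser make attachments early, while look-ahead tokens are still available.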
6
Conclusion
Our
results
demonstrate
the
effectiveness
of
even
small
ensembles
of
parsers
that
are
relatively
similar
(
using
the
same
features
and
the
same
algorithm
)
.
There
are
several
possible
extensions
and
improvements
to
the
approach
we
have
described
.
For
example
,
in
section
3
we
mention
the
use
of
different
weighting
schemes
in
dependency
voting
.
We
list
additional
ideas
that
were
not
attempted
due
to
time
constraints
,
but
that
are
likely
to
produce
improved
results
.
One
of
the
simplest
improvements
to
our
approach
is
simply
to
train
more
models
with
no
other
changes
to
our
set-up
.
As
mentioned
in
section
5
,
the
addition
of
a
backward
SVM
model
did
improve
accuracy
on
the
Turkish
set
significantly
,
and
it
is
likely
that
improvements
would
also
be
obtained
in
other
languages
.
In
addition
,
other
learning
approaches
,
such
as
memory-based
language
processing
(
Daelemans
and
Van
den
Bosch
,
2005
)
,
could
be
used
.
A
drawback
of
adding
more
models
that
became
obvious
in
our
experiments
was
the
increased
cost
of
both
training
(
for
example
,
the
SVM
parsers
we
used
required
significantly
longer
to
train
than
the
MaxEnt
parsers
)
and
run-time
(
parsing
with
MBL
models
can
be
several
times
slower
than
with
MaxEnt
,
or
even
SVM
)
.
A
similar
idea
that
may
be
more
effective
,
but
requires
more
effort
,
is
to
add
parsers
based
on
different
approaches
.
For
example
,
using
MSTParser
(
McDonald
and
Pereira
,
2005
)
,
a
large-margin
all-pairs
parser
,
in
our
domain
adaptation
procedure
results
in
significantly
improved
accuracy
(
83.2
LAS
)
.
Of
course
,
the
use
of
different
approaches
used
by
different
groups
in
the
CoNLL
2006
and
2007
shared
tasks
represents
great
opportunity
for
parser
ensembles
.
Acknowledgements
We
thank
the
shared
task
organizers
and
treebank
providers
.
We
also
thank
the
reviewers
for
their
comments
and
suggestions
,
and
Yusuke
Miyao
for
insightful
discussions
.
This
work
was
supported
in
part
by
Grant-in-Aid
for
Specially
Promoted
Research
18002007
.
