We
describe
an
incremental
parser
that
was
trained
to
minimize
cost
over
sentences
rather
than
over
individual
parsing
actions
.
This
is
an
attempt
to
use
the
advantages
of
the
two
top-scoring
systems
in
the
CoNLL-X
shared
task
.
In
the
evaluation
,
we
present
the
performance
of
the
parser
in
the
Multilingual
task
,
as
well
as
an
evaluation
of
the
contribution
of
bidirectional
parsing
and
beam
search
to
the
parsing
performance
.
1
Introduction
The
two
best-performing
systems
in
the
CoNLL-X
shared
task
(
Buchholz
and
Marsi
,
2006
)
can
be
classified
along
two
lines
depending
on
the
method
they
used
to
train
the
parsing
models
.
Although
the
parsers
are
quite
different
,
their
creators
could
report
near-tie
scores
.
The
approach
of
the
top
system
(
McDonald
et
al.
,
2006
)
was
to
fit
the
model
to
minimize
cost
over
sentences
,
while
the
second-best
system
(
Nivre
et
al.
,
2006
)
trained
the
model
to
maximize
performance
over
individual
decisions
in
an
incremental
algorithm
.
This
difference
is
a
natural
consequence
of
their
respective
parsing
strategies
:
CKY-style
maximization
of
link
score
and
incremental
parsing
.
In
this
paper
,
we
describe
an
attempt
to
unify
the
two
approaches
:
an
incremental
parsing
strategy
that
is
trained
to
maximize
performance
over
sentences
rather
than
over
individual
parsing
actions
.
of
input
words
W
,
and
builds
the
parse
tree
incrementally
using
a
set
of
parsing
actions
(
see
Table
1
)
.
It
can
be
shown
that
Nivre
's
parser
creates
projective
and
acyclic
graphs
and
that
every
projective
dependency
graph
can
be
produced
by
a
sequence
of
parser
actions
.
In
addition
,
the
worst-case
number
of
actions
is
linear
with
respect
to
the
number
of
words
in
the
sentence
.
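As a minimal sketch of how such a transition system operates, the following Python code implements the standard arc-eager formulation usually associated with Nivre's parser. The `State` class, the action names, and the helper functions are illustrative assumptions, not the authors' implementation, and dependency labels are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class State:
    stack: List[int]                       # S: partially processed word indices
    queue: List[int]                       # I: remaining input word indices
    heads: Dict[int, int] = field(default_factory=dict)  # A: dependent -> head


def initial(n_words: int) -> State:
    """Start with an empty stack and all words in the input list."""
    return State(stack=[], queue=list(range(n_words)))


def legal_actions(state: State) -> List[str]:
    """Return the actions whose preconditions hold in this state."""
    actions = []
    if state.queue:
        actions.append("shift")
        if state.stack:
            actions.append("right-arc")
            if state.stack[-1] not in state.heads:
                actions.append("left-arc")   # top must not have a head yet
    if state.stack and state.stack[-1] in state.heads:
        actions.append("reduce")             # top must already have a head
    return actions


def apply(state: State, action: str) -> State:
    """Return the successor state; the input state is left untouched."""
    stack, queue, heads = list(state.stack), list(state.queue), dict(state.heads)
    if action == "shift":
        stack.append(queue.pop(0))
    elif action == "left-arc":               # next input word heads the stack top
        heads[stack.pop()] = queue[0]
    elif action == "right-arc":              # stack top heads the next input word
        heads[queue[0]] = stack[-1]
        stack.append(queue.pop(0))
    elif action == "reduce":
        stack.pop()
    else:
        raise ValueError(f"unknown action: {action}")
    return State(stack, queue, heads)


def run(n_words: int, actions: List[str]) -> State:
    """Apply a sequence of actions, checking each precondition."""
    state = initial(n_words)
    for action in actions:
        if action not in legal_actions(state):
            raise ValueError(f"illegal action {action!r}")
        state = apply(state, action)
    return state
```

For a three-word sentence whose second word is the root, `run(3, ["shift", "left-arc", "shift", "right-arc"])` attaches words 0 and 2 to word 1. Each word is moved off the input list exactly once and popped from the stack at most once, which is why the number of actions grows linearly with sentence length.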
2.2
Handling
Nonprojective
Parse
Trees
While
the
parsing
algorithm
produces
projective
trees
only
,
nonprojective
arcs
can
be
handled
using
a
preprocessing
step
before
training
the
model
and
a
postprocessing
step
after
parsing
the
sentences
.
The
projectivization
algorithm
(
Nivre
and
Nilsson
,
2005
)
iteratively
moves
each
nonprojective
arc
upward
in
the
tree
until
the
whole
tree
is
projective
.
To
be
able
to
recover
the
nonprojective
arcs
after
parsing
,
the
projectivization
operation
replaces
the
labels
of
the
arcs
it
modifies
with
traces
indicating
which
links
should
be
moved
and
where
to
attach
them
(
the
"
Head+Path
"
encoding
)
.
The
model
is
trained
with
these
new
labels
,
which
makes
it
possible
to
carry
out
the
reverse
operation
and
produce
nonprojective
structures
.
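The lifting step can be sketched as follows. This is a simplified illustration, assuming a tree stored as a dictionary from word index (1..n) to head index, with 0 as an artificial root. It records only the label of the original head on a lifted arc (written here with a hypothetical `^` separator), which is less than the full Head+Path encoding: the path markers on intermediate arcs are omitted. All function names are assumptions made for this example.

```python
def dominates(heads, h, node):
    """True if h is node itself or an ancestor of node (0 is the root)."""
    while node != 0:
        if node == h:
            return True
        node = heads[node]
    return h == 0


def is_projective_arc(heads, d):
    """An arc h -> d is projective if h dominates every word between them."""
    h = heads[d]
    lo, hi = min(h, d), max(h, d)
    return all(dominates(heads, h, k) for k in range(lo + 1, hi))


def projectivize(heads, labels):
    """Lift nonprojective arcs upward until the whole tree is projective."""
    heads, new_labels = dict(heads), dict(labels)
    while True:
        bad = [d for d in heads if not is_projective_arc(heads, d)]
        if not bad:
            return heads, new_labels
        # Lift the shortest nonprojective arc first.
        d = min(bad, key=lambda k: abs(heads[k] - k))
        h = heads[d]
        if "^" not in new_labels[d]:
            # Trace: remember the label of the original syntactic head.
            new_labels[d] = labels[d] + "^" + labels[h]
        heads[d] = heads[h]      # reattach d to its former grandparent
```

In a three-word example with `heads = {1: 3, 2: 0, 3: 2}`, the arc from word 3 to word 1 crosses the root word 2; one lift reattaches word 1 to word 2 and marks its label with the label of word 3, so that a postprocessing step could search for a dependent of word 2 carrying that label and move the arc back.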
2.3
Bidirectional
Parsing
Shift-reduce
is
by
construction
a
directional
parser
,
typically
applied
from
left
to
right
.
To
make
better
use
of
the
training
set
,
we
applied
the
algorithm
in
both
directions
as
in
Johansson
and
Nugues
(
2006
)
and
Sagae
and
Lavie
(
2006
)
for
all
languages
except
Catalan
and
Hungarian
.
This
,
we
believe
,
also
has
the
advantage
of
making
the
parser
less
sensitive
to
whether
the
language
is
head-initial
or
head-final
.
We
trained
the
model
on
projectivized
graphs
from
left
to
right
and
right
to
left
and
used
a
voting
strategy
based
on
link
scores
.
Each
link
was
assigned
a
score
(
simply
by
using
the
score
of
the
la
or
ra
actions
for
each
link
)
.
To resolve the conflicts between the two parses in a manner that makes the tree projective, single-head, rooted, and cycle-free, we applied the Eisner algorithm (Eisner, 1996).

Table 1: Nivre's parser transitions, where W is the initial word list; I, the current input word list; A, the graph of dependencies; and S, the stack. (n', n) denotes a dependency relation between n' and n, where n' is the head and n the dependent.
Parser actions    Conditions
Initialize
Terminate
Left-arc
Right-arc

As
in
our
previous
parser
(
Johansson
and
Nugues
,
2006
)
,
we
used
a
beam-search
extension
to
Nivre
's
original
algorithm
(
which
is
greedy
in
its
original
formulation
)
.
Each
parsing
action
was
assigned
a
score
,
and
the
beam
search
allows
us
to
find
a
better
overall
score
of
the
sequence
of
actions
.
In
this
work
,
we
used
a
beam
width
of
8
for
Catalan
,
Chinese
,
Czech
,
and
English
and
16
for
the
other
languages
.
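The beam-search idea can be sketched generically: instead of committing to the single best action in each state, keep the `width` best partial action sequences and extend them in parallel. The code below is an illustrative sketch, not the authors' implementation; `successors` and `is_final` are hypothetical callbacks standing in for the parser's scored transitions and its termination test, and the toy search space exists only to show a case where greedy search fails.

```python
import heapq


def beam_search(start, successors, is_final, width):
    """Return (total_score, state) for the best final state found.

    successors(state) yields (action_score, next_state) pairs;
    is_final(state) tells whether a state needs no further actions.
    """
    beam = [(0.0, start)]
    while not all(is_final(state) for _, state in beam):
        candidates = []
        for total, state in beam:
            if is_final(state):
                candidates.append((total, state))   # keep finished hypotheses
                continue
            for score, nxt in successors(state):
                candidates.append((total + score, nxt))
        # Prune to the `width` highest-scoring hypotheses.
        beam = heapq.nlargest(width, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])


# Toy search space: action "a" looks best locally but leads to a poor total.
TOY = {
    (): [(2.0, ("a",)), (1.0, ("b",))],
    ("a",): [(0.0, ("a", "x"))],
    ("b",): [(3.0, ("b", "y"))],
}


def toy_successors(state):
    return TOY[state]


def toy_is_final(state):
    return len(state) == 2
```

With `width=1` the search is greedy and ends in `("a", "x")` with a total score of 2.0; with `width=2` the locally worse action survives pruning and the search finds `("b", "y")` with a total of 4.0.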
We
model
the
parsing
problem
for
a
sentence
x
as
finding
the
parse
$\hat{y} = \arg\max_{y} F(x, y)$
that
maximizes
a
discriminant
function
F.
In
this
work
,
we
consider
linear
discriminants
of
the
following
form
:

$$F(x, y) = w \cdot \Psi(x, y)$$

where $\Psi(x, y)$
is
a
numeric
feature
representation
of
the
pair
(
x
,
y
)
and
w
a
vector
of
feature
weights
.
Learning
F
in
this
case
comes
down
to
assigning
good
weights
in
the
vector
w.
Machine
learning
research
for
similar
problems
has
generally
used
margin-based
formulations
.
These
include
global
batch
methods
such
as
SVMstruct
(
Tsochantaridis
et
al.
,
2005
)
as
well
as
online
methods
such
as
the
Online
Passive-Aggressive
Algorithm
(
OPA
)
(
Crammer
et
al.
,
2006
)
.
Although
the
batch
methods
are
formulated
very
elegantly
,
they
do
not
seem
to
scale
well
to
the
large
training
sets
prevalent
in
NLP
contexts
;
we
briefly
considered
using
SVMstruct
but
training
was
too
time-consuming
.
The
online
methods
,
on
the
other
hand
,
although
less
theoretically
appealing
,
can
handle
realistically
sized
data
sets
and
have
successfully
been
applied
in
dependency
parsing
(
McDonald
et
al.
,
2006
)
.
Because
of
this
,
we
used
the
OPA
algorithm
throughout
this
work
.
3.2
Implementation
In
the
online
learning
framework
,
the
weight
vector
is
constructed
incrementally
.
At
each
step
,
the
algorithm
computes
an
update
to
the
weight
vector
based
on
the
current
example
.
The
resulting
weight
vector
is
frequently
overfit
to
the
last
examples
.
One
way
to
reduce
overfitting
is
to
use
the
average
of
all
successive
weight
vectors
as
the
result
of
the
training
(
Freund
and
Schapire
,
1999
)
.
Algorithm
1
shows
the
algorithm
.
It
uses
an
"
aggressiveness
"
parameter
C
to
reduce
overfitting
,
analogous
to
the
C
parameter
in
SVMs
.
The
algorithm
also
needs
a
cost
function
p
,
which
describes
how
much
a
parse
tree
deviates
from
the
gold
standard
.
In
this
work
,
we
defined
p
as
the
sum
of
link
costs
,
where
the
link
cost
was
0
for
a
correct
dependency
link
with
a
correct
label
,
0.5
for
a
correct
link
with
an
incorrect
label
,
and
1
for
an
incorrect
link
.
The
number
of
iterations
was
5
for
all
languages
.
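To make the cost function and the weight update concrete, here is a hedged sketch in Python. The link costs (0, 0.5, 1) follow the text above. The update itself assumes the cost-sensitive PA-I form of Crammer et al. (2006), with the square root of the cost as the margin, and may differ in detail from Algorithm 1. Weight averaging is omitted, and sparse feature vectors are represented as plain dictionaries. All function names are illustrative.

```python
import math


def link_cost(gold, pred):
    """gold and pred are (head, label) pairs for one token."""
    if gold[0] != pred[0]:
        return 1.0                              # incorrect link
    return 0.0 if gold[1] == pred[1] else 0.5   # correct link, label may differ


def tree_cost(gold_tree, pred_tree):
    """Sum of link costs over all tokens of a sentence."""
    return sum(link_cost(g, p) for g, p in zip(gold_tree, pred_tree))


def dot(w, feats):
    return sum(w.get(k, 0.0) * v for k, v in feats.items())


def pa_update(w, gold_feats, pred_feats, cost, C):
    """One PA-I step (assumed form); modifies w in place and returns tau."""
    # Difference between the gold and the predicted feature vectors.
    diff = dict(gold_feats)
    for k, v in pred_feats.items():
        diff[k] = diff.get(k, 0.0) - v
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return 0.0                              # identical features: no update
    # Margin violation: the prediction must score below gold by sqrt(cost).
    loss = dot(w, pred_feats) - dot(w, gold_feats) + math.sqrt(cost)
    tau = min(C, max(0.0, loss) / norm_sq)      # C caps the step size
    for k, v in diff.items():
        w[k] = w.get(k, 0.0) + tau * v
    return tau
```

With zero weights, a cost of 1, and gold and predicted feature vectors that differ in two unit-valued features, the unclipped step would be 0.5; a smaller `C` clips it, which is how the aggressiveness parameter limits the influence of any single example.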
For
a
sentence
x
and
a
parse
tree
y
,
we
defined
the
feature
representation
by
finding
the
sequence
$((S_1, I_1), a_1), ((S_2, I_2), a_2), \ldots$
of
states
and
their
corresponding
actions
,
and
creating
a
feature
vector
for
each
state
/
action
pair
.
The discriminant function was thus written

$$F(x, y) = \sum_{i} w \cdot \psi((S_i, I_i), a_i)$$

where $\psi$ is a feature function that assigns a feature vector to a state $(S_i, I_i)$ and the action $a_i$ taken in that state.

Algorithm 1 The Online PA Algorithm
input Training set $\mathcal{T} = \{(x_t, y_t)\}_{t=1}^{T}$
      Number of iterations N
      Regularization parameter C
      Cost function p
Initialize w to zeros
repeat N times
  for $(x_t, y_t)$ in $\mathcal{T}$

Table
2
shows
the
feature
sets
used
in
$\psi$
for
all
languages
.
In
principle
,
a
kernel
could
also
be
used
,
but
that
would
degrade
performance
severely
.
Instead
,
we
formed
a
new
vector
by
combining
features
pairwise
;
this
is
equivalent
to
using
a
quadratic
kernel
.
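A minimal sketch of such a pairwise expansion, assuming binary features identified by strings: the expanded vector keeps each active feature and adds one conjunction feature per pair of active features. The feature names and the `&` separator are made up for illustration, and the constant scaling factors that an exact quadratic kernel would place on the conjunctions are ignored.

```python
from itertools import combinations


def pairwise_features(active):
    """Expand a set of active binary features with all pairwise conjunctions."""
    feats = sorted(set(active))          # canonical order so pairs are unique
    pairs = [a + "&" + b for a, b in combinations(feats, 2)]
    return feats + pairs


def shared(f1, f2):
    """Dot product of two binary feature vectors given as name lists."""
    return len(set(f1) & set(f2))
```

For `n` active features the expansion has `n + n(n-1)/2` entries, so a linear model over the expanded vector can weight feature pairs explicitly while scoring remains a sparse dot product. If two states share `k` original features, their expanded vectors share `k + k(k-1)/2` entries.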
Since
the
history-based
feature
set
used
in
the
parsing
algorithm
makes
it
impossible
to
use
independence
to
factorize
the
scoring
function
,
an
exact
search
to
find
the
best-scoring
action
sequence
(
$\arg\max_y$
in
Algorithm
1
)
is
not
possible
.
However
,
the
beam
search
allows
us
to
find
a
reasonable
approximation
.
4
Results
Table
3
shows
the
results
of
our
system
in
the
Multilingual
task
.
4.1
Compared
to
SVM-based
Local
Classifiers
We
compared
the
performance
of
the
parser
with
a
parser
based
on
local
SVM
classifiers
(
Johansson
and
Nugues
,
2006
)
.
Table
4
shows
the
performance
of
both
parsers
on
the
Basque
test
set
.
We
see
that
what
is
gained
by
using
a
global
method
such
as
OPA
is
lost
by
sacrificing
the
excellent
classification
performance
of
the
SVM
.
Possibly
,
better
performance
could
be
achieved
by
using
a
large-margin
batch
method
such
as
SVMstruct
.
Table 2: Feature sets.
Fine POS    list
Features    top
Features    list
Features    list-1
Features    list+1
Features    list+2
Word        top
Word        top-1
Word        list
Word        list-1
Word        list+1
Lemma       top
Lemma       list
Lemma       list-1
Relation    top
Relation    top left
Relation    top right
Relation    list right
Word        top left
Word        top right
Word        list left
POS         top left
POS         top right
POS         list left
Features    top right
Features    first left

Table 3: Summary of results.
Languages    Unlabeled
Hungarian
Average result

Table 4: Accuracy by learning method.
Learning Method

To
investigate
the
influence
of
the
beam
width
on
the
performance
,
we
measured
the
accuracy
of
a
left-to-right
parser
on
a
development
set
for
Basque
(
15
%
of
the
training
data
)
as
a
function
of
the
width
.
Table
5
shows
the
result
.
We
see
clearly
that
widening
the
beam
considerably
improves
the
figures
,
especially
in
the
lower
ranges
.
Table
5
:
Accuracy
by
beam
width
.
We
also
investigated
the
contribution
of
the
bidirectional
parsing
.
Table
6
shows
the
result
of
this
experiment
on
the
Basque
development
set
(
the
same
15
%
as
in
4.2
)
.
The
beam
width
was
2
in
this
experiment
.
Table 6: Accuracy by parsing direction.
Direction        Accuracy
Left to right
Right to left
Bidirectional

Time
did
not
allow
a
full-scale
experiment
,
but
for
all
languages
except
Catalan
and
Hungarian
,
the
bidirectional
parsing
method
outperformed
the
unidirectional
methods
when
trained
on
a
20,000-word
subset
.
However
,
the
gain
of
using
bidirectional
parsing
may
be
more
obvious
when
the
treebank
is
small
.
For
all
languages
except
Czech
,
left-to-right
outperformed
right-to-left
parsing
.
5
Discussion
The
paper
describes
an
incremental
parser
that
we
trained
to
minimize
the
cost
over
sentences
,
rather
than
over
parsing
actions
as
is
usually
done
.
It
was
trained
using
the
Online
Passive-Aggressive
method
,
a
cost-sensitive
online
margin-based
learning
method
.
It
performed
reasonably
and
received
above-average
scores
for
most
languages
.
The
performance
of
the
parser
(
relative
to
the
other
teams
)
was
best
for
Basque
and
Turkish
,
which
were
two
of
the
smallest
treebanks
.
We found that the optimal number of iterations was 5 for Basque (the smallest treebank). Since we did not have time to investigate this parameter for the other languages, we used the same number for all languages.
This
may
have
had
a
detrimental
effect
for
some
languages
.
We
think
that
some
of
the
figures
might
be
squeezed
slightly
higher
by
optimizing
learning
parameters
and
feature
sets
.
This
work
shows
that
it
is
possible
to
combine
approaches
used
by
Nivre
's
and
McDonald
's
parsers
in
a
single
system
.
While
the
parser
is
outperformed
by
a
system
based
on
local
classifiers
,
we
still
hope
that
the
parsing
and
training
combination
described
here
opens
new
ways
in
parser
design
and
eventually
leads
to
the
improvement
of
parsing
performance
.
Acknowledgements
