This paper presents an online algorithm for the dependency parsing problem. We propose an adaptation of the Passive-Aggressive online learning algorithm to the dependency parsing domain. We evaluate the proposed algorithm on the CoNLL 2007 Shared Task and report an error analysis. Experimental results show that our system's score is above the average score of the participating systems.
1 Introduction
Singer, 2003).
The difference between MIRA-based parsing and history-based methods is that the MIRA-based parser is trained to maximize the accuracy of the overall tree.
MIRA-based parsing is close to maximum-margin parsing, as in Taskar et al. (2004) and Tsochantaridis et al. (2005).
However, unlike maximum-margin parsing, it is not limited by computation time to parsing sentences of 15 words or fewer.
MIRA-based parsing achieves state-of-the-art performance on English data (McDonald et al., 2005a; McDonald et al., 2006).
In this paper, we propose a new adaptation of online large-margin learning to the problem of dependency parsing.
Unlike the MIRA parser, our method does not need an optimization procedure in each learning update; it uses only a closed-form update equation. This might lead to faster training and an easier implementation.
The contributions of this paper are two-fold: First, we present a training algorithm called PA learning for dependency parsing, which is as easy to implement as the Perceptron, yet competitive with large-margin methods. This algorithm has implications for anyone interested in implementing discriminative training methods for any application.
Second, we evaluate the proposed algorithm on the multilingual task as well as the domain adaptation task (Nivre et al., 2007).
The remainder of the paper is organized as follows: Section 2 presents our dependency parsing with Passive-Aggressive learning. Section 3 discusses experimental results, and Section 4 gives conclusions and plans for future work.
2 Dependency Parsing with Passive-Aggressive Learning
This section presents a modification of Passive-Aggressive (PA) learning (Crammer et al., 2006) for dependency parsing.
We modify the PA algorithm to deal with structured prediction, where our problem is to learn a discriminant function that maps an input sentence x to a dependency tree y.
Figure 1 shows an example of dependency parsing, which depicts the relation of each word to another word within a sentence.
There are several algorithms for determining these relations between the words of a given sentence; for instance, the modified CKY algorithm (Eisner, 1996) can be used to derive them.

Figure 1: An example of a dependency tree (for the sentence "John hit the ball with the bat").
2.1 Parsing Algorithm
Dependency-tree parsing as the search for the maximum spanning tree (MST) in a graph was proposed by McDonald et al. (2005b).
In this subsection, we briefly describe the parsing algorithms based on first-order MST parsing.
Due to the time constraints of participation, we applied only the first-order decoding algorithm in CoNLL-2007. However, our algorithm can also be used for second-order parsing.
The score of an edge is the dot product s(i, j) = w · f(i, j), where f(i, j) is a high-dimensional binary feature representation of the edge from x_wi to x_wj. For example, in Figure 1 the edge from "hit" to "ball" can be represented as a vector of such binary features.
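As an illustrative sketch (the feature names and weights below are hypothetical, not the paper's actual templates), scoring an edge under a sparse binary feature representation reduces to summing the weights of the active features:

```python
def edge_score(weights, features):
    """Score of an edge as the dot product w . f(i, j).

    Since f(i, j) is binary and sparse, we represent it as the list of
    active feature names and sum their weights; features with no
    learned weight contribute 0.0.
    """
    return sum(weights.get(f, 0.0) for f in features)

# Hypothetical weights and active features for the edge "hit" -> "ball"
w = {"p-pos=VBD|c-pos=NN": 1.5, "p-word=hit|c-word=ball": 0.75}
feats = ["p-pos=VBD|c-pos=NN", "p-word=hit|c-word=ball", "dist=2"]
print(edge_score(w, feats))  # 2.25 (the unseen "dist=2" adds nothing)
```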
A basic question must be answered for models of this form: how do we find the dependency tree y with the highest score for a sentence x?
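To make this argmax concrete, the following sketch finds the best tree by exhaustively enumerating head assignments; it is exponential and for illustration only, since the decoding algorithms discussed next solve the same problem efficiently (the toy scores are invented):

```python
from itertools import product

def brute_force_best_tree(n, score):
    """Exhaustive argmax over dependency trees.

    n: number of words (token 0 is the artificial root).
    score[h][m]: score of attaching word m to head h.
    Enumerates every head assignment for words 1..n, keeps only acyclic
    ones (every word's head chain reaches the root; words may attach
    directly to the root), and returns the best score and assignment.
    """
    best, best_heads = float("-inf"), None
    for heads in product(range(n + 1), repeat=n):  # heads[m-1] = head of word m
        if any(h == m + 1 for m, h in enumerate(heads)):
            continue  # reject self-loops
        ok = True
        for m in range(1, n + 1):  # cycle check: follow heads up to the root
            seen, h = set(), m
            while h != 0:
                if h in seen:
                    ok = False
                    break
                seen.add(h)
                h = heads[h - 1]
            if not ok:
                break
        if not ok:
            continue
        total = sum(score[heads[m - 1]][m] for m in range(1, n + 1))
        if total > best:
            best, best_heads = total, heads
    return best, best_heads

# Toy 3-word sentence; score[h][m] values are made up for illustration.
score = [[0, 2, 1, 1],
         [0, 0, 5, 1],
         [0, 1, 0, 4],
         [0, 1, 1, 0]]
print(brute_force_best_tree(3, score))  # (11, (0, 1, 2)): root->1->2->3
```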
The two algorithms we employed in our dependency parsing model are Eisner's parsing algorithm (Eisner, 1996) and the Chu-Liu algorithm (Chu and Liu, 1965).
These algorithms are commonly used in other online-learning dependency parsers, such as that of McDonald et al. (2005a).
In the next subsection we address the problem of how to estimate the weight w_i associated with each feature in the training data using an online PA learning algorithm.
This subsection presents a modification of the PA algorithm for structured prediction, and its use in dependency parsing.
The Perceptron style for natural language processing problems, as initially proposed by Collins (2002), can provide state-of-the-art results in various domains, including text chunking and syntactic parsing.
The main drawback of the Perceptron-style algorithm is that it has no mechanism for attaining the maximum margin on the training data. It may therefore be difficult to obtain high accuracy on hard training data.
The structured support vector machine (Tsochantaridis et al., 2005) and the max-margin model (Taskar et al., 2004) can attain the maximum margin on given training data by solving an optimization problem (i.e., quadratic programming). Such optimization algorithms, however, require considerable computation time.
For the dependency parsing domain, McDonald et al. (2005a) modified the MIRA learning algorithm for structured domains so that the optimization problem can be solved using Hildreth's algorithm (Censor and Zenios, 1997), which is faster than general quadratic programming techniques.
In contrast to the previous method, this paper presents an online algorithm for dependency parsing in which we can attain the maximum margin on the training data without using optimization techniques. It is thus much faster and easier to implement.
The details of the PA algorithm for dependency parsing are presented below.
Each feature is associated with a weight value.
The goal of PA learning for dependency parsing is to obtain a parameter vector w that minimizes the hinge loss while maximizing the margin on the training data.
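One standard way to write this objective, adapting the hinge loss of Crammer et al. (2006) to trees (the notation Φ and ρ here is illustrative), uses the structured hinge loss

$$\ell_t = \max\bigl(0,\; s(x_t, \hat{y}_t) - s(x_t, y_t) + \rho(y_t, \hat{y}_t)\bigr),$$

where $\hat{y}_t$ is the predicted tree for sentence $x_t$ and $y_t$ the correct tree, together with a step size and update of the form

$$\tau_t = \min\!\left(C,\; \frac{\ell_t}{\bigl\|\Phi(x_t, y_t) - \Phi(x_t, \hat{y}_t)\bigr\|^2}\right), \qquad w_{t+1} = w_t + \tau_t\,\bigl(\Phi(x_t, y_t) - \Phi(x_t, \hat{y}_t)\bigr).$$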
[Algorithm 1 pseudocode; the recoverable fragments are: an input aggressiveness parameter C, the output PA learning model, and a step in which the learner receives a sentence x_t.]

Algorithm 1: The Passive-Aggressive algorithm for dependency parsing.
Algorithm 1 shows the PA learning algorithm for dependency parsing; its three variants differ only in their update formulas.
In Algorithm 1, we employ two kinds of argmax algorithms: the first is the decoding algorithm for projective language data, and the second is for non-projective language data.
In line 8 of Algorithm 1, ρ(y, y_t) is a real-valued loss for the tree y_t relative to the correct tree y.
We define the loss of a dependency tree as the number of words that have an incorrect parent. Thus, the largest loss a dependency tree can have is the length of the sentence.
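This loss is straightforward to compute from head indices; a minimal sketch (the head arrays below are hypothetical):

```python
def tree_loss(gold_heads, pred_heads):
    """Loss of a predicted dependency tree: the number of words whose
    predicted parent differs from the gold parent, so the largest
    possible loss is the sentence length."""
    assert len(gold_heads) == len(pred_heads)
    return sum(g != p for g, p in zip(gold_heads, pred_heads))

# "John hit the ball": hypothetical 1-based head indices, 0 = root
gold = [2, 0, 4, 2]   # John<-hit, hit<-root, the<-ball, ball<-hit
pred = [2, 0, 2, 2]   # "the" wrongly attached to "hit"
print(tree_loss(gold, pred))  # 1
```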
A similar loss function is defined for labeled dependency trees.
Algorithm 1 returns an averaged weight vector: an auxiliary weight vector v is maintained that accumulates the values of w after each iteration, and the returned weight vector is the average of all the weight vectors throughout training. Averaging has been shown to help reduce overfitting (McDonald et al., 2005a; Collins, 2002).
It is easy to see that the main difference between the PA algorithms and both the Perceptron algorithm (PC) (Collins, 2002) and the MIRA algorithm (McDonald et al., 2005a) lies in line 9.
In the PC algorithm we do not need the value τ_t, while in the MIRA algorithm an optimization procedure is needed to compute τ_t. We also have three update formulations for obtaining τ_t in line 9.
In the scope of this paper, we focus only on the second update formulation (the PA-I method) for training on dependency parsing data.
Table 3: Feature Set 3: in-between POS features and surrounding word POS features.

The features used in our system are described below.
Tables 1 and 2 show our basic features. These features are added for entire words as well as for the 5-gram prefix if the word is longer than 5 characters.
• In addition to the features shown in Table 1, the morphological information for each pair of words p-word and c-word is represented. We also add the conjunction of the morphological information of p-word and c-word. We do not use the LEMMA and CPOSTAG information in our feature sets. The morphological information is obtained from the FEAT information.
• Table 3 shows our Feature Set 3, which takes the form of a POS trigram: the POS of the parent, of the child, and of a word in between, for all words linearly between the parent and the child. This feature was particularly helpful for nouns in identifying their parents (McDonald et al., 2005a).
• Table 3 also depicts features taking the form of a POS 4-gram: the POS of the parent, the child, the word before/after the parent, and the word before/after the child. The system also used backoff features with various trigrams in which one of the local-context POS tags was removed.
• All features are also conjoined with the direction of attachment, as well as the distance between the two words being attached.
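To illustrate how the in-between POS trigram template, conjoined with direction and distance, might be generated (the feature-string format here is an assumption, not the paper's exact template):

```python
def in_between_pos_features(pos, parent, child):
    """POS-trigram features in the spirit of Feature Set 3: the POS of
    the parent, of the child, and of each word linearly between them,
    conjoined with attachment direction and distance."""
    lo, hi = sorted((parent, child))
    direction = "R" if parent < child else "L"
    dist = hi - lo
    feats = []
    for b in range(lo + 1, hi):  # every word strictly between the pair
        base = f"p-pos={pos[parent]}|b-pos={pos[b]}|c-pos={pos[child]}"
        feats.append(f"{base}|dir={direction}|dist={dist}")
    return feats

# "John hit the ball" (hypothetical tags), edge hit(1) -> ball(3)
pos = ["NNP", "VBD", "DT", "NN"]
print(in_between_pos_features(pos, 1, 3))
# ['p-pos=VBD|b-pos=DT|c-pos=NN|dir=R|dist=2']
```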
3 Experimental Results and Discussion
Table 4 shows the number of training and testing sentences for these languages.
The table shows that the average sentence length of the Arabic data is the largest while its training set is the smallest. These factors might affect the accuracy of our proposed algorithm, as we discuss later.
The training and testing were conducted on a Pentium IV at 4.3 GHz.
Detailed information about the data is given in the CoNLL-2007 shared task description.
We applied non-projective and projective parsing along with PA learning to the CoNLL-2007 data.
Table 5 reports experimental results using the first-order decoding method, in which an MST parsing algorithm (McDonald et al., 2005b) is applied for non-projective parsing and Eisner's method is used for projective language data.
In fact, in our method we applied non-projective parsing to the Italian data, the Turkish data, and the Greek data.
This was because we did not have enough time to train on all the training data using both projective and non-projective parsing. This is a general problem of discriminative learning methods when training on large data sets.
In addition, to save training time, we set the number of best trees k to 1 and the parameter C to 0.05.
Table 5 shows the comparison of the proposed method with the average score and the three top systems in CoNLL-2007.
As a result, our method yields results above the average score on the CoNLL-2007 shared task (Nivre et al., 2007).
Table 5 also indicates that the Basque results are lower than those for the other data sets. We obtained a UA score of 69.11 and an LA score of 58.16.
These are far from the Top-3 scores (81.13 and 75.49).
We checked the outputs on the Basque data to understand the main reason for the errors.
We found that our method's errors are usually mismatches with the gold data on the labels "ncmod" and "ncsubj".
The main reason might be that applying projective parsing to this data in both training and testing is not suitable, because the proportion of sentences with at least one non-projective relation in the data is large (26.1%).
The Arabic score is lower than the scores on other data because of the following difficulties for our method.
Morphological and sentence-length problems are the main factors that affect the accuracy of parsing the Arabic data.
In addition, the small training size for Arabic is also an obstacle to obtaining a good result.
Furthermore, since our work focused on improving accuracy on the English data, our configuration might be unsuitable for other languages.
This is an imbalance problem in our method.

Table 4: The data used in the multilingual track (Nivre et al., 2007), listing for each language the training size, the number of tokens, and tokens per sentence. NPR means non-projective relations; AL-1-NPR means at least 1 non-projective relation.
Table 5 also shows the comparison of our system with the average score and the Top-3 scores.
It shows that our system is accurate on the English data, while it has low accuracy on the Basque and Arabic data.
We also evaluated our models on the domain adaptation task.
This task is to adapt our model, trained on the Penn Treebank data, to test data in the biomedical domain.
The pchemtb-closed shared task (Marcus et al., 1993; Johansson and Nugues, 2007; Kulick et al., 2004) is used to evaluate our models.
We do not use any additional unlabeled data in the biomedical domain.
Only the training data from the Penn Treebank is used to train our model.
Afterward, we carefully selected a suitable parameter using the development test set.
After several experiments, we set the parameter C to 0.01 and selected non-projective parsing for testing, which obtained the highest result on the development data.
After that, the trained model was used to parse the test data in the biomedical domain.
The score (UA = 82.04; LA = 79.50) shows that our method yields results above the average score (UA = 76.42; LA = 73.03).
In addition, our system officially came in 4th place out of 12 teams, within 1.5% of the top systems.
Despite the good result of applying our model to another domain, PA learning still seems sensitive to noise. We hope to address this problem in future work.
4 Conclusions
This paper presented an online algorithm for the dependency parsing problem, which we tested on data from various languages in the CoNLL-2007 shared task. The performance on the English data is close to the Top-3 scores.
We also applied our algorithm to the domain adaptation task, in which we focused only on training with the source data and selected a suitable parameter using the development set. The result is very good, as it is close to the Top-3 scores of the participating systems.
Future work will focus on extending our method to a semi-supervised version that can learn efficiently from both labeled and unlabeled data.
We hope that the application of the PA algorithm to other NLP problems, such as semantic parsing, will be explored in future work.
Acknowledgments
We would like to thank D. Yuret for his help in checking errors in our parser's outputs.
We would like to thank Vinh-Van Nguyen for his help during the revision process and Mary Ann Mooradian for correcting the paper.
We would like to thank the anonymous reviewers for helpful discussions and comments on the manuscript.
Thanks also to Sebastian Riedel for checking the issues raised in the reviews.
The work on this paper was supported by a Monbukagakusho 21st COE Program.
