In
this
paper
,
we
present
a
three-step
multilingual
dependency
parser
based
on
a
deterministic
shift-reduce
parsing
algorithm
.
Different
from
last
year
,
we
treat
root
parsing
as
a
separate
sequential
labeling
task
and
try
to
link
dependencies
between
neighboring
words
with
a
neighbor
parser
.
The
outputs
of
the
root
and
neighbor
parsers
were
encoded
as
features
for
the
shift-reduce
parser
.
In
addition
,
the
learners
we
used
for
the
two
parsers
and
the
shift-reduce
parser
are
quite
different
(
conditional
random
fields
and
the
modified
finite-Newton
method
support
vector
machines
)
.
We
found
that
our
method
could
benefit
from
the
two
preprocessing
stages
.
To
speed
up
training
,
this
year
we
employ
the
MFN-SVM
(
modified
finite-Newton
method
support
vector
machines
)
which
can
be
learned
in
linear
time
.
The
experimental
results
show
that
our
method
ranked
in
the
middle
among
the
23
teams
.
We
expect
that
our
method
could
be
further
improved
via
well-tuned
parameter
validations
for
different
languages
.
1
Introduction
The
target
of
dependency
parsing
is
to
automatically
recognize
the
head-modifier
relationships
between
words
in
natural
language
sentences
.
Usually
,
a
dependency
parser
can
construct
a
grammar
tree
similar
to
the
dependency
graph
.
This
year
,
the
CoNLL-2007
shared
task
(
Nivre
et
al.
,
2007
)
focuses
on
multilingual
dependency
parsing
based
on
ten
different
languages
(
Hajic
et
al.
,
2004
;
Aduriz
et
al.
,
2003
;
Marti
et
al.
,
2007
;
Chen
et
al.
,
2003
;
Bohmova
et
al.
,
2003
;
Marcus
et
al.
,
1993
;
Johansson
and
Nugues
,
2007
;
Prokopidis
et
al.
,
2005
;
Csendes
et
al.
,
2005
;
Montemagni
et
al.
,
2003
;
Oflazer
et
al.
,
2003
)
and
domain
adaptation
for
English
(
Marcus
et
al.
,
1993
;
Johansson
and
Nugues
,
2007
;
Kulick
et
al.
,
2004
;
MacWhinney
,
2000
;
Brown
,
1973
)
without
taking
the
language-specific
knowledge
into
consideration
.
The
ultimate
goal
of
these
tasks
is
to
design
ideal
multilingual
and
domain
portable
dependency
parsing
systems
.
To
accomplish
the
multilingual
and
domain
adaptation
tasks
,
we
present
a
three-pass
parsing
model
based
on
a
shift-reduce
algorithm
(
Yamada
and
Matsumoto
,
2003
;
Chang
et
al.
,
2006
)
,
namely
,
neighbor
parsing
,
root
relation
parsing
,
and
shift-reduce
parsing
.
Our
method
favors
examining
the
"
un-parsed
"
tokens
,
which
incrementally
shrink
.
At
the
beginning
,
the
parsing
direction
is
mainly
determined
by
the
amount
of
un-parsed
tokens
in
the
sentence
with
either
forward
or
backward
parse
.
In
this
step
,
the
projective
parsing
method
can
be
used
to
evaluate
most
of
the
non-projective
Treebank
datasets
.
Once
the
direction
is
determined
,
the
pseudo-projectivize
transformation
algorithm
(
Nivre
and
Nilsson
,
2005
)
converts
most
non-projective
training
data
into
projective
and
decodes
the
parsed
text
into
non-projective
.
Hereafter
,
both
neighbor-parser
and
root-parser
were
trained
to
discover
additional
features
for
the
downstream
shift-reduce
parse
model
.
We
found
that
the
two
additional
features
could
improve
the
performance
.
Subsequently
,
the
modified
shift-reduce
parsing
algorithm
starts
to
parse
the
final
dependencies
with
two-pass
processing
,
i.e.
,
predict
parse
action
and
label
the
relations
.
In
the
remainder
of
this
paper
,
Section
2
describes
the
proposed
parsing
model
,
and
Section
3
lists
the
experimental
settings
and
results
.
Section
4
presents
the
discussion
and
analysis
of
our
parser
.
In
Section
5
,
we
conclude
and
outline
future
directions
.
2
System
Description
Over
the
past
decades
,
many
state-of-the-art
parsing
algorithms
were
proposed
,
such
as
head-word
lexicalized
PCFG
(
Collins
,
1998
)
,
Maximum
Entropy
(
Charniak
,
2000
)
,
Maximum
/
Minimum
al.
(
2006
)
further
added
the
"
wait-right
"
action
to
the
words
that
had
children
and
could
not
be
reduced
in
current
state
.
This
could
avoid
the
so-called
"
too
early
reduce
"
problems
.
The
overall
parsing
model
can
be
found
in
Figure
1
.
Figure
2
illustrates
the
detailed
system
specification
of
our
parsing
model
.
[Figure 1: System architecture. Components shown: Training Data (CoNLL Format); Data (CoNLL Format); Pseudo Projective Encoding; CRF Learner (root label); Neighborhood Parser; Root Parser; SVM-MFN Learner; Parsing Algorithm (+Wait_Left); Pseudo Projective Decoding; Final Parse.]
As
shown
in
Figure
1
,
the
first
step
is
to
identify
the
neighbor
head-modifier
relations
between
two
consecutive
words
.
Cheng
et
al.
(
2006
)
also
reported
that
the
use
of
a
neighboring
dependency
attachment
tagger
enhances
the
unlabeled
attachment
scores
from
84.38
to
84.6
for
13
languages
.
Usually
,
it
is
the
case
that
the
selected
features
are
fixed
and
could
not
be
tuned
to
capture
the
second
order
features
(
McDonald
et
al.
,
2006
)
.
At
each
location
,
the
focus
and
next
words
are
always
compared
.
It
may
fail
to
link
the
next
and
next+1
word
pair
since
the
next
word
might
be
reduced
due
to
an
earlier
wrong
decision
.
[Figure 2: System spec. Surviving entries: I. Parsing Algorithm; Parser Characteristics; VI. Post-Processing; Additional / External Resources; Neighbor Parser; Deterministic; Pseudo-Projective en(de)-coding; Non-Used.]
However
,
starting
parsing
based
on
the
result
of
neighbor
parsing
is
not
a
good
idea
since
it
could
produce
error
propagation
problems
.
Rather
,
we
include
the
result
of
our
neighbor
parsing
as
features
to
increase
the
original
feature
set
.
In
the
preliminary
study
,
we
found
that
the
derived
features
are
very
useful
for
most
languages
.
As
in
conventional
sequential
tagging
problems
,
such
as
part-of-speech
tagging
and
phrase
chunking
,
we
employ
the
conditional
random
fields
(
CRF
)
as
learners
(
Kudo
et
al.
,
2004
)
.
The
basic
idea
of
the
neighbor
parsing
can
be
shown
in
Figure
3
.
The
first
and
second
columns
in
Figure
3
represent
the
word
form
and
the
fine-grained
POS
tag
,
while
the
third
column
indicates
if
this
word
has
the
LH
(
left-head
)
or
RH
(
right-head
)
with
associated
relations
or
O
(
no
neighbor
head
in
either
left
or
right
neighbor
word
)
.
The
used
features
are
:
Word
,
fine-grained
POS
,
bigram
,
and
bi-POS
with
context
window
=
2
(
left
)
and
4
(
right
)
far	RB	LH_AMOD
holds	VBZ	O
important	JJ	RH_NMOD
lessons	NNS	O
for	IN	LH_NMOD
companies	NNS	LH_PMOD
(RH_NMOD: right head with the NMOD relation type)
Figure 3: Sequential tagging model for neighbor parse
Unfortunately
,
for
some
languages
,
like
Chinese
and
Czech
,
training
with
CRF
is
slow
because
of
the
large
number
of
features
and
the
head
relations
.
To
make
it
practical
,
we
focus
on
just
three
types
:
left
head
,
right
head
,
and
out-of-neighbor
.
This
effectively
reduces
most
of
the
feature
space
for
the
CRF
.
The
training
time
for
the
neighbor
parser
with
only
three
categories
is
less
than
5
minutes
while
it
takes
three
days
when
all
the
relation
tags
are
taken
into
account
.
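The neighbor labels described above can be derived directly from gold head indices. The following is a minimal sketch, not the authors' code: the function name `neighbor_labels` and the 1-based head convention (0 for the root) are assumptions made for illustration. Passing no relation list yields the reduced three-category tag set (LH, RH, O).

```python
from typing import List, Optional


def neighbor_labels(heads: List[int],
                    relations: Optional[List[str]] = None) -> List[str]:
    """Return one neighbor tag per word.

    heads[i] is the 1-based index of the head of word i+1 (0 = root).
    A word is tagged LH if its head is the word immediately to its left,
    RH if its head is the word immediately to its right, and O otherwise.
    If relations is given, the relation name is appended (e.g. RH_NMOD).
    """
    labels = []
    for i, head in enumerate(heads):
        position = i + 1
        if head != 0 and head == position - 1:
            tag = "LH"
        elif head == position + 1:
            tag = "RH"
        else:
            labels.append("O")
            continue
        if relations is not None:
            tag = f"{tag}_{relations[i]}"
        labels.append(tag)
    return labels


# Fragment "important lessons for companies" with hypothetical gold heads.
heads = [2, 0, 2, 3]
relations = ["NMOD", "ROOT", "NMOD", "PMOD"]
full_tags = neighbor_labels(heads, relations)
coarse_tags = neighbor_labels(heads)
```

With the full tag set, the number of CRF output labels grows with the number of relation types, which is consistent with the training-time gap reported above.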
After
the
neighbor
parse
,
the
tagged
labels
are
good
features
for
the
root
parse
.
In
the
second
stage
,
the
root
parser
identifies
the
root
words
in
the
sentence
.
Nevertheless
,
for
some
languages
,
such
as
Arabic
and
Czech
,
the
roots
might
be
of
several
types
,
in
contrast
to
Chinese
and
English
,
in
which
roots
have
only
one
label
.
Similar
to
the
neighbor
parser
,
we
also
take
the
root
label
into
account
.
As
noted
,
for
Chinese
and
English
,
the
goal
of
the
root
parser
can
be
reduced
to
determining
whether
the
current
word
is
root
or
not
.
surprising	JJ	RH_NMOD	O
progress	NN	O	O
so	RB	O	O
important	JJ	RH_NMOD	O
lessons	NNS	O	O
for	IN	LH_NMOD	O
companies	NNS	LH_PMOD	O
Figure 4: Sequential tagging model for root parse
Similar
to
the
neighbor
parse
,
the
root
parsing
task
can
also
be
treated
as
a
sequential
tagging
problem
.
Figure
4
shows
the
basic
concept
of
the
root
parser
.
The
third
column
is
mainly
derived
from
the
neighbor
parser
,
while
the
fourth
column
represents
whether
the
current
word
is
a
root
with
relation
or
not
.
2.3
Parsing
Algorithm
After
adding
the
neighbor
and
root
parser
output
as
features
,
in
the
final
stage
,
the
modified
Yamada
's
shift-reduce
parsing
algorithm
(
Yamada
and
Matsumoto
,
2003
)
is
then
run
.
This
method
is
deterministic
and
can
deal
with
projective
data
only
.
There
are
three
basic
operation
(
action
)
types
:
Shift
(
S
)
,
Left
(
L
)
,
and
Right
(
R
)
.
The
operation
is
mainly
determined
via
the
classifier
according
to
the
selected
features
(
see
2.4
)
.
Each
time
,
the
operation
is
applied
to
two
unparsed
words
,
namely
,
focus
and
next
.
If
there
exists
an
arc
between
the
two
words
(
either
left
or
right
)
,
then
the
head
of
focus
or
next
word
is
found
;
otherwise
(
i.e.
,
shift
)
,
next
two
words
are
considered
at
next
stage
.
This
method
could
be
economically
performed
via
maintaining
two
pointers
,
focus
,
and
next
without
an
explicit
stack
.
The
parse
operation
is
iteratively
run
until
no
more
relation
can
be
found
in
the
sentence
.
Chang
et
al.
(
2006
)
further
reported
on
the
use
of
"
step-back
"
in
comparison
to
the
original
"
stay
"
.
Furthermore
,
they
also
added
the
"
wait-left
"
operations
to
prevent
the
"
too
early
reduce
"
problems
.
In
this
way
,
the
number
of
parse
actions
is
bounded
by
3n
,
where
n
is
the
number
of
words
in
a
sentence
.
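The two-pointer procedure described above can be sketched as follows. This is an illustrative simplification rather than the authors' implementation: the classifier is replaced by a hypothetical gold-standard oracle, the wait-left action is omitted, and the meaning assigned to the Left and Right actions (Left: the next word becomes a child of the focus word; Right: the focus word becomes a child of the next word) is an assumption following the usual reading of Yamada and Matsumoto (2003).

```python
from typing import Callable, Dict


def make_gold_oracle(gold: Dict[int, int]) -> Callable:
    """Build an action chooser from gold heads (word index -> head index)."""
    def oracle(focus: int, nxt: int, heads: Dict[int, int]) -> str:
        def complete(word: int) -> bool:
            # A word may be reduced only after all of its children are attached.
            return all(c in heads for c, h in gold.items() if h == word)
        if gold[nxt] == focus and complete(nxt):
            return "L"
        if gold[focus] == nxt and complete(focus):
            return "R"
        return "S"
    return oracle


def parse_sentence(n: int, choose_action: Callable,
                   step_back: bool = True) -> Dict[int, int]:
    """Parse words 1..n and return a word -> head map (0 marks a root)."""
    unparsed = list(range(1, n + 1))
    heads: Dict[int, int] = {}
    changed = True
    while changed and len(unparsed) > 1:
        changed = False
        i = 0
        while i < len(unparsed) - 1:
            focus, nxt = unparsed[i], unparsed[i + 1]
            action = choose_action(focus, nxt, heads)
            if action == "S":
                i += 1
                continue
            if action == "L":      # next becomes a child of focus
                heads[nxt] = focus
                del unparsed[i + 1]
            else:                  # "R": focus becomes a child of next
                heads[focus] = nxt
                del unparsed[i]
            changed = True
            if step_back:          # re-examine the previous pair
                i = max(i - 1, 0)
    for word in unparsed:          # words without a parent become roots
        heads[word] = 0
    return heads


gold = {1: 2, 2: 0, 3: 2, 4: 3}
parsed = parse_sentence(4, make_gold_oracle(gold))
```

Setting `step_back=False` gives the "stay" behaviour, where the pointer remains in place after a reduction and missed pairs are picked up in a later pass.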
Now
we
compare
the
adopted
parsing
algorithm
in
this
year
to
the
one
we
employed
last
year
(
Wu
et
al.
,
2006a
)
.
The
common
characteristics
are
:
3
.
linearly
scaled
4
.
deterministic
and
projective
In
contrast
,
their
parse
actions
are
quite
different
,
so
the
two
methods
require
different
numbers
of
iterations
and
have
different
run
times
.
The
main
reason
is
that
the
step-back
might
trace
back
to
previous
words
,
which
can
be
viewed
as
popping
the
top
words
on
the
stack
back
to
the
unparsed
strings
,
while
Nivre
's
method
does
not
trace
back
any
two
words
in
the
stack
.
In
other
words
,
if
a
word
is
pushed
into
the
stack
,
it
will
no
longer
be
compared
with
the
other
deeper
words
inside
the
stack
.
Hence
some
of
the
non-root
words
in
the
stack
remain
to
be
parsed
.
A
simple
solution
is
to
adopt
an
exhaustive
post-processing
step
for
the
unparsed
words
in
the
stack
(
details
in
(
Wu
et
al.
,
2006a
,
2006b
)
)
.
An
advantage
of
the
step-back
is
that
it
can
trace
back
to
the
unparsed
words
in
the
stack
.
In
theory
,
however
,
it
still
requires
more
parse
actions
than
Nivre
's
algorithm
(
3n
vs.
2n
)
.
By
adopting
the
projectivized
en
/
de-coding
over
the
modified
Yamada
's
algorithm
,
we
can
treat
the
words
that
do
not
have
a
parent
as
roots
.
Thus
,
for
some
languages
(
e.g.
Czech
and
Arabic
)
,
the
multiple
root
problem
can
be
easily
solved
.
This
year
,
we
separate
the
parse
action
and
the
relation
label
into
two
stages
as
opposed
to
having
one
pass
last
year
.
In
this
way
,
we
can
simply
adopt
a
sequential
tagger
to
auto-assign
the
relation
labels
after
the
whole
sentence
is
parsed
.
2.4
Features
and
Learners
Unlike
last
year
,
we
separate
the
action
prediction
and
the
label
recognition
into
two
stages
so
that
one
learner
can
provide
more
information
to
the
other
.
The
used
features
of
the
two
learners
are
quite
similar
and
listed
as
follows
:
Enhanced feature type: bigram and BiPOS for the focus and next words; previous two parse actions
For label recognition: label tag to its head; label tags for the previous two words
In
this
paper
,
we
replicate
and
modify
the
modified
finite
Newton
support
vector
machines
(
MFN-SVM
)
(
Keerthi
and
DeCoste
,
2005
)
as
the
learner
.
The
MFN-SVM
is
a
very
efficient
SVM
optimization
method
which
linearly
scales
with
the
number
of
training
examples
.
Usually
,
the
trained
models
from
MFN-SVM
are
too
large
to
be
processed
in
practice
.
We
therefore
defined
the
positive
lower
bound
(
$10^{-10}$
)
and
the
negative
upper
bound
(
$-10^{-10}$
)
to
eliminate
values
that
tend
to
be
zero
.
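The thresholding step can be illustrated with a short sketch. It assumes, for illustration only, that a trained linear model is stored as a sparse mapping from feature names to weights; the function name `prune_weights` is hypothetical.

```python
from typing import Dict


def prune_weights(weights: Dict[str, float],
                  bound: float = 1e-10) -> Dict[str, float]:
    """Drop weights lying strictly between -bound and +bound.

    Weights at or above the positive lower bound, or at or below the
    negative upper bound, are kept; the rest are treated as zero.
    """
    return {name: w for name, w in weights.items()
            if w >= bound or w <= -bound}


model = {"pos=NN": 0.5, "word=the": 1e-12, "bigram=a_b": -3e-11, "pos=VB": -0.2}
pruned = prune_weights(model)
```

Because a linear model scores an example with a dot product, removing near-zero weights changes each score by a negligible amount while shrinking the stored model.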
However
,
the
SVM
is
a
binary
classifier
which
only
recognizes
true
or
false
.
For
multiclass
problems
,
we
use
the
so-called
one-versus-all
(
OVA
)
method
with
linear
kernel
to
combine
the
results
of
each
individual
classifier
.
The
final
class
in
testing
phase
is
mainly
determined
by
selecting
the
maximum
similarity
.
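A minimal sketch of one-versus-all prediction with linear classifiers follows. The data layout (one sparse weight vector per class, no bias term) and the name `ova_predict` are assumptions for illustration, and the "similarity" is taken here to be the linear decision value.

```python
from typing import Dict


def ova_predict(models: Dict[str, Dict[str, float]],
                features: Dict[str, float]) -> str:
    """Return the class whose binary linear classifier scores highest."""
    def score(weights: Dict[str, float]) -> float:
        # Sparse dot product between the weight vector and the feature vector.
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return max(models, key=lambda label: score(models[label]))


# Hypothetical weights for the three basic parse actions.
models = {
    "S": {"focus_pos=IN": 0.2, "next_pos=NNS": -0.4},
    "L": {"focus_pos=IN": 0.9, "next_pos=NNS": 0.7},
    "R": {"focus_pos=IN": -0.5, "next_pos=NNS": 0.1},
}
action = ova_predict(models, {"focus_pos=IN": 1.0, "next_pos=NNS": 1.0})
```

Here the scores are -0.2 for S, 1.6 for L, and -0.4 for R, so the Left action is selected.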
For
all
languages
,
our
parser
uses
the
same
settings
and
features
.
For
all
the
languages
(
except
for
Basque
and
Turkish
)
,
we
use
backward
parsing
direction
to
keep
the
un-parsed
token
rate
low
.
3
Experimental
Result
3.1
Dataset
and
Evaluation
Metrics
The
testing
data
is
provided
by
the
CoNLL-2007
shared
task
(
Nivre
et
al.
,
2007
)
which
consists
of
10
language
treebanks
.
More
detailed
descriptions
of
the
dataset
can
be
found
at
the
web
site1
.
The
experimental
results
are
mainly
evaluated
by
the
unlabeled
and
labeled
attachment
scores
.
CoNLL
also
provided
a
Perl
script
to
automatically
compute
these
rates
.
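For reference, the two metrics can be computed as in the sketch below. This is not the official evaluation script; details such as how punctuation tokens are treated are not modeled, and the function name `attachment_scores` is hypothetical.

```python
from typing import List, Tuple


def attachment_scores(gold: List[Tuple[int, str]],
                      pred: List[Tuple[int, str]]) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent for aligned (head, label) sequences."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    # UAS counts tokens with the correct head.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens with both the correct head and the correct label.
    correct_both = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_both / total


gold = [(2, "NMOD"), (0, "ROOT"), (2, "NMOD"), (3, "PMOD")]
pred = [(2, "NMOD"), (0, "ROOT"), (2, "AMOD"), (2, "PMOD")]
uas, las = attachment_scores(gold, pred)
```

In this example three of four heads are correct and two of four tokens have both head and label correct, giving a UAS of 75.0 and an LAS of 50.0.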
Table
1
presents
the
overall
parsing
performance
of
the
10
languages
.
As
shown
in
Table
1
,
we
list
two
parsing
results
at
column
B
and
column
C
(
new
and
old
)
.
It
is
worth
noting
that
the
result
B
is
produced
by
training
the
neighbor
parser
with
full
labels
instead
of
the
three
categories
,
left
/
right
/
out-of-neighbor
.
A
is
the
officially
provided
parse
results
.
Some
of
the
parsing
results
in
A
did
not
include
the
enhanced
feature
type
and
neighbor
/
root
parses
due
to
the
time
limitation
.
For
the
domain
adaptation
task
,
we
directly
use
the
trained
English
model
to
classify
the
PChemtb
and
CHILDES
corpora
without
further
adjustment
.
In
addition
,
we
also
apply
the
Maltparser
0.4
,
which
is
implemented
with
the
Nivre
's
algorithm
(
Nivre
et
al.
,
2006
)
for
comparison
.
The
Maltparser
also
includes
the
SVM
and
memory-based
learner
(
MBL
)
.
Nevertheless
,
the
training
time
complexity
of
the
SVM
in
Maltparser
is
not
linear
time
as
MFN-SVM
.
Therefore
we
use
the
default
MBL
and
feature
model
3
(
M3
)
in
this
experiment
.
To
make
a
fair
comparison
,
the
input
training
data
was
also
projectivized
through
the
same
pseudo-projective
encoding
/
decoding
methods
.
1 http://nextens.uvt.nl/depparse-wiki/SharedTaskWebsite
Table 1: A general statistical table of labeled attachment score, test, and un-parsed rate (percentage)
[Table body not recoverable. Surviving column headers: Language; (Official); (Corrected); Statistic test; Un-Parsed Rate. Surviving row labels: Hungarian; pchemtb closed *; CHILDES_closed.]
*
The
CHILDES
data
does
not
contain
the
relation
tag
,
instead
,
the
unlabeled
attachment
score
is
listed
*
*
The
original
submission
of
the
pchemtbclosed
task
can
not
pass
through
the
evaluator
and
hence
is
not
the
official
score
.
After
correcting
the
format
problems
,
the
actual
LAS
score
should
be
55.31
.
To
perform
the
significance
test
,
we
evaluate
the
statistical
difference
among
the
three
results
.
If
the
answer
is
"
Yes
"
,
it
means
the
two
systems
differ
significantly
at
a
confidence
level
of
at
least
95
%
(
p
<
0.05
)
.
The
final
column
of
the
Table
1
lists
the
non-root
words
unparsed
rate
of
the
modified
Yamada
's
method
and
the
Nivre
's
parsing
model
which
we
employed
last
year
.
Among
the
10
languages
,
the
modified
Yamada
's
method
outperforms
our
old
method
in
five
languages
,
while
it
fails
to
win
in
three
languages
.
We
did
not
report
the
comparative
study
between
the
forward
parsing
and
backward
parsing
directions
here
since
only
the
two
languages
(
Basque
and
Turkish
)
were
better
in
performing
forward
direction
.
4
Discussion
Now
we
turn
to
discuss
the
improvement
of
the
use
of
the
neighbor
parse
and
root
parse
.
All
of
the
experiments
were
conducted
by
additional
runs
where
we
removed
the
neighbor
and
root
parse
outputs
from
the
feature
set
.
In
this
experiment
,
we
report
four
representative
languages
that
tend
to
achieve
the
best
and
worst
improvements
.
Table
2
lists
the
comparative
study
of
the
four
languages
.
As
listed
in
Table
2
,
both
English
and
Chinese
got
substantial
benefit
from
the
use
of
the
two
parsers
.
As
observed
by
(
Isozaki
et
al.
,
2004
)
,
incorporating
both
top-down
(
root
find
)
and
bottom-up
(
base-NP
)
can
yield
better
improvement
over
the
Yamada
's
parsing
algorithm
.
Thus
,
instead
of
pre-determining
the
root
and
base-phrase
structures
,
the
tagging
results
of
the
neighbor
and
root
parsers
were
included
as
new
features
to
add
wider
information
for
the
shift-reduce
parser
.
It
is
also
interesting
to
link
neighbors
and
determine
the
root
before
parsing
.
We
plan
to
compare
it
with
out
method
in
the
future
.
Table 2: The effect of the Neighbor/Root Parser on the four selected languages
[Table body not recoverable. Surviving column headers: With N/R Parser; Without.]
On
the
other
hand
,
we
also
found
that
2
out
of
the
10
languages
had
been
negatively
affected
by
the
neighbor
and
root
parsers
.
In
Basque
,
their
effect
was
marginally
negative
,
and
in
Turkish
,
the
two
parsers
decreased
the
performance
of
the
original
parsing
model
.
We
further
observed
that
the
main
cause
is
the
weak
performance
of
the
neighbor
parser
.
In
Turkish
,
the
recall
/
precision
rates
of
the
neighbor
dependence
are
92.61
/
93.12
when
the
neighbor
parse
outputs
are
included
,
while
the
modified
Yamada
's
method
alone
achieved
93.71
/
93.51
.
We
can
expect
that
the
result
could
achieve
higher
LAS
score
when
the
neighbor
parser
is
improved
.
As
mentioned
in
section
2.1
,
2.2
,
the
selected
features
for
the
two
parsers
are
unified
for
the
10
languages
.
It
is
not
surprising
that
for
certain
data
the
fixed
feature
set
might
perform
even
worse
than
the
original
shift-reduce
parser
.
A
better
way
is
to
validate
the
features
with
variant
settings
for
different
languages
.
We
leave
the
feature
engineering
task
as
future
work
.
5
Conclusion
and
Future
Remarks
Multilingual
dependency
parsing
aims
at
a
general
framework
for
dependency
parsing
algorithms
.
This
paper
presents
and
analyzes
the
impact
of
two
preprocessing
components
,
namely
,
neighbor
parsing
and
root-parsing
.
Those
two
parsers
provide
very
useful
additional
features
for
downstream
shift-reduce
parser
.
The
experimental
results
also
demonstrated
that
the
use
of
the
two
components
did
improve
results
for
the
selected
languages
.
In
the
error-analysis
,
we
also
observed
that
for
some
languages
,
parameter
tuning
and
feature
selection
are
very
important
for
system
performance
.
In
the
future
,
we
plan
to
report
the
actual
performance
when
replacing
the
MFN-SVM
by
the
polynomial
kernel
SVM
.
In
our
pilot
study
,
the
use
of
approximate-polynomial
kernel
(
Wu
et
al.
,
2007
)
outperforms
the
linear
kernel
SVM
in
Chinese
and
Arabic
.
Also
,
we
are
investigating
how
to
convert
the
shift-reduce
parser
into
approximate
N-best
parser
efficiently
.
In
this
way
,
the
parse
reranking
algorithm
can
be
adopted
to
further
improve
the
performance
.
