We
describe
our
experiments
using
the
DeSR
parser
in
the
multilingual
and
domain
adaptation
tracks
of
the
CoNLL
2007
shared
task
.
DeSR
implements
an
incremental
deterministic
Shift
/
Reduce
parsing
algorithm
,
using
specific
rules
to
handle
non-projective
dependencies
.
For
the
multilingual
track
we
adopted
a
second
order
averaged
perceptron
and
performed
feature
selection
to
tune
a
feature
model
for
each
language
.
For
the
domain
adaptation
track
we
applied
a
tree
revision
method
which
learns
how
to
correct
the
mistakes
made
by
the
base
parser
on
the
adaptation
domain
.
1
Introduction
Classifier-based
dependency
parsers
(
Yamada
and
Matsumoto
,
2003
;
Nivre
and
Scholz
,
2004
)
learn
from
an
annotated
corpus
how
to
select
an
appropriate
sequence
of
Shift
/
Reduce
actions
to
construct
the
dependency
tree
for
a
sentence
.
Learning
is
based
on
techniques
such
as
SVM
(
Vapnik
1998
)
or
Memory
Based
Learning
(
Daelemans
2003
)
,
which
provide
high
accuracy
but
are
often
computationally
expensive
.
For
the
multilingual
track
in
the
CoNLL
2007
Shared
Task
,
we
employed
a
Shift
/
Reduce
parser
which
uses
a
perceptron
algorithm
with
second-order
feature
maps
,
in
order
to
verify
whether
a
simpler
and
faster
algorithm
can
still
achieve
comparable
accuracy
.
For
the
domain
adaptation
track
we
wished
to
explore
the
use
of
tree
revisions
in
order
to
incorporate
language
knowledge
from
a
new
domain
.
2
Multilingual
Track
The
overall
parsing
algorithm
is
a
deterministic
classifier-based
statistical
parser
,
which
extends
the
approach
by
Yamada
and
Matsumoto
(
2003
)
,
by
using
different
reduction
rules
that
ensure
deterministic
incremental
processing
of
the
input
sentence
and
by
adding
specific
rules
for
handling
non-projective
dependencies
.
The
parser
also
performs
dependency
labeling
within
a
single
processing
step
.
The
parser
is
modular
and
can
use
several
learning
algorithms
.
The
submitted
runs
used
a
second
order
Average
Perceptron
,
derived
from
the
multiclass
perceptron
of
Crammer
and
Singer
(
2003
)
.
No
additional
resources
were
used
.
No
preprocessing
or
post-processing
was
used
,
except
stemming
for
English
,
by
means
of
the
Snowball
stemmer
(
Porter
2001
)
.
3
Deterministic
Classifier-based
Parsing
DeSR
(
Attardi
,
2006
)
is
an
incremental
deterministic
classifier-based
parser
.
The
parser
constructs
dependency
trees
employing
a
deterministic
bottom-up
algorithm
which
performs
Shift
/
Reduce
actions
while
analyzing
input
sentences
in
left-to-right
order
.
Rightd
Leftd
quadruple
(
S
,
I
,
T
,
A
)
,
where
S
is
the
stack
of
past
tokens
,
I
is
the
list
of
(
remaining
)
input
tokens
,
T
is
a
stack
of
temporary
tokens
and
A
is
the
arc
relation
for
the
dependency
graph
.
Given
an
input
string
W
,
the
parser
is
initialized
to
(
(
)
,
W
,
(
)
,
(
)
)
,
and
terminates
when
it
reaches
a
configuration
(
S
,
(
)
,
(
)
,
A
)
.
The
three
basic
parsing
rule
schemas
are
as
follows
:
(
s
\
S
,
n
\
I
,
T
,
A
)
(
S
,
s
\
I
,
T
,
Au
{
(
n
,
d
,
s
)
}
)
The
schemas
for
the
Left
and
Right
rules
are
instantiated
for
each
dependency
type
d
e
D
,
for
a
total
of
2
\
D
\
+
1
rules
.
These
rules
perform
both
attachment
and
labeling
.
At
each
step
the
parser
uses
classifiers
trained
on
a
treebank
corpus
in
order
to
predict
which
action
to
perform
and
which
dependency
label
to
assign
given
the
current
configuration
.
4
Non-Projective
Relations
For
handling
non-projective
relations
,
Nivre
and
Nilsson
(
2005
)
suggested
applying
a
preprocessing
step
to
a
dependency
parser
,
which
consists
in
lifting
non-projective
arcs
to
their
head
repeatedly
,
until
the
tree
becomes
pseudo-projective
.
A
post-processing
step
is
then
required
to
restore
the
arcs
to
the
proper
heads
.
In
DeSR
non-projective
dependencies
are
handled
in
a
single
step
by
means
of
the
following
additional
parsing
rules
,
slightly
different
from
those
Right2d
Left2d
Right3d
Left3d
Extract
Insert
Left2
,
Right2
are
similar
to
Left
and
Right
,
except
that
they
create
links
crossing
one
intermediate
node
,
while
Left3
and
Right3
cross
two
intermediate
nodes
.
Notice
that
the
RightX
actions
put
back
on
the
input
the
intervening
tokens
,
allowing
the
parser
to
complete
the
linking
of
tokens
whose
processing
had
been
delayed
.
Extract
/
Insert
generalize
the
previous
rules
by
moving
one
token
to
the
stack
T
and
reinserting
the
top
of
T
into
S.
5
Perceptron
Learning
and
2nd-Order
Feature
Maps
The
software
architecture
of
the
DeSR
parser
is
modular
.
Several
learning
algorithms
are
available
,
including
SVM
,
Maximum
Entropy
,
Memory-Based
Learning
,
Logistic
Regression
and
a
few
variants
of
the
perceptron
algorithm
.
We
obtained
the
best
accuracy
with
a
multiclass
averaged
perceptron
classifier
based
on
the
ultraconservative
formulation
of
Crammer
and
Singer
(
2003
)
with
uniform
negative
updates
.
The
classifier
function
is
:
F
(
x
)
=
argmax
{
ak
■
x
}
where
each
parsing
action
k
is
associated
with
a
weight
vector
ak
.
To
regularize
the
model
the
final
weight
vectors
are
computed
as
the
average
of
all
weight
vectors
posited
during
training
.
The
number
of
learning
iterations
over
the
training
data
,
which
is
the
only
adjustable
parameter
of
the
algorithm
,
was
determined
by
cross-validation
.
In
order
to
overcome
the
limitations
of
a
linear
perceptron
,
we
introduce
a
feature
map
O
:
]
R
—
]
Rd
(
d+1
)
/
2
that
maps
a
feature
vector
x
into
a
higher
dimensional
feature
space
consisting
of
all
unordered
feature
pairs
:
In
other
words
we
expand
the
original
representation
in
the
input
space
with
a
feature
map
that
generates
all
second-order
feature
combinations
from
each
observation
.
We
call
this
the
2nd-order
model
,
where
the
inner
products
are
computed
as
ak
•
O
(
x
)
,
with
ak
a
vector
of
dimension
d
(
d+1
)
/
2
.
Applying
a
linear
perceptron
to
this
feature
space
corresponds
to
simulating
a
polynomial
kernel
of
degree
two
.
A
polynomial
kernel
of
degree
two
for
SVM
was
also
used
by
Yamada
and
Matsumoto
(
2003
)
.
However
,
training
SVMs
on
large
data
sets
like
those
arising
from
a
big
training
corpus
was
too
computationally
expensive
,
forcing
them
to
resort
to
partitioning
the
training
data
(
by
POS
)
and
to
learn
several
models
.
Our
implementation
of
the
perceptron
algorithm
uses
sparse
data
structures
(
hash
maps
)
so
that
it
can
handle
efficiently
even
large
feature
spaces
in
a
single
model
.
For
example
the
feature
space
for
the
2nd-order
model
for
English
contains
over
21
million
.
Parsing
unseen
data
can
be
performed
at
tens
of
sentences
per
second
.
More
details
on
such
aspects
of
the
DeSR
parser
can
be
found
in
(
Ciaramita
and
Attardi
2007
)
.
The
base
parser
was
tuned
on
several
parameters
to
optimize
its
accuracy
as
follows
.
6.1
Feature
Selection
Given
the
different
characteristics
of
languages
and
corpus
annotations
,
it
is
worth
while
to
select
a
different
set
of
features
for
each
language
.
For
example
,
certain
corpora
do
not
contain
lemmas
or
morphological
information
so
lexical
information
will
be
useful
.
Vice
versa
,
when
lemmas
are
present
,
lexical
information
might
be
avoided
,
reducing
the
size
of
the
feature
set
.
We
performed
a
series
of
feature
selection
experiments
on
each
language
,
starting
from
a
fairly
comprehensive
set
of
43
features
and
trying
all
variants
obtained
by
dropping
a
single
feature
.
The
best
of
these
alternatives
feature
models
was
chosen
and
the
process
iterated
until
no
further
gains
were
achieved
.
The
score
for
the
alternatives
was
computed
on
a
development
set
of
approximately
5000
tokens
,
extracted
from
a
split
of
the
original
training
corpus
.
Despite
the
process
is
not
guaranteed
to
produce
a
global
optimum
,
we
noticed
LAS
improvements
of
up
to
4
percentage
points
on
some
languages
.
The
set
of
features
to
be
used
by
DeSR
is
controlled
by
a
number
of
parameters
supplied
through
a
parameter
file
.
Each
parameter
describes
a
feature
and
from
which
tokens
to
extract
it
.
Tokens
are
referred
through
positive
numbers
for
input
tokens
and
negative
numbers
for
tokens
on
the
stack
.
For
example
PosFeatures
-2
-1
0
1
2
3
means
to
use
the
POS
tag
of
the
first
two
tokens
on
the
stack
and
of
the
first
four
tokens
on
the
input
.
The
parameter
PosPrev
refers
to
the
POS
of
the
preceding
token
in
the
original
sentence
,
PosLeftChild
refers
to
the
POS
of
the
left
children
of
a
token
,
PastActions
tells
how
many
previous
actions
to
include
as
features
.
The
selection
process
was
started
from
the
following
base
feature
model
:
LexFeatures
Mo
rphoFeatures
DepFeatures
PastActions
The
selection
process
produced
different
variants
for
each
language
,
sometimes
suggesting
dropping
certain
intermediate
features
,
like
the
lemma
of
the
third
next
input
token
in
the
case
of
Catalan
:
LemmaFeatures
LemmaPrev
LemmaSucc
LemmaLeftChild
LemmaRightChild
PosFeatures
PosLeftChild
PosRightChild
CPosFeatures
MorphoFeatures
DepLeftChild
DepRightChild
For
Italian
,
instead
,
we
ran
a
series
of
tests
in
parallel
using
a
set
of
manually
prepared
feature
models
.
The
best
of
these
models
achieved
a
LAS
of
80.95
%
.
The
final
run
used
this
model
with
the
addition
of
the
morphological
agreement
feature
discussed
below
.
English
was
the
only
language
for
which
no
feature
selection
was
done
and
for
which
lexical
features
Hungarian
Table
1
.
Multilingual
track
official
scores
.
were
used
.
English
is
also
the
language
where
the
official
score
is
significantly
lower
than
what
we
had
been
getting
on
our
development
set
(
90.01
%
UAS
)
.
6.2
Prepositional
Attachment
Certain
languages
,
such
as
Catalan
,
use
detailed
dependency
labeling
,
that
for
instance
distinguish
between
adverbials
of
location
and
time
.
We
exploited
this
information
by
introducing
a
feature
that
captures
the
entity
type
of
a
child
of
the
top
word
on
the
stack
or
in
the
input
.
During
training
a
list
of
nouns
occurring
in
the
corpus
as
dependent
on
prepositions
with
label
CCL
(
meaning
'
complement
of
location
'
for
Catalan
)
was
created
and
similarly
for
CCT
(
complement
of
time
)
.
The
entity
type
TIME
is
extracted
as
a
feature
depending
on
whether
the
noun
occurs
in
the
time
list
more
than
a
times
than
in
the
location
list
,
and
similarly
for
the
feature
LOCATION
.
a
was
set
to
1.5
in
our
experiments
.
6.3
Morphological
Agreement
Certain
languages
require
gender
and
number
agreement
between
head
and
dependent
.
The
feature
MorphoAgreement
is
computed
for
such
languages
and
provided
noticeable
accuracy
improvements
.
For
example
,
for
Italian
,
the
improvement
was
from
:
7
Accuracy
Table
1
reports
the
accuracy
scores
in
the
multilingual
track
.
They
are
all
considerably
above
the
average
and
within
2
%
from
the
best
for
Catalan
,
3
%
for
Chinese
,
Greek
,
Italian
and
Turkish
.
8
Performance
The
experiments
were
performed
on
a
2.4
Ghz
9
Error
Analysis
on
Catalan
The
parser
achieved
its
best
score
on
Catalan
,
so
we
performed
an
analysis
on
its
output
for
this
language
.
Among
the
42
dependency
relations
that
the
parser
had
to
assign
to
a
sentence
,
the
largest
number
of
errors
occurred
assigning
CC
(
124
)
,
SP
(
33
)
,
CD
(
27
)
,
SUJ
(
26
)
,
CONJUNCT
(
22
)
,
SN
(
23
)
.
The
submitted
run
for
Catalan
did
not
use
the
entity
feature
discussed
earlier
and
indeed
67
errors
were
due
to
assigning
CCT
or
CCL
instead
of
CC
(
generic
complement
of
circumstance
)
.
However
over
half
of
these
appear
as
underspecified
annotation
errors
in
the
corpus
rather
than
parser
errors
.
By
adding
the
Chi
ldEntityType
feature
,
which
distinguishes
better
between
CCT
and
CCL
,
the
UAS
improved
,
while
the
LAS
dropped
slightly
,
due
to
the
effect
of
underspecified
annotations
in
the
corpus
:
A
peculiar
aspect
of
the
original
Catalan
corpus
was
the
use
of
a
large
number
(
195
)
of
dependency
labels
.
These
labels
were
reduced
to
42
in
the
version
used
for
CoNNL
2007
,
in
order
to
make
it
comparable
to
other
corpora
.
However
,
performing
some
preliminary
experiments
using
the
original
Catalan
collection
with
all
195
dependency
labels
,
the
DeSR
parser
achieved
a
significantly
better
score
:
This
suggests
that
accuracy
might
improve
for
other
languages
as
well
if
the
training
corpus
was
labeled
with
more
precise
dependencies
.
10
Adaptation
Track
The
adaptation
track
originally
covered
two
domains
,
the
CHILDES
and
the
Chemistry
domain
.
The
CHILDES
(
Brown
,
1973
;
MacWhinney
,
2000
)
consists
of
transcriptions
of
dialogues
with
children
,
typically
short
sentences
of
the
kind
:
Would
you
like
more
grape
juice
?
That
'
s
a
nice
box
of
books
.
Phrases
are
short
,
half
of
them
are
questions
.
The
only
difficulty
that
appeared
from
looking
at
the
unlabeled
collection
supplied
for
training
in
the
domain
was
the
presence
of
truncated
terms
like
goin
(
for
going
)
,
d
(
for
did
)
,
etc.
However
none
of
these
unusually
spelled
words
appeared
in
the
test
set
,
so
a
normal
English
parser
performed
reasonably
well
on
this
task
.
Because
of
certain
inconsistencies
in
the
annotation
guidelines
,
the
organizers
decided
to
make
this
task
optional
and
hence
we
submitted
just
the
parse
produced
by
the
parser
trained
for
English
.
For
the
second
adaptation
task
we
were
given
a
large
collection
of
unlabeled
data
in
the
chemistry
domain
(
Kulick
et
al
,
2004
)
as
well
as
a
test
set
of
5000
tokens
(
200
sentences
)
to
parse
(
eng-7ish_pchemtbtb_test
.
conll
)
.
There
were
three
sets
of
unlabeled
documents
:
we
chose
the
smallest
(
un7ab1
)
consisting
of
over
300,000
tokens
(
11663
sentences
)
.
unlab1
was
tokenized
,
POS
and
lemmas
were
added
using
our
version
of
TreeTagger
(
Schmid
,
1994
)
,
and
lemmas
replaced
with
stems
,
which
had
turned
out
to
be
more
effective
than
lemmas
.
We
call
this
set
pchemtb_un
7
ab1
.
con
77
.
(
2007
)
.
We
added
stems
and
produced
a
parser
called
DeSRwsj
.
By
parsing
eng
-
7
ish_pchem_test
.
con7
7
with
DeSRwsj
we
obtained
pchemtb_test_base.desr
,
our
baseline
for
the
task
.
By
visual
inspection
using
DgAnnotator
(
DgAnnotator
,
2006
)
,
the
parses
looked
generally
correct
.
Most
of
the
errors
seemed
due
to
improper
handling
of
conjunctions
and
disjunctions
.
The
collection
in
fact
contains
several
phrases
like
:
the
activation
in
liver
microsomes
from
rats
pretreated
with
PB
,
BNF
,
INH
and
DEX
respectively
The
parser
did
not
seem
to
have
much
of
a
problem
with
terminology
,
possibly
because
the
supplied
gold
POS
were
adequate
.
For
the
adaptation
we
proceeded
as
follows
.
We
parsed
pchemtb_un7ab1.con7
7
using
DeSRwsj
obtaining
pchemtb_un7ab1
.
desr
.
We
then
extracted
a
set
of
12,500
sentences
from
ptb_train.con77
and
7,500
sentences
from
pchemtb_un7ab1.desr
,
creating
a
corpus
of
20,000
sentences
called
combined.con7
7
.
In
both
cases
the
selection
criteria
was
to
choose
sentences
shorter
than
30
tokens
.
We
then
trained
a
low
accuracy
parser
(
called
DesrCombined
)
on
combined.con7
7
,
by
using
a
1st-order
averaged
perceptron
.
DesrCombined
was
used
to
parse
eng7ish_ptb_train.con77
,
the
original
training
corpus
for
English
.
By
comparing
this
parse
with
the
original
,
one
can
detect
where
such
parser
makes
mistakes
.
The
rationale
for
using
an
inaccurate
parser
is
to
obtain
parses
with
many
errors
so
that
they
form
a
suitably
large
training
set
for
the
next
step
:
parser
revision
.
We
then
used
a
parsing
revision
technique
(
At-tardi
and
Ciaramita
,
2007
)
to
learn
how
to
correct
these
errors
,
producing
a
parse
reviser
called
DesrReviser
.
The
revision
technique
consists
of
comparing
the
parse
trees
produced
by
the
parser
with
the
gold
standard
parse
trees
,
from
the
annotated
corpus
.
Where
a
difference
is
noted
,
a
revision
rule
is
determined
to
correct
the
mistake
.
Such
rules
consist
in
movements
of
a
single
link
to
a
different
head
.
Learning
how
to
revise
a
parse
tree
consists
in
training
a
classifier
on
a
set
of
training
examples
consisting
of
pairs
(
(
wi
,
d
,
wj
)
,
t1
)
,
i.e.
the
link
to
be
modified
and
the
transformation
rule
to
apply
.
Attardi
and
Ciaramita
(
2007
)
showed
that
80
%
of
the
corrections
can
be
typically
dealt
with
just
20
tree
revision
rules
.
For
the
adaptation
track
we
limited
the
training
to
errors
recurring
at
least
20
times
and
to
30
rules
.
DesrReviser
was
then
applied
to
pchemtb_test_base.desr
producing
pchemtb_test_rev.desr
,
our
final
submission
.
Many
conjunction
errors
were
corrected
,
in
particular
by
moving
the
head
of
the
sentence
from
a
coordinate
verb
to
the
conjunction
'
and
'
linking
two
coordinate
phrases
.
The
revision
step
produced
an
improvement
of
0.42
%
LAS
over
the
score
achieved
by
using
just
the
base
DeSRwsj
parser
.
Table
2
reports
the
official
accuracy
scores
on
the
closed
adaptation
track
.
DeSR
achieved
a
close
second
best
UAS
on
the
ptchemtb
test
set
and
third
best
on
CHILDES
.
The
results
are
quite
encouraging
,
particularly
considering
that
the
revision
step
does
not
yet
correct
the
dependency
labels
and
that
our
base
English
parser
had
a
lower
rank
in
the
multilingual
track
.
Table
2
.
Closed
adaptation
track
scores
.
Notice
that
the
adaptation
process
could
be
iterated
.
Since
the
combination
DeSRwsj+DesrReviser
is
a
more
accurate
parser
than
DeSRwsj
,
we
could
use
it
again
to
parse
pchemtb_un7ab1.con77
and
so
on
.
11
Conclusions
For
performing
multilingual
parsing
in
the
CoNLL
2007
shared
task
we
employed
DeSR
,
a
classifier-based
Shift
/
Reduce
parser
.
We
used
a
second
order
averaged
perceptron
as
classifier
and
achieved
accuracy
scores
quite
above
the
average
in
all
languages
.
For
proper
comparison
with
other
approaches
,
one
should
take
into
account
that
the
parser
is
incremental
and
deterministic
;
hence
it
is
typically
faster
than
other
non
linear
algorithms
.
For
the
adaptation
track
we
used
a
novel
approach
,
based
on
the
technique
of
tree
revision
,
applied
to
a
parser
trained
on
a
corpus
combining
sentences
from
both
the
training
and
the
adaptation
domain
.
The
technique
achieved
quite
promising
results
and
it
also
offers
the
interesting
possibility
of
being
iterated
,
allowing
the
parser
to
incorporate
language
knowledge
from
additional
domains
.
Since
the
technique
is
applicable
to
any
parser
,
we
plan
to
test
it
also
with
more
accurate
English
parsers
.
Acknowledgments
.
The
following
treebanks
were
used
for
training
the
parser
:
(
Aduriz
et
al.
,
2003
;
Bohmova
et
al.
,
2003
;
Chen
et
al.
,
2003
;
Ha-jic
et
al.
,
2004
;
Marcus
et
al.
,
1993
;
Marti
et
al.
,
2002
;
Montemagni
et
al.
2003
;
Oflazer
et
al.
,
2003
;
Prokopidis
et
al.
,
2005
;
Csendes
et
al.
,
2005
)
.
Ryan
McDonald
and
Jason
Baldridge
made
available
mstparser
and
helped
us
using
it
.
We
gratefully
acknowledge
Hugo
Zaragoza
and
Ricardo
Baeza-Yates
for
supporting
the
first
author
during
a
sabbatical
at
Yahoo
!
Research
Barcelona
.
